MathGAP: An Evaluation Benchmark for LLMs’ Mathematical Reasoning Using Controlled Proof Depth, Width, and Complexity for Out-of-Distribution Tasks
Machine learning research has made considerable progress in evaluating the mathematical reasoning abilities of large language models (LLMs), especially in handling complex...