【Introduction】 On a 32-question advanced mathematics test, LLMs performed excellently, averaging 90.4 on a 100-point scale, and ChatGPT 4o and Mistral AI were nearly flawless. Vector calculations, geometric analysis, integral calculations, optimization problems, and other advanced topics were handled with ease by the top models. The research also found that re-prompting is crucial for improving accuracy.
Friends may leave you, and brothers may betray you.
But mathematics will not betray you; if you don't get mathematics, you just don't get it.
Those who struggle with advanced mathematics can deeply relate to the above meme.
It's as if mathematics simply refuses to be learned: no matter how eloquent or how athletic you are, what can you do when faced with calculus?
So, are large language models (LLMs) also lopsided students, acing some subjects while failing math?
The latest research evaluated the performance of AI models in advanced mathematics using 32 test questions worth 320 points in total, covering four major topics: vector calculation, geometric analysis, integral calculation, and optimization problems.
Overall, the results show that LLMs' performance in advanced mathematics is quite good, with an average score of 90.4 (on a 100-point scale):
- ChatGPT 4o and Mistral AI performed stably and accurately across different types of math problems, demonstrating strong mathematical reasoning ability and reliability.
- Gemini Advanced (1.5 Pro) and Meta AI were relatively weaker on certain integral and optimization problems, revealing areas that need targeted improvement.
A total of 7 AI models participated in the test: ChatGPT 4o, Gemini Advanced (1.5 Pro), Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. Among them, ChatGPT 4o and Mistral AI performed excellently, tying for first place.
In addition, the research found that re-prompting is crucial for improving accuracy.
In some cases, the model answered incorrectly on the first attempt, but was able to correct the answer after being re-prompted, indicating that improving the interaction method can enhance the model's problem-solving performance.
The new research offers an important reference for educators, researchers, and developers choosing LLMs for mathematics education and practical applications, and it also provides key insights for the further optimization and development of LLM technology.
Paper link: https://arxiv.org/abs/2503.03960
What surprises can LLMs bring to calculus?
Calculus, with its complex concepts and rigorous problem-solving methods, is an ideal field to test the limits of LLM capabilities.
Solving calculus problems requires not only computational accuracy but also a deep understanding of mathematical principles, logical reasoning ability, and the capacity to apply theoretical concepts to practical problems.
The new research selected problems covering multiple important topics in calculus, including vector analysis, geometric interpretation, integral calculation, and optimization problems.
By evaluating the performance of these models in the problem-solving process, the aim is to identify their strengths, weaknesses, and areas for improvement, thereby driving the development of more powerful and reliable LLM technology.
As educational institutions and industries increasingly explore the application of AI technology, it is crucial to deeply understand the capabilities and limitations of LLM in handling complex mathematical problems.
The analysis results of the new research are valuable for multiple groups, including educators developing AI-assisted learning tools, researchers committed to enhancing LLM capabilities, and practitioners hoping to deploy these technologies in practical applications.
In addition, this study also responds to the growing demand for systematic evaluation of AI models in professional fields.
Through a carefully designed set of test questions and a detailed scoring system, this study provides a methodological framework for evaluating the performance of LLM in solving mathematical problems.
Furthermore, the study introduced a re-prompting mechanism and conducted in-depth analysis of error patterns to explore the learning capabilities of the models and potential strategies to improve their accuracy and reliability. These research results help to better understand the strengths and limitations of LLM in mathematical reasoning, and provide valuable references for future optimization.
Research Methodology
Large language models (LLMs) are diverse in their architectures and training methods, each with its own specialties:
ChatGPT 4o is known for its advanced natural language understanding and generation capabilities;
Gemini Advanced (1.5 Pro) is designed for high-performance language tasks;
Copilot Pro focuses on programming and mathematical problem-solving;
Claude 3.5 Sonnet emphasizes accurate and context-aware text generation;
Meta AI aims to provide multi-functional language understanding and generation;
Mistral AI is renowned for its efficient and precise language processing capabilities;
Perplexity is designed specifically for complex problem-solving and reasoning tasks.
Now the question is: how well do these models perform in advanced mathematics?
This evaluation involved 32 test questions, totaling 320 points.
If the model provided the correct answer on the first attempt, it received 10 points; if it found the correct answer on the second attempt, after a re-prompt, it received 5 points; otherwise it received no points for that question.
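As a minimal sketch, the scoring scheme can be written out as follows; the per-question results below are illustrative stand-ins, not the paper's actual data.

```python
def score_question(correct_first_try: bool, correct_after_reprompt: bool) -> int:
    """10 points for a correct first attempt, 5 after one re-prompt, else 0."""
    if correct_first_try:
        return 10
    if correct_after_reprompt:
        return 5
    return 0

# Illustrative run: a model that solves 30 questions outright,
# 1 more after a re-prompt, and misses 1 entirely.
results = [(True, False)] * 30 + [(False, True), (False, False)]
total = sum(score_question(first, retry) for first, retry in results)
print(total, f"({total / 320:.1%} of 320)")  # 305 (95.3% of 320)
```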
The test questions covered multiple calculus topics, including: vector calculation and geometric interpretation, integral calculation and its applications, optimization problems and constrained optimization, differential equations and their applications, as well as advanced calculus concepts (such as Green's theorem, line integrals, etc.).
The evaluation of the models was primarily based on two core criteria:
- Accuracy: whether the model's answer is correct.
- Step-by-Step Explanation: whether the model can provide clear and correct problem-solving steps.
To further test the models' error correction capabilities, this study introduced a re-prompting mechanism.
If the model's initial answer was incorrect, it would be re-prompted to solve the problem, and the corrected answer would be evaluated. This mechanism helps to more comprehensively analyze the model's problem-solving abilities and its capacity to learn from mistakes and correct its answers.
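As a rough sketch of this protocol (the paper does not publish its harness, so `ask_model` and `is_correct` here are hypothetical stand-ins for whatever LLM interface and answer checker an evaluator actually uses):

```python
def evaluate(question, ask_model, is_correct):
    """Score one question: 10 for a correct first attempt,
    5 after a single re-prompt, 0 otherwise."""
    answer = ask_model(question)
    if is_correct(question, answer):
        return 10
    # Re-prompt: point out the error and ask for a fresh, fully worked solution.
    retry = ask_model(question + "\nYour previous answer was incorrect. "
                                 "Please solve the problem again, showing all steps.")
    return 5 if is_correct(question, retry) else 0
```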
Test Results
Overall, the LLMs averaged 90.4 (on a 100-point scale), a relatively strong showing. Among them, ChatGPT 4o and Mistral AI each scored 310 out of 320, tying for first place.
Models like ChatGPT 4o and Mistral AI have demonstrated high accuracy and precision, while other models struggled more with certain types of problems.
For example, on the following vector decomposition problem, all models correctly calculated the projection of one vector onto another and the orthogonal component, indicating high accuracy and stability in handling vector decomposition:
Find the projection of the vector u = 3i − 5j + 2k onto the vector v = 7i + j − 2k, and the component of u orthogonal to v, showing all steps.
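As a quick numerical check of this problem (our own computation, not part of the paper's materials), the projection formula proj_v(u) = (u·v / v·v) v gives (14/9, 2/9, −4/9), with orthogonal component (13/9, −47/9, 22/9):

```python
import numpy as np

u = np.array([3.0, -5.0, 2.0])
v = np.array([7.0, 1.0, -2.0])

proj = (u @ v) / (v @ v) * v   # proj_v(u) = (u.v / v.v) v, with u.v / v.v = 12/54 = 2/9
orth = u - proj                # component of u orthogonal to v

print(proj)                      # [ 1.556  0.222 -0.444] = (14/9, 2/9, -4/9)
print(orth)                      # [ 1.444 -5.222  2.444] = (13/9, -47/9, 22/9)
print(np.isclose(orth @ v, 0))   # True: the residual really is orthogonal to v
```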
However, the models still exhibit clear differences in their problem-solving capabilities for specific questions.
For example, on the following problem of finding an orthogonal vector, only Claude 3.5 Sonnet initially answered incorrectly, but it corrected the error after a re-prompt.
Find a unit vector that is orthogonal to the vectors u = ⟨4, −3, 1⟩ and v = ⟨2, 5, 3⟩, showing all steps.
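The standard route is the cross product, which is orthogonal to both inputs; a quick verification (our own computation, for illustration) looks like this:

```python
import numpy as np

u = np.array([4.0, -3.0, 1.0])
v = np.array([2.0, 5.0, 3.0])

n = np.cross(u, v)             # (-14, -10, 26), orthogonal to both u and v
unit = n / np.linalg.norm(n)   # normalize: |n| = sqrt(972) = 18*sqrt(3)

print(unit)                                              # ~[-0.449 -0.321  0.834]
print(np.isclose(unit @ u, 0), np.isclose(unit @ v, 0))  # True True
```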
In the field of optimization, Google's Gemini Advanced (1.5 Pro) failed outright: even after being told its answer was wrong, it did not correct it, getting the problem wrong twice in a row and exposing a specific weakness in optimization problems.
Find the relative extrema and saddle points of the function f(x, y) = -5x^2 + 4xy - y^2 + 16x + 10, providing all steps.
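For reference, a quick sympy check (our own computation, not taken from the paper) finds a single critical point at (8, 16), where the second-derivative test gives a relative maximum of 74 and no saddle points:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = -5*x**2 + 4*x*y - y**2 + 16*x + 10

fx, fy = sp.diff(f, x), sp.diff(f, y)
critical = sp.solve([fx, fy], [x, y])      # {x: 8, y: 16}, the only critical point

fxx, fyy, fxy = sp.diff(f, x, 2), sp.diff(f, y, 2), sp.diff(fx, y)
D = fxx*fyy - fxy**2                       # discriminant of the second-derivative test

# D = 4 > 0 and fxx = -10 < 0, so (8, 16) is a relative maximum with f(8, 16) = 74.
print(critical, D, f.subs(critical))
```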
Meta AI answered an integral problem incorrectly, while ChatGPT 4o, once re-prompted, almost never made a mistake.
Overall, the large language models show differences in their performance on calculus tests.
For the specific results on the remaining 20+ questions, please refer to the original paper.
Result Analysis
The analysis of the performance of LLMs on calculus tests reveals several key insights and trends, which are crucial for understanding their capabilities and limitations in solving mathematical problems.
ChatGPT 4o and Mistral AI tied for first place with a score of 96.9%, showing the best performance.
ChatGPT 4o performed excellently across a wide range of problem types, demonstrating its strong mathematical reasoning abilities. Mistral AI particularly excelled in vector calculus and multivariable calculus. Gemini Advanced, Claude 3.5 Sonnet, and Meta AI had the same performance, scoring 87.5%.
Advantages of LLMs
Stability on simple problems: ChatGPT 4o and Mistral AI exhibited consistent accuracy in solving basic problems (such as vector calculations, geometric interpretations, and basic differentiation), indicating their robustness and reliability in handling fundamental calculus concepts.
Effectiveness of re-prompting: In multiple tests, some models initially gave incorrect answers, but successfully corrected them after re-prompting. This suggests that iterative questioning and feedback mechanisms can effectively improve the models' performance.
High accuracy in specific domains: On problems involving direction cosines, partial derivatives, and line integrals, all models provided correct answers, indicating their strong consensus and understanding of these calculus topics.
Limitations of LLMs
Complex integral calculations: The models generally performed poorly in handling complex integrals (such as iterated integrals, triple integrals, and area calculations under curves), indicating that their integral solving capabilities still need improvement.
Optimization problems: Some models, particularly Gemini Advanced (1.5 Pro), showed weaker performance in solving optimization problems, especially in identifying relative extrema and saddle points, suggesting that their optimization techniques require further enhancement.
Persistent errors: Certain models repeatedly made mistakes on specific problems. For example, Meta AI had significant difficulties with integral calculations, while Gemini Advanced (1.5 Pro) performed poorly in gradient computations. These persistent errors suggest that their algorithms may need further optimization.
Importance of Re-prompting
The study emphasizes the crucial role of re-prompting mechanisms in improving the accuracy of problem-solving.
Multiple models successfully corrected their initial incorrect answers through re-prompting.
This indicates that iterative questioning and feedback can significantly enhance the models' problem-solving abilities; for complex problems in particular, where the initial error rate is higher, re-prompting improves final accuracy.
Implications for LLM Development
This study provides detailed analyses of the performance of various models, offering valuable insights for the continuous optimization of LLM technology.
The results reveal the strengths and weaknesses of current LLMs, providing a clear roadmap for future targeted improvements, particularly in the areas of complex integral calculations, optimization problem-solving, and the precision of gradient computations.
If developers can optimize these weaknesses, it will help to enhance the overall performance and reliability of LLMs in solving mathematical problems.
The findings of this study are of significant importance to educators, researchers, and developers, especially in the context of mathematics education and practical applications:
High-performing models (such as ChatGPT 4o and Mistral AI): They have demonstrated strong mathematical problem-solving capabilities, making them reliable mathematical assistance tools that can be applied in the education sector.
Limitations of other models: The study points out areas for improvement, providing a reference for the further optimization of LLM technology. In the future, as LLMs continue to progress in the field of mathematics, they are expected to become more powerful and reliable tools for mathematical education and problem-solving, playing a crucial role in teaching, research, and industrial applications.
References:
https://arxiv.org/abs/2503.03960
This article is from the WeChat public account "New Intelligence", edited by KingHZ, and published with authorization from 36Kr.