Researchers at MIT have developed an AI model that can solve problems used in university-level mathematics courses. The system uses the OpenAI Codex engine to generate programs that output the problem solution, including graphs and plots, achieving an accuracy of 81% on the MATH benchmark dataset as well as on real problems from MIT courses.
The team described their work in a paper published in the Proceedings of the National Academy of Sciences (PNAS). The researchers found that for approximately 70% of problems, simply adding prompts to the problem description and then feeding the combined text into Codex would generate a program that produced the correct answer. A "few-shot" learning scheme, in which similar problems were fed to the model for context, could solve an additional 10% of problems. The model is also able to generate mathematics problems which human evaluators judged to be on par with problems created by humans. According to the MIT team:
The success of this work confirms that programs serve as a good representation and computation environment for solving math problems. Since our approach requires no additional training, it is easily scalable. This work addresses significant pedagogical challenges, bringing substantial benefits to higher education like curriculum design and analysis tools and automatic content generation.
Large pre-trained language models such as GPT-3 and Google's PaLM have shown some "zero-shot" capabilities in mathematics, particularly around arithmetic and question-answering. Until recently, however, according to Berkeley's Dan Hendrycks, these models usually achieve only about a 5% accuracy on problem-solving benchmarks. Earlier this year, InfoQ covered Google's Minerva, which uses a mathematics-specific dataset to fine-tune a generic PaLM language model. Minerva can generate answers that include text as well as LaTeX markup for equations, and achieved an average score of 50.3% on the MATH benchmark.
Instead of using a language model to directly generate a solution, the MIT researchers chose to use OpenAI's Codex model to generate computer programs whose output is the solution, which can include numeric values, equations, and even graphs. For most problems, simply prepending the string "write a program" and placing the problem text within Python triple quotes is sufficient to prompt Codex to generate the correct program.
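This zero-shot prompting step can be sketched roughly as follows. The exact wording and layout of the team's prompt are assumptions here; the article states only that "write a program" is prepended and the problem text is wrapped in triple quotes.

```python
def make_codex_prompt(problem_text: str) -> str:
    """Build a zero-shot prompt for Codex: prepend the instruction
    and wrap the problem statement in Python triple quotes.

    The precise format is an assumption based on the article's
    description, not the paper's verbatim prompt.
    """
    return f'write a program\n"""\n{problem_text}\n"""'


prompt = make_codex_prompt("Find the derivative of x**2 + 3*x.")
# The resulting string would then be sent to the Codex completion API,
# which is expected to respond with a runnable Python program.
```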
For cases where simple prompting does not work, the researchers developed a few-shot learning workflow. First, an embedding is calculated for every problem in the dataset. Then, the five solved problems most similar to the unsolved one are used, along with their solution code, as example inputs to the model. This method brings the overall accuracy to 81%.
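The nearest-neighbor selection step can be sketched as below. The embedding model and the use of cosine similarity are assumptions for illustration; the article does not specify how similarity is measured.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(target: list[float],
                  solved_embeddings: list[list[float]],
                  k: int = 5) -> list[int]:
    """Return indices of the k solved problems whose embeddings are
    most similar to the unsolved problem's embedding."""
    ranked = sorted(range(len(solved_embeddings)),
                    key=lambda i: cosine(target, solved_embeddings[i]),
                    reverse=True)
    return ranked[:k]
```

The selected problems and their solution code would then be concatenated ahead of the unsolved problem to form the few-shot prompt.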
The model can also generate new problems. Several questions from the dataset are concatenated as a numbered list and used as a prompt, and Codex responds with a generated question as the next item in the list. To evaluate the quality of the generated problems, the researchers surveyed students who had taken the relevant mathematics courses at MIT. The students ranked the generated problems as "similar in difficulty" to human-created ones, although they ranked the human-created problems as "slightly more appropriate" for the MIT courses.
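The numbered-list prompt for question generation can be sketched as follows; the exact list formatting is an assumption, since the article describes only the general scheme.

```python
def make_generation_prompt(questions: list[str]) -> str:
    """Concatenate existing questions as a numbered list, ending with
    the next item number so the model completes it with a new question.

    The precise formatting is illustrative, not the paper's verbatim prompt.
    """
    lines = [f"{i + 1}. {q}" for i, q in enumerate(questions)]
    lines.append(f"{len(questions) + 1}.")
    return "\n".join(lines)


prompt = make_generation_prompt([
    "Compute the integral of sin(x) from 0 to pi.",
    "Find the eigenvalues of [[2, 0], [0, 3]].",
])
# Codex's completion of item 3 would be taken as the generated problem.
```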
The MIT code as well as a dataset of problems and resulting answers are available on GitHub.