This repository contains files for data processing and continued pretraining to reproduce the paper "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code".
🤗 Hugging Face | 📑 Paper | 📖 Project Page
Our dataset, MathCode-Pile, is released at MathCode-Pile. Currently, only part of the dataset is released; you can generate the rest using the code in this repository. The full dataset will be released later.
The MathCoder2 models are as follows:
| Model Name | Huggingface Link |
|---|---|
| MathCoder2-Llama-3-8B | 🤗 link |
| MathCoder2-DeepSeekMath-7B | 🤗 link |
| MathCoder2-Mistral-7B | 🤗 link |
| MathCoder2-CodeLlama-7B | 🤗 link |
FastText Models:
| Model Name | Huggingface Link |
|---|---|
| fastText-cc-en-filter_round1 | 🤗 link |
| fastText-cc-en-filter_round2 | 🤗 link |
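These classifiers are used to filter math-related documents from Common Crawl. A minimal usage sketch with the official `fasttext` Python package is shown below; the model file name, the positive label name, and the probability threshold are assumptions here, so check the repository's filtering scripts for the exact values used.

```python
import fasttext

# Load one of the released classifiers (path is illustrative; download the
# .bin file from the Hugging Face links above first).
model = fasttext.load_model("fastText-cc-en-filter_round1.bin")

def is_math_related(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier labels the document as math-related."""
    # fastText expects a single line of input, so collapse newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    # Positive label name and threshold are assumptions, not the repo's exact settings.
    return labels[0] == "__label__positive" and probs[0] >= threshold

# Example: score a Common Crawl document.
doc = "The derivative of x^2 is 2x, which follows from the power rule."
print(is_math_related(doc))
```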
Although utilizing existing open-source code during pretraining can enhance the mathematical reasoning abilities of LLMs, such code often lacks accompanying natural language explanations or context, which may hinder the model's ability to fully understand it. In this paper, we propose a novel method for generating large amounts of mathematical code accompanied by corresponding natural language reasoning steps, which are extracted from math-related pretraining texts. Unlike existing math-related code, our generated code is paired with natural language reasoning steps, making it more comprehensible. Moreover, because the code is generated from math-related texts, it is all highly relevant to mathematical reasoning. When used in pretraining, mathematical code paired with reasoning steps helps LLMs understand math-related pretraining texts, as it effectively captures the underlying reasoning process. Furthermore, this data enhances the model's potential to be finetuned for tool-integrated reasoning (TIR).
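To make the idea concrete, here is a hypothetical example of such a pair (not taken from the released data): a pretraining document might derive the hypotenuse of a right triangle with legs 3 and 4, and the generated training text pairs that reasoning with a short snippet that reproduces the result.

```python
# Reasoning: the hypotenuse of a right triangle with legs 3 and 4 is
# sqrt(3^2 + 4^2) = sqrt(25) = 5 (Pythagorean theorem).
import math

a, b = 3, 4
hypotenuse = math.sqrt(a**2 + b**2)
print(hypotenuse)  # 5.0, matching the result stated in the source text
```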
Our data processing pipeline consists of two key steps: (1) carefully curating a robust basic dataset for pretraining, and (2) generating paired reasoning steps and mathematical code by extracting LaTeX expressions and their context, translating the extracted information into Python code snippets, executing the generated code snippets, and verifying their correctness.
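A minimal sketch of the execution-and-verification step in (2) is given below. The helper names and the exact comparison rule (string equality of stdout against the result stated in the source text) are assumptions; the repository's scripts implement the actual logic.

```python
import subprocess
import tempfile

def run_snippet(code: str, timeout: int = 10) -> str | None:
    """Execute a generated Python snippet and return its stdout, or None on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

def keep_snippet(code: str, expected_output: str) -> bool:
    """Keep only snippets that run successfully and reproduce the expected result."""
    output = run_snippet(code)
    return output is not None and output == expected_output

# Example: verify a generated snippet against the result stated in the source text.
snippet = "import math\nprint(math.sqrt(3**2 + 4**2))"
print(keep_snippet(snippet, "5.0"))  # True
```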
The results of MathCoder2 are presented below.
The documentation for generating each part of the MathCode-Pile dataset is as follows:
- filtered-OpenWebMath
- filtered-CC-En-math
- synthetic data
- code using math packages
- mathematical textbooks
- translated mathematical code
The documentation for decontamination is at: decontamination.
The documentation for training is at: training.
The documentation for testing is at: evaluation.
If you find this repository helpful, please consider citing our papers:
@misc{lu2024mathcoder2bettermathreasoning,
title={MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code},
author={Zimu Lu and Aojun Zhou and Ke Wang and Houxing Ren and Weikang Shi and Junting Pan and Mingjie Zhan and Hongsheng Li},
year={2024},
eprint={2410.08196},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.08196},
}
@inproceedings{
wang2024mathcoder,
title={MathCoder: Seamless Code Integration in {LLM}s for Enhanced Mathematical Reasoning},
author={Ke Wang and Houxing Ren and Aojun Zhou and Zimu Lu and Sichun Luo and Weikang Shi and Renrui Zhang and Linqi Song and Mingjie Zhan and Hongsheng Li},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=z8TW0ttBPp}
}