This repository contains files for data processing and continued pretraining to reproduce the paper "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code".
🤗 Hugging Face | 📑 Paper | 📖 Project Page
Our dataset, MathCode-Pile, is released at MathCode-Pile. Currently, only part of the dataset is released; you can generate the rest using the code in this repository. The full dataset will be released later.
The MathCoder2 models are as follows:
| Model Name | Huggingface Link |
|---|---|
| MathCoder2-Llama-3-8B | 🤗 link |
| MathCoder2-DeepSeekMath-7B | 🤗 link |
| MathCoder2-Mistral-7B | 🤗 link |
| MathCoder2-CodeLlama-7B | 🤗 link |
FastText Models:
| Model Name | Huggingface Link |
|---|---|
| fastText-cc-en-filter_round1 | 🤗 link |
| fastText-cc-en-filter_round2 | 🤗 link |
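These classifiers are used to filter math-related documents from Common Crawl. A minimal usage sketch with the official `fasttext` Python package is shown below; the model file name, the positive label name, and the probability threshold are assumptions here, so check the repository's filtering scripts for the exact values used.

```python
import fasttext

# Load one of the released classifiers (path is illustrative; download the
# .bin file from the Hugging Face links above first).
model = fasttext.load_model("fastText-cc-en-filter_round1.bin")

def is_math_related(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier labels the document as math-related."""
    # fastText expects a single line of input, so collapse newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    # Positive label name and threshold are assumptions, not the repo's exact settings.
    return labels[0] == "__label__positive" and probs[0] >= threshold

# Example: score a Common Crawl document.
doc = "The derivative of x^2 is 2x, which follows from the power rule."
print(is_math_related(doc))
```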
Although utilizing existing open-source code during pretraining can enhance the mathematical reasoning abilities of LLMs, such code often lacks accompanying natural language explanations or context, which may hinder the model's ability to fully understand it. In this paper, we propose a novel method for generating large amounts of mathematical code accompanied by corresponding natural language reasoning steps, which are extracted from math-related pretraining texts. Unlike existing math-related code, our generated code is paired with natural language reasoning steps, making it more comprehensible. Moreover, because the code is generated from math-related texts, it is all highly relevant to mathematical reasoning. When used in pretraining, mathematical code paired with reasoning steps helps LLMs understand math-related pretraining texts, as it effectively captures the underlying reasoning process. Furthermore, this data enhances the model's potential to be finetuned for tool-integrated reasoning (TIR).
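To make the idea concrete, here is a hypothetical example of such a pair (not taken from the released data): a pretraining document might derive the hypotenuse of a right triangle with legs 3 and 4, and the generated training text pairs that reasoning with a short snippet that reproduces the result.

```python
# Reasoning: the hypotenuse of a right triangle with legs 3 and 4 is
# sqrt(3^2 + 4^2) = sqrt(25) = 5 (Pythagorean theorem).
import math

a, b = 3, 4
hypotenuse = math.sqrt(a**2 + b**2)
print(hypotenuse)  # 5.0, matching the result stated in the source text
```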
Our data processing pipeline consists of two key steps: (1) carefully curating a robust basic dataset for pretraining, and (2) generating paired reasoning steps and mathematical code by extracting LaTeX expressions and their context, translating the extracted information into Python code snippets, executing the generated code snippets, and verifying their correctness.
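A minimal sketch of the execution-and-verification step in (2) is given below. The helper names and the exact comparison rule (string equality of stdout against the result stated in the source text) are assumptions; the repository's scripts implement the actual logic.

```python
import subprocess
import tempfile

def run_snippet(code: str, timeout: int = 10) -> str | None:
    """Execute a generated Python snippet and return its stdout, or None on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

def keep_snippet(code: str, expected_output: str) -> bool:
    """Keep only snippets that run successfully and reproduce the expected result."""
    output = run_snippet(code)
    return output is not None and output == expected_output

# Example: verify a generated snippet against the result stated in the source text.
snippet = "import math\nprint(math.sqrt(3**2 + 4**2))"
print(keep_snippet(snippet, "5.0"))  # True
```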
The results of MathCoder2 are presented below.
The documentation for generating each part of the MathCode-Pile dataset is as follows:
- filtered-OpenWebMath
- filtered-CC-En-math
- synthetic data
- code using math packages
- mathematical textbooks
- translated mathematical code
The documentation for decontamination is at: decontamination.
The documentation for training is at: training.
The documentation for testing is at: evaluation.
If you find this repository helpful, please consider citing our papers:
@misc{lu2024mathcoder2bettermathreasoning,
title={MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code},
author={Zimu Lu and Aojun Zhou and Ke Wang and Houxing Ren and Weikang Shi and Junting Pan and Mingjie Zhan and Hongsheng Li},
year={2024},
eprint={2410.08196},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.08196},
}
@inproceedings{
wang2024mathcoder,
title={MathCoder: Seamless Code Integration in {LLM}s for Enhanced Mathematical Reasoning},
author={Ke Wang and Houxing Ren and Aojun Zhou and Zimu Lu and Sichun Luo and Weikang Shi and Renrui Zhang and Linqi Song and Mingjie Zhan and Hongsheng Li},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=z8TW0ttBPp}
}