MathCoder2

This repository contains files for data processing and continued pretraining to reproduce the paper "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code".


🤗 Hugging Face | 📑 Paper | 📖 Project Page

Dataset and Models

Our dataset, MathCode-Pile, is released at MathCode-Pile. Currently only part of the dataset is released; you can generate the rest using the code in this repository. The full dataset will be released later.

The MathCoder2 models are as follows:

Model Name                 | Huggingface Link
---------------------------|-----------------
MathCoder2-Llama-3-8B      | 🤗 link
MathCoder2-DeepSeekMath-7B | 🤗 link
MathCoder2-Mistral-7B      | 🤗 link
MathCoder2-CodeLlama-7B    | 🤗 link

FastText Models:

Model Name                   | Huggingface Link
-----------------------------|-----------------
fastText-cc-en-filter_round1 | 🤗 link
fastText-cc-en-filter_round2 | 🤗 link
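
These classifiers are used to filter math-related web documents (two rounds of filtering, as their names suggest). The snippet below is only a rough sketch of how such a fastText model can be applied with the `fasttext` Python package; the model filename and the label name are assumptions, so check the released model's labels with `model.get_labels()`.

```python
# Minimal sketch (not the repository's actual filtering script) of scoring
# documents with a fastText classifier like the ones released above.
import fasttext

model = fasttext.load_model("fastText-cc-en-filter_round1.bin")  # assumed filename

def math_score(document: str) -> float:
    """Return the classifier's probability that a document is math-related."""
    # fastText's predict() rejects newlines, so collapse the text to one line.
    text = " ".join(document.split())
    labels, probs = model.predict(text, k=1)
    if labels[0] == "__label__positive":   # assumed positive-class label
        return float(probs[0])
    return 1.0 - float(probs[0])

# Keep documents whose score clears a threshold (the cutoff actually used for
# MathCode-Pile is described in the paper, not hard-coded here).
docs = ["The derivative of x^2 is 2x.", "Celebrity gossip of the week."]
kept = [d for d in docs if math_score(d) > 0.5]
```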

Introduction

Although utilizing existing open-source code in the pretraining phase can enhance the mathematical reasoning abilities of LLMs, such code often lacks accompanying natural language explanations or context, which may hinder the model's ability to fully understand it. In this paper, we propose a novel method for generating large amounts of mathematical code accompanied by corresponding natural language reasoning steps, which are extracted from math-related pretraining texts. Unlike existing math-related code, our generated code is paired with natural language reasoning steps, making it more comprehensible. Moreover, because the code is generated from math-related texts, it is all highly relevant to mathematical reasoning. When used in pretraining, mathematical code paired with reasoning steps helps LLMs understand math-related pretraining texts, as it effectively captures the underlying reasoning process. Furthermore, this data enhances the model's potential to be finetuned for tool-integrated reasoning (TIR).
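
To make the idea of paired data concrete, here is a small, purely hypothetical illustration (not an actual MathCode-Pile record) of a natural language reasoning step and a Python snippet that carries it out:

```python
# Hypothetical illustration, not an actual MathCode-Pile record: a natural
# language reasoning step paired with Python code that carries it out.
#
# Reasoning step: "For x^2 + bx + 9 to have a double root, its discriminant
# b^2 - 36 must be zero, so b = ±6."
from sympy import symbols, solve

b = symbols("b")
discriminant = b**2 - 4 * 1 * 9      # discriminant of x^2 + b*x + 9
solutions = solve(discriminant, b)
assert sorted(solutions) == [-6, 6]  # executing the code checks the step
print(solutions)
```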

Our data processing pipeline consists of two key steps: (1) carefully curating a robust basic dataset for pretraining, and (2) generating paired reasoning steps and mathematical code by extracting LaTeX expressions and their context, translating the extracted information into Python code snippets, executing the generated code snippets, and verifying their correctness.
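
As a rough sketch of the execution-and-verification idea (not the exact logic of this repository's pipeline), a generated snippet can be run in a subprocess and kept only if it exits cleanly within a time limit:

```python
# Minimal sketch of the "execute and verify" step, assuming generated snippets
# end with an assertion or print that can be checked.
import os
import subprocess
import sys
import tempfile

def runs_cleanly(code: str, timeout: float = 10.0) -> bool:
    """Return True if the generated snippet executes without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

snippet = "from sympy import sqrt\nassert sqrt(16) == 4\n"
print(runs_cleanly(snippet))  # True -> keep the paired text + code example
```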


Results

The results of MathCoder2 are presented below.


Data Processing

The documentation for generating each part of the MathCode-Pile dataset is as follows:

The documentation for decontamination is at: decontamination.
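
For orientation, decontamination of this kind typically removes pretraining documents that share long n-grams with benchmark test questions. The sketch below only illustrates the idea; the exact n-gram size and matching rule used for MathCode-Pile are specified in the decontamination documentation.

```python
# Illustrative n-gram decontamination sketch; parameters are placeholders.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, benchmark_ngrams: set, n: int = 13) -> bool:
    """Flag a document if it shares any n-gram with a benchmark test question."""
    return bool(ngrams(document, n) & benchmark_ngrams)

test_questions = ["What is the sum of the first 100 positive integers?"]
benchmark_ngrams = set().union(*(ngrams(q, 5) for q in test_questions))
print(is_contaminated("The sum of the first 100 positive integers is 5050.",
                      benchmark_ngrams, n=5))  # True -> drop the document
```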

Training

The documentation for training is at: training.

Testing

The documentation for testing is at: evaluation.

Citation

If you find this repository helpful, please consider citing our papers:

@misc{lu2024mathcoder2bettermathreasoning,
      title={MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code}, 
      author={Zimu Lu and Aojun Zhou and Ke Wang and Houxing Ren and Weikang Shi and Junting Pan and Mingjie Zhan and Hongsheng Li},
      year={2024},
      eprint={2410.08196},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.08196}, 
}
@inproceedings{wang2024mathcoder,
      title={MathCoder: Seamless Code Integration in {LLM}s for Enhanced Mathematical Reasoning},
      author={Ke Wang and Houxing Ren and Aojun Zhou and Zimu Lu and Sichun Luo and Weikang Shi and Renrui Zhang and Linqi Song and Mingjie Zhan and Hongsheng Li},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024},
      url={https://openreview.net/forum?id=z8TW0ttBPp}
}
