Run:
python data_processing/basic_data/synthetic_data/download.py
This script generates download the files under data/Maths-College
, data/Education-College-Students
, and data/Matrix_math_science
.
Run:
python data_processing/basic_data/synthetic_data/get_filtered_math-related_by-rule.py
This detect mathematical expressions in the texts and remove those with no mathematical expressions.
Get the Docker from text-generation-inference, and run deploy.sh
in the Docker environment to start the server hosting Mixtral-8x7B-Instruct.
Run:
bash data_processing/basic_data/synthetic_data/process.sh
This file generates annotated files under data/Education-College-Students_filtered-by-rule_filtered-by-model
.
Run:
python data_processing/basic_data/synthetic_data/get_filtered_math-related.py
This generates filtered files under the directory data/synthetic_filtered
.
We observe that texts in Maths-College are mostly highly related to math, so we directly move the Maths-College files to data/synthetic_filtered:
mv data/Maths-College/* data/synthetic_filtered
Run:
python data_processing/basic_data/synthetic_data/get_train_test_files.py
Run:
bash data_processing/basic_data/synthetic_data/train.sh
Run:
bash data_processing/basic_data/synthetic_data/filter.sh
Run:
python data_processing/basic_data/synthetic_data/get_filtered_jsonl.py