We provide each preprocessed dataset in this directory, along with some of the raw datasets used in the data augmentation step. Datasets that are omitted can be downloaded from their respective sites.
- Math
  - AQuA: https://huggingface.co/datasets/aqua_rat
  - GSM8K: https://huggingface.co/datasets/gsm8k
  - MATH: https://github.com/hendrycks/math
- Code
  - CoNaLa: https://huggingface.co/datasets/neulab/conala
  - MBPP: https://huggingface.co/datasets/mbpp
  - DrRepair: https://github.com/michiyasunaga/DrRepair
  - DeepMind CodeContests: https://github.com/deepmind/code_contests
- ShareGPT: lm-sys/FastChat#90
How to make the training dataset for our model:
- Extract the Alpaca and ShareGPT datasets:
  ```
  tar -zxvf alpaca/alpaca.tar.gz
  tar -zxvf sharegpt/sharegpt.tar.gz
  ```
- Merge all the preprocessed datasets into one:
  ```
  python merge_data.py
  ```
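The merge step above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `merge_data.py`: it assumes each preprocessed dataset is a JSON file containing a list of records, and the real script, file names, and record schema may differ.

```python
import json
import os
import tempfile

def merge_datasets(paths, out_path):
    """Concatenate several JSON-list dataset files into one training file.

    Assumes each input file holds a JSON array of records; the record
    schema used here (instruction/output) is illustrative only.
    """
    merged = []
    for p in paths:
        with open(p) as f:
            merged.extend(json.load(f))  # append this dataset's records
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)
    return merged

# Demo with two toy files standing in for the extracted datasets.
tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "alpaca.json")
b = os.path.join(tmp, "sharegpt.json")
with open(a, "w") as f:
    json.dump([{"instruction": "add 1+1", "output": "2"}], f)
with open(b, "w") as f:
    json.dump([{"instruction": "say hi", "output": "hi"}], f)

records = merge_datasets([a, b], os.path.join(tmp, "train.json"))
print(len(records))  # 2
```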