English | 简体中文
TourSynbio™ is an advanced protein language model that integrates knowledge from the protein domain. Built on InternLM2-Chat-7B, it is fine-tuned with the XTuner toolkit on the SFT (Supervised Fine-Tuning) dataset from ProteinLMBench. TourSynbio™ understands not only human language but also protein sequences, the language of life. It seamlessly bridges the gap between specialized protein data and general language, making complex data and information easier to understand and apply. Its powerful reasoning capabilities allow it to extract valuable insights from complex data, accelerating scientific discovery.
[2024.06.23] TourSynbio™ (SFT only) is now open source.
From OpenXLab

Refer to Download Model.

```shell
pip install openxlab
```

```python
from openxlab.model import download

download(model_repo=[model_link], model_name=[model_name], output='./')
```
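For reference, a fully filled-in call might look like the sketch below. The repository and weight names are placeholders, not the actual release identifiers; take the real values from the model card on OpenXLab.

```python
# Hypothetical example; replace the repository and weight names with the
# identifiers published on the project's OpenXLab model card.
from openxlab.model import download

download(
    model_repo='your-org/TourSynbio-7B',  # placeholder OpenXLab repository
    model_name='TourSynbio-7B',           # placeholder model/weight name
    output='./models',                    # local download directory
)
```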
- Get the project code from GitHub

```shell
git clone (ourlink)
python (start_file_name)
```
- Create and activate a virtual environment

```shell
conda env create -f environment.yml
conda activate (envName)
pip install -r requirements.txt
```
- Run the demo

```shell
streamlit run web_demo.py --server.address=0.0.0.0 --server.port=8501
```
- Introduction
XTuner supports fine-tuning large language models. For dataset preprocessing, please refer to the documentation; for fine-tuning, please refer to the documentation.
- Step 1: Format the data as single-round dialogues for XTuner, for example:
```json
[
  {
    "conversation": [
      { "system": "xxx", "input": "xxx", "output": "xxx" }
    ]
  },
  {
    "conversation": [
      { "system": "xxx", "input": "xxx", "output": "xxx" }
    ]
  }
]
```

A demo record:

```json
{
  "conversation": [
    {
      "system": "Please evaluate the following protein sequence and provide an explanation of the enzyme's catalytic activity, including the chemical reaction it facilitates: ",
      "input": "<seq> M P G R Q L T E L L T G L E E V K V Q T A M E Q K E M M I G G L T A D S R E V R P G D L F A A L P G A R V D G R D F I D Q A V G R G A D V V L A P V G T S L K D Y G R P V S L V T S D E P R R T L A Q M A A R F H G R Q P R T I A A V T G T S G K T S V A D F L R Q I W T L A D R K A A S L G T L G L I P A T A A S K A P P Y L T T P D P V A L H A C L K E V A E A G Y E H L A L E A S S H G L D Q Y R L D G L T F S A A A F T N L S Q D H L D Y H P D M E S Y L N A K A R L F G D L L P T G A T A V L N A D A P E F D R L A A L C E R R G I E V L S Y G L A G D D L R I V E A R A L P D G I A L S L R V K G Q D W Q G K L D L I G T F Q G H N V L A A L G L A L A T G L E P S V A L E A L P K L V G V P G R L Q R V A Q T V S G A Q V F V D Y A H K P G A L E A A L T A L R P H A E G R L I V V F G A G G D R D R G K R P L M G E I A T R L A D V V L V T D D N P R S E D P V A I R A E I L A A A P G A R E V S D R G G A I A A A L A E A D P G D L V L I A G K G H E T G Q I V G D K V L P F D D S E I A R R L A R G G Q V </seq>",
      "output": "By examining the input protein sequence, the enzyme catalyzes the subsequent chemical reaction: ATP + meso-2,6-diaminoheptanedioate + UDP-N-acetyl-alpha-D-muramoyl-L-alanyl-D-glutamate = ADP + H(+) + phosphate + UDP-N-acetyl-alpha-D-muramoyl-L-alanyl-gamma-D-glutamyl-meso-2,6-diaminoheptanedioate."
    }
  ]
}
```
- Step 2: Configure the XTuner config file.

XTuner provides multiple out-of-the-box configuration files, which users can list with the following command:

```shell
xtuner list-cfg
```

If none of the provided configuration files meets your requirements, export one and modify it accordingly:

```shell
xtuner copy-cfg ${CONFIG_NAME} ${SAVE_PATH}
vi ${SAVE_PATH}/${CONFIG_NAME}_copy.py
```
For this project, you can first copy the official internlm2-chat-7b config file, rename the copy to internlm2_7b_protein_lora.py, and make the necessary modifications, e.g.:

```python
custom_hooks = [
    dict(
        tokenizer=dict(
            padding_side='right',
            pretrained_model_name_or_path='/cpfs01/shared/gmai/xtuner_workspace/internlm/internlm2-7b/',  # PATH/TO/PRETRAINED MODELS
            trust_remote_code=True,
            type='transformers.AutoTokenizer.from_pretrained'),
        type='xtuner.engine.DatasetInfoHook'),
]
data_path = [
    '/cpfs01/shared/gmai/xtuner_workspace/protein_data/formated_ssl_data/sll_data_0.json',  # PATH/TO/DATA
    ...
]
...
model = dict(
    llm=dict(
        pretrained_model_name_or_path='/cpfs01/shared/gmai/xtuner_workspace/internlm/internlm2-7b/',  # PATH/TO/PRETRAINED MODELS
        torch_dtype='torch.float16',
        trust_remote_code=True,
        type='transformers.AutoModelForCausalLM.from_pretrained'),
    lora=dict(  # LoRA
        bias='none',
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        task_type='CAUSAL_LM',
        type='peft.LoraConfig'),
    type='xtuner.model.SupervisedFinetune')
...
```
The main changes are the pretrained model path, the data path, and the fine-tuning method (LoRA). Other hyperparameters can be adjusted as needed; here, we keep the defaults.
Note: Both the SFT and SSL stages involve modifying the config file in the same way. However, the `input` field in SSL data is left empty during data construction. For details on pre-training data construction, see the documentation.
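For illustration, an SSL (pre-training) record keeps the same schema as the SFT data but leaves `input` empty. The empty system prompt and the truncated sequence below are assumptions for the sketch; consult the documentation for the project's actual pre-training template.

```python
# Illustrative only: same schema as the SFT data, but with an empty "input".
ssl_record = {
    "conversation": [
        {
            "system": "",                                    # assumed empty for pre-training
            "input": "",                                     # SSL data leaves the input empty
            "output": "<seq> M P G R Q L T E L L ... </seq>",  # truncated example sequence
        }
    ]
}
```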
- Step 3: Start fine-tuning.

```shell
xtuner train internlm2_7b_protein_lora
```

For example, you can fine-tune InternLM2-Chat-7B on the protein dataset using the LoRA algorithm:

```shell
# Single GPU
xtuner train internlm2_7b_protein_lora --deepspeed deepspeed_zero2
# Multiple GPUs
(DIST) NPROC_PER_NODE=${GPU_NUM} xtuner train internlm2_7b_protein_lora --deepspeed deepspeed_zero2
(SLURM) srun ${SRUN_ARGS} xtuner train internlm2_7b_protein_lora --launcher slurm --deepspeed deepspeed_zero2
```
- `--deepspeed` indicates using DeepSpeed 🚀 to optimize the training process. XTuner ships with several strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. To disable this feature, simply remove this parameter.
- For more examples, please refer to the documentation.
- Step 4: Convert the saved PTH model (a directory, if DeepSpeed was used) to a HuggingFace model:

```shell
xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH} ${SAVE_PATH}
```
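After conversion, you can sanity-check the result by loading the adapter on top of the base model with transformers and peft. The sketch below assumes the LoRA setup shown in the config above; the paths and prompt are placeholders, and this is not the project's official inference script.

```python
# Minimal sanity-check sketch (assumes a LoRA adapter produced by
# `xtuner convert pth_to_hf`; paths below are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "internlm/internlm2-chat-7b"   # or the local base-model path used for fine-tuning
adapter_path = "./hf_adapter"              # ${SAVE_PATH} from the convert step

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

prompt = (
    "Please evaluate the following protein sequence and provide an explanation "
    "of the enzyme's catalytic activity: <seq> M P G R Q L T E L L ... </seq>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```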
This project is licensed under the Apache License 2.0.