PhenoGPT is a phenotype recognition model built on large language models. It is fine-tuned on the publicly available BiolarkGSC+ dataset to improve prediction accuracy and alignment with HPO terms. Like general-purpose GPT models, PhenoGPT can process diverse clinical texts, giving it flexibility across note types. For greater precision and specialization, you can further fine-tune PhenoGPT on your own clinical datasets; this process is described in the following section.
Llama 2 is the default model, as it performs best among the models we compared (e.g., GPT-J and Falcon).
PhenoGPT is distributed under the MIT License by Wang Genomics Lab.
Update in 2024: The latest development of PhenoGPT is PhenoGPT2. Compared to PhenoGPT, the main differences are:
1. We use Llama 3.1 as the base model for fine-tuning the HPO recognition model, with noticeable improvements in phenotype recognition accuracy as evaluated on several datasets.
2. We tokenize HPO IDs directly in the model (i.e., "HP:1234567" is treated as one token rather than several), which minimizes the HPO hallucination problem (see the sketch after this list).
3. We fine-tune on a larger and more comprehensive corpus, adding the ability to extract demographic data (such as sex, age, and ethnicity/race) from clinical notes.
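Point 2 can be reproduced with the standard transformers tokenizer API. Below is a minimal sketch, assuming a Llama 3.1 base checkpoint; the token list and model identifier are illustrative, not PhenoGPT2's actual configuration (in practice, the full ID list could be parsed from the HPO .obo file, e.g., with the fastobo package installed below):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Register each HPO ID as a single token so the model cannot "hallucinate"
# an ID by stitching together sub-word pieces. (Illustrative token list.)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed base model
hpo_ids = ["HP:0001250", "HP:0001249"]  # in practice, every ID in the HPO ontology
tokenizer.add_tokens(hpo_ids)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.resize_token_embeddings(len(tokenizer))  # extend the embedding matrix for the new tokens

# Each HPO ID now maps to exactly one token id:
print(tokenizer.tokenize("The patient shows HP:0001250."))
```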
We need to install the required packages for model fine-tuning and inference.
conda create -n llm_phenogpt python=3.11
conda activate llm_phenogpt
conda install pandas numpy scikit-learn matplotlib seaborn requests joblib
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
conda install nvidia/label/cuda-12.1.0::cuda-tools
conda install -c conda-forge jupyter
conda install intel-openmp blas
conda install mpi4py
pip install transformers datasets
pip install fastobo sentencepiece einops protobuf
pip install evaluate sacrebleu scipy accelerate deepspeed
pip install git+https://github.com/huggingface/peft.git
pip install flash-attn --no-build-isolation
pip install xformers
pip install bitsandbytes
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=llm_phenogpt
In the commands above, we use the accelerate package for model sharding, the PEFT package for parameter-efficient fine-tuning methods such as LoRA, and the bitsandbytes package for model quantization. If you encounter any issues at runtime, try pip uninstall followed by pip install for the affected package.
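To make the roles of these packages concrete, here is a minimal sketch that loads a base model in 4-bit and attaches LoRA adapters; it assumes the Llama 2 directory layout described below, and the LoRA hyperparameters are illustrative, not PhenoGPT's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# bitsandbytes: load the frozen base weights in 4-bit to reduce GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# accelerate (device_map="auto"): shard the model across the available devices
base_model = AutoModelForCausalLM.from_pretrained(
    "./model/llama2/llama2_base/",   # path convention from the Models section below
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# PEFT: train small low-rank adapter matrices (LoRA) instead of the full model
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()            # typically well under 1% trainable
```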
We need to install the required packages for the BioSent2Vec model, which converts detected medical terms to HPO IDs.
pip install nltk
conda install scipy
Please follow the steps in the BioSent2Vec tutorial and the related issue to install BioSent2Vec properly.
- Models:
  - To use the Llama 2 model, please apply for access first and download it to your local drive. Instruction
  - Save the model in ./model/llama2/llama2_base/
  - Download the updated fine-tuned LoRA weights from the release section on GitHub (latest version: v1.1.0)
  - Save the LoRA weights in ./model/llama2/
  - Setups for the Falcon 70B and Llama 1 7B models are similar.
- Input:
  - Input files should be plain-text (txt) files
  - The input argument can be either a single txt file or a directory containing all input txt files
  - Please see the input and output directories for reference
- BioSent2Vec:
  - To use the BioSent2Vec model, please see the BioSent2Vec tutorial above, then run the following steps (a loading sketch follows the commands):
mkdir ./BioSent2Vec/model
cd ./BioSent2Vec/model
wget https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin
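After the download finishes, the model can be loaded and queried as shown in the BioSentVec tutorial. A minimal sketch follows; the term/HPO pairing is illustrative, and in practice each detected phenotype would be scored against all candidate HPO term names:

```python
import sent2vec
from scipy.spatial import distance

# Load the pre-trained BioSentVec model downloaded above (the .bin file is ~21 GB)
model = sent2vec.Sent2vecModel()
model.load_model("./BioSent2Vec/model/BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

# Embed a detected phenotype term and a candidate HPO term name,
# then score the pair by cosine similarity (illustrative example)
term_vec = model.embed_sentence("recurrent seizures")
hpo_vec = model.embed_sentence("seizure")  # name of HP:0001250
similarity = 1 - distance.cosine(term_vec[0], hpo_vec[0])
print(f"cosine similarity: {similarity:.3f}")
```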
You can reproduce the PhenoGPT model with different base models on the BiolarkGSC+ dataset. To fine-tune a specialized phenotype recognition language model, we recommend following this notebook script for details. (The notebook covers both the Llama and Falcon implementations; for GPT-J, please refer to this script.)
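For intuition, the fine-tuning data boils down to (clinical text, annotated phenotypes) pairs rendered into an instruction prompt. The template below is purely hypothetical; the actual prompt format is defined in the notebook:

```python
# Hypothetical instruction-tuning example; see the notebook for the real template.
abstract = "The patient presented with recurrent seizures and microcephaly."
annotations = ["recurrent seizures", "microcephaly"]  # gold-standard spans from BiolarkGSC+

prompt = (
    "### Instruction:\n"
    "Identify all phenotype terms in the following clinical text.\n\n"
    f"### Input:\n{abstract}\n\n"
    "### Response:\n"
)
training_example = prompt + "; ".join(annotations)
```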
If you simply want to run PhenoGPT on your local machine for inference, the fine-tuned models are saved in the model directory. Please follow the inference section of the script to run the model.
Please use the following command:
python inference.py -i your_input_folder_directory -o your_output_folder_directory -id yes
-i: path to a single input txt file or a folder of input txt files
-o: path to the output folder
-id: specify 'yes' to obtain the corresponding HPO IDs for the detected phenotypes, otherwise 'no' (default: 'yes')
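inference.py is the supported entry point, but for orientation, inference amounts to attaching the LoRA weights to the base model and generating from a prompt. A rough sketch, assuming the directory layout above (the input file name and prompt wording are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("./model/llama2/llama2_base/")
base = AutoModelForCausalLM.from_pretrained(
    "./model/llama2/llama2_base/", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./model/llama2/")  # directory holding the LoRA weights

note = open("./input/example.txt").read()  # hypothetical input file
prompt = f"Identify all phenotype terms in the following clinical text:\n{note}\n"  # illustrative
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```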
Since PhenoBCBERT was fine-tuned on a proprietary CHOP dataset, we cannot release the model publicly. Please refer to the paper for its results.
Yang, J., Liu, C., Deng, W., Wu, D., Weng, C., Zhou, Y., & Wang, K. (2023). Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (New York, N.Y.), 5(1), 100887. https://doi.org/10.1016/j.patter.2023.100887