Evaluation of large language models for discovery of gene set function

Description

Code associated with the paper "Evaluation of large language models for discovery of gene set function"

Dependencies

Set up an environment

conda create -n llm_eval python=3.11.5

Set up an environment variable to store the GPT-4 API key

conda activate llm_eval
conda env config vars set OPENAI_API_KEY="<your api key>" 
conda deactivate  # deactivate, then reactivate so the variable takes effect

conda activate llm_eval
echo $OPENAI_API_KEY  # confirm the key is set

%python
import os
import openai
 
openai.api_key = os.environ["OPENAI_API_KEY"]

See the OpenAI website for best practices on API key safety.
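A minimal sketch of reading the key defensively, so that a missing variable fails with a readable message rather than a bare `KeyError`. The helper name `load_api_key` is hypothetical and not part of this repository; it only reads the environment and never calls the OpenAI API.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored in `name`, or raise a descriptive error.

    `env` defaults to os.environ; passing a plain dict makes the helper testable.
    (Hypothetical helper, not part of the repository.)
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; run `conda env config vars set {name}=...` "
            "and reactivate the environment."
        )
    return key
```

With the `openai` import shown above, the assignment would become `openai.api_key = load_api_key()`.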

Python requirements:

The code was developed using Python 3.11.5.

git clone git@github.com:idekerlab/llm_evaluation_for_gene_set_interpretation.git

cd llm_evaluation_for_gene_set_interpretation

pip install -r requirements.txt

UPDATE 12/17/2024: the openai package requires an httpx version that is incompatible with its own functions. Manually downgrade httpx to 0.27.2 until OpenAI fixes the bug:

pip uninstall httpx
pip install httpx==0.27.2
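To confirm the downgrade took effect, the installed version can be read from package metadata. This is a small sketch using the standard-library `importlib.metadata`; the function names `installed_version` and `matches_pin` are illustrative and not part of the repository.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned="0.27.2"):
    """True only when the installed version equals the pinned version exactly."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_version("httpx")
    print(f"httpx installed: {found}; matches pin 0.27.2: {matches_pin(found)}")
```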

DDOT is required for downloading GO and can be installed in one of two ways:

To install DDOT by downloading the zip file of the source tree:

wget https://github.com/idekerlab/ddot/archive/refs/heads/python3.zip
unzip python3.zip
cd ddot-python3
python setup.py bdist_wheel
pip install dist/ddot*py3*whl

To install DDOT by cloning the repo:

git clone --branch python3 https://github.com/idekerlab/ddot.git
cd ddot
python setup.py bdist_wheel
pip install dist/ddot*py3*whl

Documentation

The notebooks are numbered according to the evaluation steps

  1. Data Preparation (this step can be omitted for testing purposes)

    The data is already in the data directory (refer to the README in that directory for detailed information about the data)

    If you need to download GO, run the code below:

    ## download and parse GO_BP terms
    # run in the command line (shell assignments take no spaces around "=")
    outdir='data/GO_BP/'
    namespace='biological_process'
    python process_the_gene_ontology.py "$outdir" --namespace "$namespace"
    

    Then run the notebook for parsing GO terms.

    Contamination is added to the gene sets in this notebook.

    If you need to download the omics data, run the notebook. It processes the omics data and saves them to a tab-delimited text file.

  2. Query GPT-4 for names and supporting analysis and run functional enrichment

    GO gene set GPT-4 analysis is stored in Run_LLM_analysis

    GO gene set analysis with different models

    Batch run 1000 GO terms using slurm job with the parameter file

    Omics gene set GPT-4 analysis and omics gene set gProfiler analysis

    ## example code to process from 1st to 5th terms in the table
    # run in the command line  
    
    input_file='data/GO_term_analysis/toy_example.csv' #input table path
    config='./jsonFiles/GOLLMrun_config.json' #configuration file 
    set_index='GO' #index of the table
    gene_column='Genes' #name of the gene list column
    start=0
    end=5   
    out_file='data/GO_term_analysis/LLM_processed_toy_example_gpt_4' #output path prefix
    
    source activate llm_eval
    # Run the Python script for the given range
    python query_llm_for_analysis.py --config $config \
                --initialize \
                --input $input_file \
                --input_sep ',' \
                --set_index $set_index \
                --gene_column $gene_column \
                --gene_sep ' ' \
                --start $start \
                --end $end \
                --output_file $out_file
    
  3. Semantic Similarity evaluation of names

    GO gene set analysis evaluation

    # get the ranking of similarities from the GO gene set analysis
    
    python rank_GOterm_LLM_sim_rand.py \
                --input_file ./data/GO_term_analysis/LLM_processed_toy_example_w_contamination_gpt_4.tsv \
                --emb_file data/all_go_terms_embeddings_dict.pkl \
                --topn 3 \
                --output_file ./data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv \
                --background_file data/GO_term_analysis/all_go_sim_scores_toy.txt
    
  4. Further evaluation of the performance: model comparison evaluation, gene set functional enrichment, and gene set similarity comparison

    Evaluation Task 1 related

    Model Comparison

    Analysis related to Fig. 2A: compare the semantic similarities between models

    Analysis related to Fig. 3: run GO gene set functional enrichment for control

    Compare the confidence score between real, contaminated, and random gene sets

    Check broader concepts of the LLM names

    Analysis for Fig. 2d

    Analysis of whether the best-matching GO term is a broader concept than the queried term

    Evaluation Task 2 related

    Count the genes supporting each LLM name, then calculate the LLM name Jaccard Index

    Analysis related to Fig. 4

    Omics data naming evaluation

    To evaluate whether the LLM name matches any significantly enriched GO term name, use this notebook

  5. Development and assessment of the citation module

  6. Quantification of citation module: check citation module

  7. Visualization of results

    Extended Data Fig. 1 + Fig. 2 + Fig. 3

    Extract sub-hierarchy (Fig. 2e)

    Omics figures (Fig. 4, Extended Data Fig. 5)
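Step 2 above processes the rows between `start` and `end` of the input table, and its example (`start=0`, `end=5`) covers the 1st to 5th terms, which suggests a half-open range. Assuming that convention holds, the sketch below splits a table into consecutive ranges that could be passed as `--start` and `--end` to separate jobs. The function `batch_ranges` and the chunk size are hypothetical; the repository's own slurm parameter file may be organized differently.

```python
def batch_ranges(n_rows, chunk_size):
    """Split row indices 0..n_rows-1 into half-open (start, end) ranges.

    Assumes `end` is exclusive, as the start=0/end=5 example suggests.
    """
    if n_rows < 0 or chunk_size <= 0:
        raise ValueError("n_rows must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


if __name__ == "__main__":
    # e.g. 1000 GO terms in chunks of 250 rows (the chunk size is an assumption)
    for start, end in batch_ranges(1000, 250):
        print(f"--start {start} --end {end}")
```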
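Step 3 above ranks the similarity between an LLM-proposed name and the GO term names. A minimal sketch of that idea follows, assuming cosine similarity over precomputed embedding vectors. The metric, the function names, and the toy 2-D vectors are assumptions for illustration and are not taken from `rank_GOterm_LLM_sim_rand.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_true_term(llm_vec, go_vecs, true_term, topn=3):
    """Return (1-based rank of true_term, the topn most similar GO terms)."""
    scores = {term: cosine(llm_vec, vec) for term, vec in go_vecs.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_term) + 1, ordered[:topn]


if __name__ == "__main__":
    # Toy embeddings; real embeddings would be loaded from the --emb_file pickle.
    go_vecs = {
        "DNA repair": [1.0, 0.0],
        "cell cycle": [0.0, 1.0],
        "DNA replication": [0.8, 0.6],
    }
    print(rank_true_term([0.9, 0.1], go_vecs, "DNA replication", topn=2))
```

A rank of 1 would mean the queried GO term is the closest term to the LLM name among all GO terms considered.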
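Evaluation Task 2 in step 4 calculates a Jaccard Index for the LLM name. The standard definition is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two gene lists; which two sets the notebook actually compares (here, the genes supporting the LLM name versus the genes in the queried set) is an assumption, as is the function name.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene collections (0.0 if both are empty)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    supporting = ["TP53", "BRCA1", "ATM"]   # hypothetical genes supporting the name
    gene_set = ["BRCA1", "ATM", "CHEK2"]    # hypothetical queried gene set
    print(jaccard_index(supporting, gene_set))
```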

License

MIT License

Citing

Hu, M., Alkhairy, S., Lee, I. et al. Evaluation of large language models for discovery of gene set function. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02525-x