RENET2: High-Performance Full-text Gene-Disease Relation Extraction with Iterative Training Data Expansion
Contact: Junhao Su
Email: jhsu@cs.hku.hk
Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–disease associations from full-text articles, but there exists a large room for improvements. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene–disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source-code, manually curated abstract/full-text training data, and results of RENET2 are available at this repo.
RENET2 is published in NAR Genomics and Bioinformatics.
- What's new
- Reference for application
- Installation
- Download Data and Trained Models
- Usage
- Understand Output File
- Dataset
- Modules Descriptions
- Benchmark
- (Optional) Visualization
-
20210716
The paper of RENET2 is published. We updated and fixed the empty parsed dataset problem, and updated the parsed full-text dataset in data/ft_data.
-
20210514
Update README with data link: http://www.bio8.cs.hku.hk/RENET2/renet2_data_models.tar.gz. The full-test annotated dataset is available at
/data/ft_info folder
in the download files. Please check this link1 and link2 for more detail.Add RENET testing script for full-text dataset
-
20210330
We can install RENET2 via bioconda now! and the code for the RENET2 is refined as a python package.
- Microsoft's BiomedNLP-PubMedBERT, from James Morrill. It achieves an F1 score of 0.8 at the abstract dataset.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create conda environment named "renet2-env"
conda create -n renet2-env -c bioconda renet2
conda activate renet2-env
# run renet2 like this afterwards
renet2 --help
# create renet2 env
conda create -n renet2-env python=3.7
conda activate renet2-env
# install required package
conda install -c conda-forge ruby scikit-learn=0.22.2.post1 pandas=1.0.1 numpy=1.18.1 tqdm=4.42.1
conda install pytorch==1.2.0 cudatoolkit=10.0 -c pytorch
git clone https://github.com/sujunhao/RENET2.git
cd RENET2
pip install . --no-deps --ignore-installed
# run renet2 like this afterwards
renet2 --help
Download all required files
All data and models are available at this link: http://www.bio8.cs.hku.hk/RENET2/renet2_data_models.tar.gz, please using the following scripts to download data for RENET2.
### if RENET2 is installed from Bioconda
mkdir RENET2
cd RENET2
RENET2_DATA_S_URL=https://raw.githubusercontent.com/sujunhao/RENET2/main/src/renet2/download_renet2_data.sh
curl -s ${RENET2_DATA_S_URL} | bash -s .
R2_DIR=$(pwd)
### if RENET2 is installed from GitHub
#### make sure you are in the root dir of RENET2
bash src/renet2/download_renet2_data.sh .
R2_DIR=$(pwd)
# quick testing
# R2_DIR="[DATA/MODEL_PATH]" # e.g. ~/git/RENET2, check 'Download Data and Trained Models'
renet2 predict --raw_data_dir ${R2_DIR}/data/ft_data/ --gda_fn_d ${R2_DIR}/data/ft_gda/ --models_number 4 --batch_size 8 --max_doc_num 10 --no_cache_file --model_dir ${R2_DIR}/models/ft_models/
# check predicted results
# predicted gene-disease associations
less ${R2_DIR}/data/ft_gda/gda_rst.tsv
# help page for renet2
renet2 --help
# to run a submodule using python
renet2 [submodule] [options]
R2_DIR="[DATA_MODEL_PATH]" # e.g. ~/git/RENET2, check 'Download Data and Trained Models'
## for using RENET2, please make sure that -
## 1. in RENET2-env environment (using 'conda activate RENET2-env' to setup RENET2 environment)
## 2. follow the 'Download Data and Trained Models' to download RENET2 dataset and trained models first
## 3. setup the `R2_DIR` variable as in 'Setup variables for renet2'
## use --use_cuda if you have GPUs and want to use GPUs
# set RENET2's models dir, noted that trained model already in this dir
MODEL_DIR=${R2_DIR}/models/ft_models
# train 10 RENET2 models (optional, trained model already in the models dir)
MODEL_DIR=${R2_DIR}/models/ft_models_test
renet2 train --raw_data_dir ${R2_DIR}/data/ft_data/ --annotation_info_dir ${R2_DIR}/data/ft_info --model_dir ${MODEL_DIR} --pretrained_model_p ${R2_DIR}/models/Bst_abs_10 --epochs 10 --models_number 10 --batch_size 60 --have_SiDa ${R2_DIR}/data/ft_info/ft_base/ft_base --gda_fn_d ${R2_DIR}/data/ft_gda/ --use_cuda
# use trained RENET2 models to predict GDAs (using --is_sensitive_mode to enable RENET2-Sensitive mode)
# maximum using 10 models to predict
renet2 predict --raw_data_dir ${R2_DIR}/data/ft_data/ --model_dir ${MODEL_DIR} --models_number 2 --batch_size 60 --gda_fn_d ${R2_DIR}/data/ft_gda/ --use_cuda
# check predicted GDAs
less ${R2_DIR}/data/ft_gda/gda_rst.tsv
# apply 5-fold cross-validation to test RENET2 performance
renet2 evaluate_renet2_ft_cv --epochs 10 --raw_data_dir ${R2_DIR}/data/ft_data/ --annotation_info_dir ${R2_DIR}/data/ft_info/ --rst_file_prefix ft_base --have_SiDa ${R2_DIR}/data/ft_info/ft_base/ft_base --pretrained_model_p ${R2_DIR}/models/Bst_abs_10 --no_cache_file --use_cuda
Input: PMID and PMCID list [example: RENET2/test/test_download_pmcid_list.csv]
Output: Gene-Disease Assoications [example: will generate at RENET2/data/test_data/gda_rst.tsv]
pipeline with example
Input data: PMID and PMCID list ${R2_DIR}/test/test_download_pmcid_list.csv
- download text and NER annotations
# download abstract and its annotations
# (download abstract is required for the full-text case, as some full-text at PTC did not have an abstract section, should download separately)
renet2 download_data --process_n 3 --id_f ${R2_DIR}/test/test_download_pmcid_list.csv --type abs --dir ${R2_DIR}/data/raw_data/abs/ --tmp_hit_f ${R2_DIR}/data/test_data/hit_id_l.csv
# download full-text and its annotations
renet2 download_data --process_n 3 --id_f ${R2_DIR}/test/test_download_pmcid_list.csv --type ft --dir ${R2_DIR}/data/raw_data/ft/ --tmp_hit_f ${R2_DIR}/data/test_data/hit_id_l.csv
- parse text and enetities annotations to RENET2 input format
# parse data
renet2 install_geniass # install geniass, only run one time
conda install ruby # install ruby
renet2 parse_data --id_f ${R2_DIR}/test/test_download_pmcid_list.csv --type 'ft' --in_abs_dir ${R2_DIR}/data/raw_data/abs/ --in_ft_dir ${R2_DIR}/data/raw_data/ft/ --out_dir ${R2_DIR}/data/test_data/
# normalize NET ID
renet2 normalize_ann --in_f ${R2_DIR}/data/test_data/anns.txt --out_f ${R2_DIR}/data/test_data/anns_n.txt
- run RENET2 on parsed data
MODEL_DIR=${R2_DIR}/models/ft_models # using the pretrained 10 models at ft_models
renet2 predict --raw_data_dir ${R2_DIR}/data/test_data/ --model_dir ${R2_DIR}/models/ft_models/ --gda_fn_d ${R2_DIR}/data/test_data/ --models_number 4 --batch_size 8 --max_doc_num 10 --no_cache_file
Output data: predicted Gene-Disease Associations are stored in ${R2_DIR}/data/test_data/gda_rst.tsv
to try run RENET2 on abstract, you can using the code as:
renet2 predict --raw_data_dir ${R2_DIR}/data/abs_data/2nd_ann/ \
--model_dir ${R2_DIR}/models/ \
--gda_fn_d ${R2_DIR}/data/test_data/ \
--models_number 1 \
--model_name Bst_abs_10 \
--batch_size 8 \
--no_cache_file \
--fix_snt_n 32 \
--file_name_ann anns.txt
# then go to benchmark folder and run the following to checked the trained models
python calculate_metrics_with_input.py ${R2_DIR}/data/abs_data/2nd_ann/labels.txt ${R2_DIR}/data/test_data/gda_rst.tsv
There are 7 columns in the gda_rst.tsv:
1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|
pmid | geneId | diseaseId | g_name | d_name | prob_avg | prob_X |
where pmid
is Article PubMed Id, geneId
is the Entrez Gene ID (Entrez), diseaseId
is the Disease Id (MESH), g_name
is the gene name (a ID with multiple names will be seperated by '|'), d_name
is the disease name (a ID with multiple names is seperated by '|'), prob_avg
is the predicted mean GDP (gene-disease probability) of all 10 models,
prob_X
is the predicted GDP of each models.
Make sure you downloaded data at the [Download Data and Trained Models] section.
.
├── data
│ ├── ft_data # full-text dataset
│ │ ├── docs.txt # articles with ID/title/abstract/main text
│ │ ├── sentences.txt # sentences from articles [collected from geniass]
│ │ ├── anns.txt # gene/disease annotations [collected from PubTator Central]
│ │ ├── anns_n.txt # gene/disease annotations with normalize annotated ID
│ │ ├── labels.txt # gene-disease assoications table
│ │ └── s_docs.txt # articles with section's ID (for visualization of annotated results)
│ ├── abs_data # abstract dataset
│ │ ├── 1st_ann # abstract dataset, first round
│ │ │ └── ...
│ │ ├── 2nd_ann # abstract dataset, second round [Abstract-exp in paper]
│ │ │ └── ...
│ │ ├── ori # training dataset from RENET
│ │ │ └── ...
│ │ └── ori_test # testing dataset from RENET
│ │ └── ...
│ └── ...
└── ...
Annotated gene-disease associations based on iterative training data expansion strategy. These are the original annotation files, the parsed files are located at the parsed dataset, please check it accordingly.
.
├── data
│ ├── ft_info
│ │ └── ft_500_n.tsv # annotated full-text GDA (fisrt and second round)
│ ├── ann_table
│ │ ├── ann_1st.tsv # annotated abstract GDA (fisrt round)
│ │ └── ann_2nd.tsv # annotated abstract GDA (second round)
│ └── ...
└── ...
Make sure you downloaded data at the [Download Data and Trained Models] section.
.
├── data
│ ├── pmc
│ │ └──gda_rst.tsv # GDA from PMC
│ ├── litcovid
│ │ └──gda_rst.tsv # GDA from LitCovid
│ └── ...
└── ...
Modules in renet2
are for model training/testing.
For the Modules listed below, please use the -h
or --help
option for checking available options.
renet2 |
renet2 program |
---|---|
train |
Module for training RENET2 models. |
predict |
Using RENET2 models to predict gene-disease associations. |
evaluate_renet2_ft_cv |
Evaluating trained RENET2 models and using cross-validation. |
download_data |
Downloading articles from PMC/PTC with provided PMID/PMCID list. (please check example an RENET2/src/nb_scripts/pre_precoss/ for full-text dataset) |
parse_data |
Parsing articles from RENET2. (please check example an RENET2/src/nb_scripts/pre_precoss/ for full-text dataset) |
normalize_ann |
Normlize the annotation ID |
install_geniass |
Install geniass for parse_data module, if fail, please try conda install ruby to install ruby first |
pip install pymongo
pip install regex
cd benchmark/BeFree
git clone git@bitbucket.org:ibi_group/befree.git
wget http://www.bio8.cs.hku.hk/RENET2/renet2_bm_befree.tar.gz
tar -xf renet2_bm_befree.tar.gz
# get BeFree input
run Generate_BeFree_Input.ipynb on python jypyter notebook to genrate BeFree input
sh benchmark_befree.sh
cd benchmark/DTMiner
wget http://www.bio8.cs.hku.hk/RENET2/renet2_bm_dtminer.tar.gz
tar -xf renet2_bm_dtminer.tar.gz
# get DTMiner input
run Generate_DTMiner_Input.ipynb on python jypyter notebook to genrate BeFree input
sh benchmark_DTMiner.sh
cd benchmark/BioBERT
git clone https://github.com/dmis-lab/biobert
cd biobert; pip install -r requirements.txt
./download.sh
# generate BioBERT input
run Generate_BioBERT_Input.ipynb on python jypyter notebook
# run BioBERT
sh run_bert.sh
cd benchmark
run Generate_RENET_Input.ipynb on python jypyter notebook
renet2 evaluate_renet2_ft_cv --epochs 10 --raw_data_dir ${R2_DIR}/data/ft_data/ --annotation_info_dir ${R2_DIR}/data/ft_info/ --rst_file_prefix ft_base --have_SiDa ${R2_DIR}/data/ft_info/ft_base/ft_base --pretrained_model_p ${R2_DIR}/models/Bst_abs_10 --no_cache_file --use_cuda
# training RENET2 model on abstract data
run ./src/nb_scripts/build_best_model_abs.ipynb on jupyter notebook
# Using cross-validation to benchmarking RENET2 on abstract data
run ./src/nb_scripts/exp_abs.ipynb on jupyter notebook
note that RENET2 can benchmark should be benchmark on abstract data via cross validation.
Found and visualze a pair of gene-disease annotation obtrained from Pubtar Central.
run ./src/nb_scripts/vis_text.ipynb on jupyter notebook