RUN-DVC: Generalizing DL-based variant callers (DVC) via domain adaptation and semi-supervised learning
Contact: Youngmok Jung, Dongsu Han, Young Seok Ju
Email: tom418@kaist.ac.kr, dhan.ee@kaist.ac.kr, ysju@kaist.ac.kr
- Deploying deep learning-based variant callers (DVCs) to a sequencing method with varying error profiles requires generalization, which is challenging because DVCs rely on extensive variant-labeled sequencing data.
- We introduce RUN-DVC, a novel generalization framework for DVCs, which addresses this challenge by treating deployment to a sequencing method of interest as a domain adaptation and semi-supervised learning problem.
- RUN-DVC demonstrates that sequencing error profiles can be effectively learned from unlabeled datasets specific to the sequencing method of interest, leveraging data augmentations, domain adaptation, and semi-supervised learning as key components of the framework.
- Our findings indicate that existing DVCs can benefit from integrating unlabeled datasets from the sequencing methods of interest alongside existing labeled datasets, leading to the development of more robust models or the generalization of models to new sequencing methods with fewer labeled datasets.
- Using only unlabeled data from a sequencing method of interest, RUN-DVC improves variant calling accuracy by up to 6.40%p in SNP F1-score and 9.36%p in INDEL F1-score. See results for further detail.
- RUN-DVC achieves the same variant calling accuracy as the supervised training approach using merely half of the labeled data. See results for further detail.
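The "%p" figures above are percentage-point differences between two F1-scores. As a minimal sketch of that arithmetic, the snippet below computes F1 from true positive, false positive, and false negative counts and then the percentage-point gain between a baseline and an adapted model. The function names and all counts are hypothetical and are not taken from the RUN-DVC code or results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percentage_point_gain(baseline_f1: float, adapted_f1: float) -> float:
    """Difference between two F1-scores, expressed in percentage points (%p)."""
    return (adapted_f1 - baseline_f1) * 100.0


# Hypothetical counts for illustration only.
baseline = f1_score(tp=900, fp=100, fn=100)
adapted = f1_score(tp=950, fp=50, fn=50)
gain = percentage_point_gain(baseline, adapted)
print(f"baseline F1={baseline:.4f}, adapted F1={adapted:.4f}, gain={gain:.2f}%p")
```

A percentage-point gain is an absolute difference, so it should not be read as a relative improvement over the baseline.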
- Prerequisites
- Environment setting
- Dataset generation
- Training Pileup model
- Training CNN model with RUN-DVC
- Calling variants with trained models
- Evaluation with hap.py
- Notes
- Acknowledgement
- Citation
- Generate labeled datasets from source domain.
- Train pileup RNN model using source domain datasets.
- Generate unlabeled datasets from sequencing method of interest (target domain).
- Train CNN model using both labeled and unlabeled datasets.
- Variant calling with the trained model in the sequencing method of interest.
- More than 100GB of RAM (minimum 64GB)
- The RAM requirement is due to the multi-process data loader; revise the number of data loader threads in the train_rundvc.py file if memory is limited.
- More than 10GB of GPU memory size
- More than 32 vCPUs
- Sequencing datasets (BAM and BAI files) to use as source domain (e.g., NovaSeq PCR-free dataset) link
- Sequencing datasets (BAM and BAI files) to use as target domain (e.g., NovaSeq PCR-plus dataset) link
- Truth VCF and BED files for both source and target domain (for evaluation) link
- Reference genome
- [optional] Genome stratifications files link
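The hardware items in the list above can be checked before starting a long run. The following is a rough sketch, not part of RUN-DVC: the thresholds are copied from the list, the function names are hypothetical, RAM detection relies on `os.sysconf` (Linux and other POSIX systems), and GPU memory is read from `nvidia-smi` when that tool is on the PATH.

```python
import os
import shutil
import subprocess

# Thresholds taken from the prerequisites list above.
MIN_RAM_GB = 64        # stated minimum; more than 100 GB is recommended
MIN_GPU_MEM_GB = 10
MIN_VCPUS = 32


def detect_ram_gb():
    """Total physical RAM in GiB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


def detect_gpu_mem_gb():
    """Largest GPU memory in GiB as reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        sizes_mib = [float(line) for line in out.splitlines() if line.strip()]
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None
    return max(sizes_mib) / 1024 if sizes_mib else None


def check_resources(vcpus, ram_gb, gpu_mem_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if vcpus is None or vcpus < MIN_VCPUS:
        problems.append(f"vCPUs: found {vcpus}, want roughly {MIN_VCPUS} or more")
    if ram_gb is None or ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: found {ram_gb} GiB, want at least {MIN_RAM_GB} GiB")
    if gpu_mem_gb is None or gpu_mem_gb < MIN_GPU_MEM_GB:
        problems.append(f"GPU memory: found {gpu_mem_gb} GiB, want at least {MIN_GPU_MEM_GB} GiB")
    return problems


if __name__ == "__main__":
    for problem in check_resources(os.cpu_count(), detect_ram_gb(), detect_gpu_mem_gb()):
        print("WARNING:", problem)
```

Run inside the container, this reports what the container can actually see, which may be less than the host provides.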
- Please install NVIDIA Docker (Install Page Link).
- You can use the docker-hub image or the Dockerfile_torch
# Please check your NVIDIA-driver version and supported Pytorch version
# For Pytorch 1.10
docker pull tom418/rundvc:1.10
# For Pytorch 2.x
docker pull tom418/rundvc:2.0
- or you can build your image locally
# Below builds RUNDVC docker image with Pytorch 1.10, change to 2.0 to use Pytorch 2.0
mkdir tmp_docker/
cp Dockerfile_torch_1.10 tmp_docker/Dockerfile
cd tmp_docker
docker build . --tag rundvc:1.10
- Run the container with the command below
sudo docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 -v /<Your Folder>/:/data --ipc=host --name rundvc <Image_Name>
- Before proceeding, activate conda environment inside docker container
conda activate rundvc
- All commands should be executed inside the docker container.
cd preprocess/realign
g++ -std=c++14 -O1 -shared -fPIC -o realigner ssw_cpp.cpp ssw.c realigner.cpp
g++ -std=c++11 -shared -fPIC -o debruijn_graph -O3 debruijn_graph.cpp -I ${CONDA_PREFIX}/include -L ${CONDA_PREFIX}/lib
cd ../..
- We provide template bash scripts for generating datasets in the data_scripts folder.
- A pair of files (env_XXX.sh and make_dataset_XXX.sh) is required.
- Please see env.sh for configuration and make_dataset.sh for labeled dataset generation.
- To generate a pileup dataset, see make_dataset_pileup.sh; it uses the same env.sh file for configuration.
- Note that ONT datasets require a pileup model to generate full-alignment datasets.
- A pileup model trained in the source domain is required for generating unlabeled datasets; please download or generate a pileup model before proceeding.
- The pileup model scores candidate variants and is used for filtering when there is an excessive number of candidates (over 20 million per human genome). Most short-read datasets, however, have fewer than 20 million candidate variants per genome.
- Please see env_ul.sh and make_ul_dataset.sh for unlabeled dataset generation without Truth VCF files. make_ul_dataset.sh assumes INDEL realignment was already performed during labeled dataset generation; uncomment the realignment code if you need it.
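The candidate filtering mentioned in this list (pileup scores used when a genome yields more than 20 million candidates) can be pictured with a small sketch. How RUN-DVC actually selects candidates is not described here, so the rule below, keeping the highest-scoring candidates up to the cap, is an assumption made for illustration, and the function name and tuple layout are hypothetical.

```python
import heapq

# Cap mentioned in the text above: 20 million candidates per human genome.
MAX_CANDIDATES = 20_000_000


def filter_candidates(candidates, max_candidates=MAX_CANDIDATES):
    """Keep at most max_candidates items from a list of (site, score) tuples.

    Assumption: a higher pileup score means a more plausible variant, so the
    top-scoring candidates are kept. Below the cap, nothing is filtered.
    """
    if len(candidates) <= max_candidates:
        return list(candidates)
    return heapq.nlargest(max_candidates, candidates, key=lambda item: item[1])


# Toy example with a cap of 2 instead of 20 million.
toy = [("chr1:100", 0.2), ("chr1:200", 0.9), ("chr1:300", 0.5)]
print(filter_candidates(toy, max_candidates=2))
```

Since most short-read genomes stay under the cap, the first branch (no filtering) is the common case in this sketch.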
- RUN-DVC uses source domain datasets to train the pileup model.
- The pileup model is used for phasing long reads or filtering candidate variants during unlabeled dataset generation.
# Below is an example of training pileup model
MODEL_FOLDER_PATH="data/train_result_RUNDVC"
BINS_FOLDER_PATH="/data/data_bins_fix/data_bin_P_novaseq_pcr"
python /root/RUN-DVC/RUNDVC/Train_torch.py \
--bin_fn ${BINS_FOLDER_PATH} \
--ochk_prefix ${MODEL_FOLDER_PATH}/pileup \
--add_indel_length True \
--random_validation \
--pileup \
--platform ilmn
- Training RUNDVC.
# Below is an example RUNDVC training script for short reads
MODEL_FOLDER_PATH="data/train_result_RUNDVC"
mkdir -p ${MODEL_FOLDER_PATH}
BINS_FOLDER_PATH="/data/data_bins_fix/data_bin_F_novaseq_pcr"
UL_BINS_FOLDER_PATH="/data/data_bins_fix/data_NOSELECT_novaplus"
PLATFORM="ilmn"
python /root/RUN-DVC/RUNDVC/train_rundvc.py \
--random_validation --maxEpoch 50 \
--bin_fn ${BINS_FOLDER_PATH} \
--bin_fn_ul ${UL_BINS_FOLDER_PATH} \
--ochk_prefix ${MODEL_FOLDER_PATH} \
--platform ${PLATFORM} \
--USE_SWA True --swa_start_epoch 19
- Training BaselineBN or Full-label model and ablation study
# You can use the arguments below to disable the SSL or RLI module.
# With both disabled, training falls back to the supervised method, BaselineBN
--USE_RLI False --USE_RLI False
# Below is an example for training in supervised manner, while providing validation error for both source and target domain.
MODEL_FOLDER_PATH="data/train_result_RUNDVC"
mkdir -p ${MODEL_FOLDER_PATH}
BINS_FOLDER_PATH="/data/data_bins_fix/data_bin_F_novaseq_pcr"
UL_BINS_FOLDER_PATH="/data/data_bins_fix/data_NOSELECT_novaplus"
PLATFORM="ilmn"
python /root/RUN-DVC/RUNDVC/train_rundvc.py \
--random_validation --maxEpoch 50 \
--bin_fn ${BINS_FOLDER_PATH} \
--bin_fn_ul ${UL_BINS_FOLDER_PATH} \
--ochk_prefix ${MODEL_FOLDER_PATH} \
--platform ${PLATFORM} \
--USE_SWA True --USE_RLI False --USE_RLI False --swa_start_epoch 19
- Training RUNDVC in SSDA setting
# Additional parameters are used for SSDA setting
# --bin_fn_tl is for the directory of target labeled datasets
# --tl_size is the amount of labeled datasets to use
# --seed for setting random seed
# Below is an example for training in SSDA setting.
MODEL_FOLDER_PATH="data/train_result_RUNDVC"
mkdir -p ${MODEL_FOLDER_PATH}
BINS_FOLDER_PATH="/data/data_bins_fix/data_bin_F_novaseq_pcr"
TL_BINS_FOLDER_PATH="/data/data_bins_fix/data_NOSELECT_novaplus"
UL_BINS_FOLDER_PATH="/data/data_bins_fix/data_NOSELECT_novaplus"
PLATFORM="ilmn"
python /root/RUN-DVC/RUNDVC/train_rundvc.py \
--random_validation --maxEpoch 50 \
--bin_fn ${BINS_FOLDER_PATH} \
--bin_fn_ul ${UL_BINS_FOLDER_PATH} \
--bin_fn_tl ${TL_BINS_FOLDER_PATH} \
--ochk_prefix ${MODEL_FOLDER_PATH} \
--platform ${PLATFORM} \
--USE_SWA True --swa_start_epoch 19 --tl_size 100000 --seed 1
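To illustrate why `--tl_size` and `--seed` are set together, the sketch below draws a fixed-size, reproducible subset of target labeled examples. This is an assumption about the general idea only: how train_rundvc.py actually applies these two arguments is not shown here, and the function name is hypothetical.

```python
import random


def pick_target_labeled(num_available: int, tl_size: int, seed: int):
    """Return sorted indices of the target labeled examples to use.

    The same (num_available, tl_size, seed) always yields the same subset,
    which keeps semi-supervised experiments comparable across runs.
    """
    if tl_size >= num_available:
        return list(range(num_available))
    rng = random.Random(seed)  # local generator; global random state is untouched
    return sorted(rng.sample(range(num_available), tl_size))


subset = pick_target_labeled(num_available=1_000_000, tl_size=100_000, seed=1)
print(len(subset), subset[:5])
```

Changing only the seed gives a different subset of the same size, which is useful for reporting variation across repeated runs.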
- Use below command to merge encoder and classifier.
# Please revise the arguments
python /root/RUN-DVC/RUNDVC/merge_model.py --model <SAVE_MODEL.best> --output <Output Directory> --platform <ilmn,hifi,ont>
- See rundvc_callvariants.sh or the example below.
# Please revise the arguments before using.
# For short reads
mkdir -p /data/output/calls/
BAM_FILE="/data/HG003.novaseq.pcr-free.30x.dedup.grch38.bam"
DATA_NAME="baseline_novaplus2novafree"
MODEL="/data/rundmc/baseline_novaplus.pt"
./run_rundmc.sh \
--rundvc_call_mut \
--bam_fn=${BAM_FILE} \
--bed_fn=/data/data_HG00X/HG003_GRCh38_1_22_v4.2.1_benchmark.bed \
--ref_fn=/data/human_ref/hg38/Homo_sapiens_assembly38.fasta \
--threads=94 \
--chunk_num=50 \
--platform="ilmn" \
--fa_model=${MODEL} \
--no_phasing_for_fa \
--output=/data/output/calls/rundvc_${DATA_NAME}
# For PacBio HiFi with pileup model
P_MODEL="/data/rundmc/p_model.pt"
FA_MODEL="/data/rundmc/fa_model.pt"
./run_rundmc.sh \
--rundvc_call_mut \
--bam_fn=${BAM_FILE} \
--bed_fn=/data/data_HG00X/HG003_GRCh38_1_22_v4.2.1_benchmark.bed \
--ref_fn=/data/human_ref/hg38/Homo_sapiens_assembly38.fasta \
--threads=94 \
--chunk_num=50 \
--platform="hifi" \
--pileup_model=${P_MODEL} \
--fa_model=${FA_MODEL} \
--no_phasing_for_fa \
--output=/data/output/calls/rundvc_${DATA_NAME}
# For ONT with pileup model
P_MODEL="/data/rundmc/p_model.pt"
FA_MODEL="/data/rundmc/fa_model.pt"
./run_rundmc.sh \
--rundvc_call_mut \
--bam_fn=${BAM_FILE} \
--bed_fn=/data/data_HG00X/HG003_GRCh38_1_22_v4.2.1_benchmark.bed \
--ref_fn=/data/human_ref/hg38/Homo_sapiens_assembly38.fasta \
--threads=94 \
--chunk_num=50 \
--platform="ont" \
--pileup_model=${P_MODEL} \
--fa_model=${FA_MODEL} \
--no_phasing_for_fa \
--output=/data/output/calls/rundvc_${DATA_NAME}
# Below is an example of running hap.py with stratifications, please revise the arguments
mkdir -p /data/output/full_label
docker run -it -v <Your Path>:/data pkrusche/hap.py /opt/hap.py/bin/hap.py \
/data/data_HG00X/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/data/rundmc/Fulllabel_hiseqX/merge_output.vcf.gz \
-o /data/output/full_label/full_label_hg003_hiseqX \
-r /data/human_ref/hg38/Homo_sapiens_assembly38.fasta \
-f /data/data_HG00X/HG003_GRCh38_1_22_v4.2.1_benchmark.bed \
--threads 64 --engine=vcfeval --roc QUAL \
--stratification /data/genome-stratifications/GRCh38/v3.1-GRCh38-all-stratifications.tsv
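After hap.py finishes, it writes a summary table next to the output prefix (here that would be the prefix given to `-o` followed by `.summary.csv`). The sketch below reads the F1-scores per variant type from such a table. It assumes the usual `Type`, `Filter`, and `METRIC.F1_Score` columns; the function name is hypothetical and the numbers in the toy table are made up.

```python
import csv
import io


def read_f1_scores(summary_csv, filter_name="PASS"):
    """Return {variant type: F1} for rows whose Filter column equals filter_name.

    summary_csv is any iterable of CSV lines, such as an open file handle.
    """
    scores = {}
    for row in csv.DictReader(summary_csv):
        if row["Filter"] == filter_name:
            scores[row["Type"]] = float(row["METRIC.F1_Score"])
    return scores


# Toy summary with made-up numbers, standing in for a real summary file.
example = io.StringIO(
    "Type,Filter,METRIC.Recall,METRIC.Precision,METRIC.F1_Score\n"
    "INDEL,PASS,0.98,0.99,0.985\n"
    "SNP,PASS,0.995,0.997,0.996\n"
    "SNP,ALL,0.995,0.990,0.9925\n"
)
scores = read_f1_scores(example)
print(scores)
```

With real output, open the summary file with `open(path, newline="")` and pass the handle to `read_f1_scores`.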
- RUNDVC uses a significant amount of CPU during training owing to data augmentations.
- RUN-DVC makes heavy use of the open-source project Clair3.
This project includes code from Clair3, which is licensed under the BSD 3-Clause License.
The original source code for Clair3 can be found at https://github.com/HKU-BAL/Clair3.
The modifications to Clair3 in this project are made by Youngmok Jung and are licensed under the BSD 3-Clause license.
If you find RUN-DVC interesting and want to cite it, please cite the following paper:
@article {Jung2023.08.12.549820,
author = {Youngmok Jung and Jinwoo Park and Hwijoon Lim and Jeong Seok Lee and Young Seok Ju and Dongsu Han},
title = {Generalizing deep variant callers via domain adaptation and semi-supervised learning},
elocation-id = {2023.08.12.549820},
year = {2023},
doi = {10.1101/2023.08.12.549820},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/08/14/2023.08.12.549820},
eprint = {https://www.biorxiv.org/content/early/2023/08/14/2023.08.12.549820.full.pdf},
journal = {bioRxiv}
}
- Training DVC on simulated reads and transferring knowledge to real-world data
- Extending RUN-DVC to somatic mutation calling
- Active learning for improving label-efficiency
- Improving performance of RUN-DVC with hyper-parameter adjustment and data-augmentations
- Evaluating RUN-DVC on different species