This is our implementation for the paper:
Jialin He*, Lei Xiong*#, Shaohui Shi, Chengyu Li, Kexuan Chen, Qianchen Fang, Jiuhong Nan, Ke Ding, Yuanhui Mao, Carles A. Boix, Xinyang Hu, Manolis Kellis, Jingyun Li and Xushen Xiong#. Deep learning prediction of ribosome profiling with Translatomer reveals translational regulation and interprets disease variants.
Translatomer is a transformer-based multi-modal deep learning framework that predicts ribosome profiling tracks using genomic sequence and cell-type-specific RNA-seq as input.
If you use our code or datasets in your research, please cite:
@article{He2024.11.23.translatomer,
title = {Deep learning prediction of ribosome profiling with Translatomer reveals translational regulation and interprets disease variants},
author = {Jialin He and Lei Xiong and Shaohui Shi and Chengyu Li and Kexuan Chen and Qianchen Fang and Jiuhong Nan and Ke Ding and Yuanhui Mao and Carles A. Boix and Xinyang Hu and Manolis Kellis and Jingyun Li and Xushen Xiong},
year = {2024},
doi = {10.1038/s42256-024-00915-6},
url = {https://doi.org/10.1038/s42256-024-00915-6},
journal = {Nature Machine Intelligence}
}
To run this project, you need the following prerequisites:
- Python 3.9
- PyTorch 1.13.1+cu117
- Other required Python libraries (please refer to requirements.txt)
You can install all the required packages using the following command:
conda create -n pytorch python=3.9.16
conda activate pytorch
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
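After installation, a quick check (a minimal sketch, not specific to Translatomer) confirms that the pinned PyTorch build can see your GPU:

import torch

print(torch.__version__)           # expected: 1.13.1+cu117
print(torch.cuda.is_available())   # True if the CUDA 11.7 build found a GPU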
Example data for model training can be downloaded from Zenodo
- Put all input files in a data folder. The input files must be organized as follows (a quick layout check is sketched after the tree):
+ data
  + hg38
    + K562
      + GSE153597
        + input_features
          ++ rnaseq.bw
        + output_features
          ++ riboseq.bw
    + HepG2
      + GSE174419
        + input_features
          ++ rnaseq.bw
        + output_features
          ++ riboseq.bw
    + ...
    ++ gencode.v43.annotation.gff3
    ++ hg38.fa
    ++ hg38.fai
    ++ mean.sorted.bw
  + mm10
    + ...
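Before generating features, a short check like the following can catch path mistakes early. This is a sketch that mirrors the K562/GSE153597 example above; adjust the assembly, cell type, and study to your own data:

import os

assembly, celltype, study = "hg38", "K562", "GSE153597"
study_dir = os.path.join("data", assembly, celltype, study)

# Per-study input/output tracks
for sub, fname in [("input_features", "rnaseq.bw"), ("output_features", "riboseq.bw")]:
    path = os.path.join(study_dir, sub, fname)
    assert os.path.exists(path), f"missing {path}"

# Assembly-level reference files
for ref in ["gencode.v43.annotation.gff3", "hg38.fa", "hg38.fai", "mean.sorted.bw"]:
    assert os.path.exists(os.path.join("data", assembly, ref)), f"missing {ref}"
print("data layout looks good")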
- To generate training data, use the following command:
python generate_features_4rv.py [options]
[options]:
- --assembly Genome assembly of the data. Default = 'hg38'.
- --celltype Name of the cell line. Default = 'K562'.
- --study GEO accession number of the data. Default = 'GSE153597'.
- --region_len Length (in bp) of each sequence window. Default = 65536.
- --nBins Number of bins each window is divided into (see the binning sketch after this list). Default = 1024.
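To make the two size parameters concrete: each training window of region_len bp is averaged into nBins equal bins (64 bp per bin at the defaults). Below is a minimal pyBigWig sketch of that binning; the file path and window coordinates are hypothetical, and the real preprocessing is done by generate_features_4rv.py:

# Illustration of --region_len / --nBins: average a 65,536-bp window
# into 1,024 bins of 64 bp each from a BigWig track.
import pyBigWig

region_len, n_bins = 65536, 1024                # defaults above: 64 bp per bin
bw = pyBigWig.open("data/hg38/K562/GSE153597/input_features/rnaseq.bw")
chrom, start = "chr1", 1_000_000                # hypothetical window
values = bw.stats(chrom, start, start + region_len, type="mean", nBins=n_bins)
print(len(values))                              # 1024 binned mean-coverage values
bw.close()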
Example commands:
find data/ -type d -name 'output_features' -exec mkdir -p '{}/tmp' \;
find data/ -type d -name 'input_features' -exec mkdir -p '{}/tmp' \;
nohup python generate_features_4rv.py --assembly hg38 --celltype HepG2 --study GSE174419 --region_len 65536 --nBins 1024 &
nohup python generate_features_4rv.py --assembly hg38 --celltype K562 --study GSE153597 --region_len 65536 --nBins 1024 &
To train the Translatomer model, use the following command:
python train_all_11fold.py [options]
[options]:
- --seed Random seed for training. Default = 2077.
- --save_path Path for saving model checkpoints. Default = 'checkpoints'.
- --data-root Root path of the training data (required). Default = 'data'.
- --assembly Genome assembly for the training data. Default = 'hg38'.
- --dataset Text file listing the data roots to train on (e.g. data_roots_mini.txt in the examples below).
- --model-type Type of model to use for training. Default = 'TransModel'.
- --fold Which fold of the 11-fold training to run. Default = 0.
- --patience Number of epochs without improvement before early stopping. Default = 8.
- --max-epochs Maximum number of training epochs. Default = 128.
- --save-top-n Number of top models to save during training. Default = 20.
- --num-gpu Number of GPUs to use for training. Default = 1.
- --batch-size Batch size for data loading. Default = 32.
- --ddp-disabled Flag to disable DDP (Distributed Data Parallel); if provided, DDP is turned off and the batch size is adjusted accordingly.
- --num-workers Number of dataloader workers. Default = 1.
Example commands:
nohup python train_all_11fold.py --save_path results/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold0 --data-root data --assembly hg38 --dataset data_roots_mini.txt --model-type TransModel --fold 0 --patience 6 --max-epochs 128 --save-top-n 128 --num-gpu 1 --batch-size 32 --num-workers 1 >DNA_logs/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold0.log 2>&1 &
nohup python train_all_11fold.py --save_path results/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold1 --data-root data --assembly hg38 --dataset data_roots_mini.txt --model-type TransModel --fold 1 --patience 6 --max-epochs 128 --save-top-n 128 --num-gpu 1 --batch-size 32 --num-workers 1 >DNA_logs/bigmodel_h512_l12_lr1e-5_wd0.05_ws2k_p32_fold1.log 2>&1 &
- Load pretrained model: the pretrained model can be downloaded from Zenodo (a quick inspection sketch follows this list).
- An example notebook containing code for applying Translatomer is here.
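Before plugging the checkpoint into the notebook above, you can inspect it with a few lines. This sketch assumes a standard PyTorch(-Lightning) .ckpt file, and the file name is hypothetical:

import torch

ckpt = torch.load("translatomer_pretrained.ckpt", map_location="cpu")
print(ckpt.keys())                       # e.g. 'state_dict', 'hyper_parameters', ...
state = ckpt.get("state_dict", ckpt)     # Lightning wraps weights in 'state_dict'
print(len(state), "parameter tensors")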
This project is licensed under the MIT License.
For any questions or inquiries, please contact xiongxs@zju.edu.cn.