Deep learning for modeling gene regulatory network
© Contributors, 2019. Licensed under an [Apache-2] license.
DeepGRN has been developed and used by the Bioinformatics, Data Mining and Machine Learning Laboratory (BDM) . Help from every community member is very valuable to make the tool better for everyone. Checkout the Lab Page.
Python 3.6.6
bedtools 2.26.0
pandas 0.24.2
numpy 1.16.2
tensorflow 1.11.0
pyfaidx 0.5.5.2
pyfasta 0.5.2
pybedtools 0.8.0
pyBigWig 0.3.14
Keras 2.2.4
h5py 2.9.0
deepTools 3.2.1
Optional(for training with GPUs):
cuda 10
tensorflow-gpu 1.11.0
Train models for TF binding site prediction:
-h, --help
show this help message and exit--data_dir DATA_DIR, -i DATA_DIR
path to the input data (required)--tf_name TF_NAME, -t TF_NAME
name of the transcription factor (required)--output_dir OUTPUT_DIR, -o OUTPUT_DIR
output path (required)--genome_fasta_file GENOME_FASTA_FILE, -gf GENOME_FASTA_FILE
genome fasta file (required)--val_chr VAL_CHR, -v VAL_CHR
name for validation chromosome (default: chr11)--bigwig_file_unique35 BIGWIG_FILE_UNIQUE35, -u BIGWIG_FILE_UNIQUE35
35bp uniqueness file, will not use this feature if left empty (default: '')--rnaseq_data_file RNASEQ_DATA_FILE, -r RNASEQ_DATA_FILE
RNA-Seq PCA data file, will not use this feature if left empty (default: '')--gencode_file GENCODE_FILE, -g GENCODE_FILE
Genomic annotation file, will not use this feature if left empty (default: '')--attention_position ATTENTION_POSITION, -ap ATTENTION_POSITION
Position of attention layers, can be attention_after_lstm, attention_before_lstm,attention_after_lstm,attention1d_after_lstm (default: 'attention_after_lstm')--flanking FLANKING, -f FLANKING
flanking length (default: 401)--epochs EPOCHS, -e EPOCHS
epochs (default: 60)--patience PATIENCE, -p PATIENCE
training will stop early if no improvements after n epochs (default: 5)--batch_size BATCH_SIZE, -s BATCH_SIZE
batch size (default: 64)--learningrate LEARNINGRATE, -l LEARNINGRATE
learningrate (default: 0.001)--kernel_size KERNEL_SIZE, -k KERNEL_SIZE
kernel size for Conv1D (default: 34)--num_filters NUM_FILTERS, -nf NUM_FILTERS
number of filters for Conv1D (default: 64)--num_recurrent NUM_RECURRENT, -nr NUM_RECURRENT
Output dim for LSTM (default: 32)--num_dense NUM_DENSE, -nd NUM_DENSE
Output dim for dense layers (default: 64)--dropout_rate DROPOUT_RATE, -d DROPOUT_RATE
dropout rate for all layers except LSTM (default: 0.1)--rnn_dropout1 RNN_DROPOUT1, -rd1 RNN_DROPOUT1
dropout rate for LSTM (default: 0.1)--rnn_dropout2 RNN_DROPOUT2, -rd2 RNN_DROPOUT2
RNN dropout rate for LSTM (default: 0.1)--merge MERGE, -me MERGE
merge method, max or ave (default: ave)--num_conv NUM_CONV, -nc NUM_CONV
Number of Conv1D layers (default: 1)--num_lstm NUM_LSTM, -nl NUM_LSTM
Number of LSTM layers (default: 1)--num_denselayer NUM_DENSELAYER, -dl NUM_DENSELAYER
Number of additional dense layers (default: 1)--ratio_negative RATIO_NEGATIVE, -rn RATIO_NEGATIVE
Ratio of negative samples to positive samples in each epoch (default: 1)--use_peak, -a
should the positive bins sampled from peak regions? (default: OFF)--use_cudnn, -c
use cudnnLSTM instead of LSTM, faster but will disable LSTM dropouts (default: OFF)--single_attention_vector, -sa
merge attention weights in each position by averaging (default: OFF)--positive_weight POSITIVE_WEIGHT, -pw POSITIVE_WEIGHT
weight for positive samples (default: 1)--plot_model, -pl
if the model architecture should be plotted (default: OFF)--random_seed RANDOM_SEED, -rs RANDOM_SEED
random seed (default: 0)--val_negative_ratio VAL_NEGATIVE_RATIO, -vn VAL_NEGATIVE_RATIO
ratio for negative samples in validation (default: 19)
Use models trained from train.py for new data prediction:
-h, --help
show this help message and exit--data_dir DATA_DIR, -i DATA_DIR
path to the input data (required)--model_file MODEL_FILE, -m MODEL_FILE
path to model file (required)--cell_name CELL_NAME, -c CELL_NAME
cell name (required)--predict_region_file PREDICT_REGION_FILE, -p PREDICT_REGION_FILE
predict region file (required)--output_predict_path OUTPUT_PREDICT_PATH, -o OUTPUT_PREDICT_PATH
output path of prediction (required)--bigwig_file_unique35 BIGWIG_FILE_UNIQUE35, -bf BIGWIG_FILE_UNIQUE35
35bp uniqueness file (default: '')--rnaseq_data_file RNASEQ_DATA_FILE, -rf RNASEQ_DATA_FILE
RNA-Seq PCA data file (default: '')--gencode_file GENCODE_FILE, -gc GENCODE_FILE
Genomic annotation file (default: '')--batch_size BATCH_SIZE, -b BATCH_SIZE
batch size (default: 512)--blacklist_file BLACKLIST_FILE, -l BLACKLIST_FILE
blacklist_file to use, no fitering if not provided (default: '')
To generate the figures that we use in our experiment, please refer to these instructions to extract data from trained models and create the plot you are interested in.
The following sections are for users who wish to generate their own training/prediction dataset. If you are interested in the DREAM-ENCODE Challenge 2016 data that we use in our experiment, we have prepared the step by step guideline to generate the input for training and prediction.
Genomic sequence is provided as fasta format. You can download these files from UCSC Genome Browser Please notice that if your validation chromosome name should be one of the chromosome names in this file.
Chromatin accessibility data should be prepared as BigWig format from DNase-Seq experiment. You should name your files as {cell_type}.1x.bw and put them into your/data/folder/DNase/ That is to say, one file for each cell type. These files should be generated from the read alignment files in the standard BAM file format. If you have replicates for the same cell type, you should first merge them with samtools:
samtools merge {cell_type}.bam {cell_type}.rep1.bam {cell_type}.rep2.bam
samtools index {cell_type}.bam
Then you can run the bamCoverage in deeptools to generate the required .bw file for each cell type. Here we use human genome as example:
bamCoverage --bam ${i}.bam -o ${i}.1x.bw --outFileFormat bigwig --normalizeTo1x 2478297382 --ignoreForNormalization chrX chrM --Offset 1 --binSize 1 --blackListFileName blacklist.bed.gz --skipNonCoveredRegions
You can provide a BED or GTF file containing regions that should be excluded from bamCoverage analyses with the --blackListFileName
option. For human genome hg19, we use the low mapability region provided by UCSC Genome Browser
According to UCSC Genome Browser. We use the Duke uniqueness score as an additional input. The Duke excluded regions track displays genomic regions for which mapped sequence tags were filtered out before signal generation and peak calling for Duke/UNC/UTA's Open Chromatin tracks. This track contains problematic regions for short sequence tag signal detection (such as satellites and rRNA genes). The Duke excluded regions track was generated for the ENCODE project.
The Duke uniqueness score tracks display how unique is each sequence on the positive strand starting at a particular base and of a particular length. Thus, the 35 bp track reflects the uniqueness of all 35 base sequences with the score being assigned to the first base of the sequence. Scores are normalized to between 0 and 1 with 1 representing a completely unique sequence and 0 representing the sequence occurs >4 times in the genome (excluding chrN_random and alternative haplotypes). A score of 0.5 indicates the sequence occurs exactly twice, likewise 0.33 for three times and 0.25 for four times. For Human genome, you can use the Duke uniqueness tracks generated for the ENCODE project as tools in the development of the Open Chromatin tracks. For other genomes, you can generate your own sequence uniqueness data following this rule and use wigToBigWig to convert it to the BigWig format.
Follwing the FactorNet framework, we use the first eight principal components from the RNA-Seq experiments for each cell type. These principal components should be generated using the TPM(Transcripts Per Kilobase Million) with replicates merged by averging. You should save this input in your/data/folder/rnaseq_data.csv as a Comma Separated Values file with the first line indicate the type of the cells.
We provided a simple R script data/generate_pca.R to generate this data from TPM data. The input of this script should be a csv file containing a gene by cell type matrix with first row indicate the types of cell and the first column indicate the gene names. Usage:
Rscript generate_pca.R [path/to/input.csv] [path/to/rnaseq_data.csv]
The annotation feature for each bin is encoded as a binary vector of length 6, with each value representing if there is an overlap between the input bin and each of the six genomic features (coding regions, intron, promoter, 5'/3'-UTR, and CpG island). For Human genome Hg19, we use annotations from the FactorNet GitHub repo to generate the required input and save them in example/hg19. For preparing this input for your own genome, you can use our python script: data/generate_annotation.py Usage:
Python generate_annotation.py [gencode_path] [genome_sizes_file] [bed_file] [outfile]
You should prepare your own (gzipped) bed files indicting the genomic features (coding regions, intron, promoter, 5'/3'-UTR, and CpG island) and save them as [cpgisland|cds|intron|promoter|utr5|utr3].bed.gz under the gencode_path
.
genome_sizes_file
is the file containing size of each chromosome. For Hg19 the content is:
chr1 249250621
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr19 59128983
chr2 243199373
chr20 63025520
chr21 48129895
chr22 51304566
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chr8 146364022
chr9 141213431
chrX 155270560
bed_file
is the file indicating the target region and should be the same as the one you use for the prediction script.
In additional to the input feature information mentioned in Prepare input features for training and prediction, you need to prepare the label information for training.
If you need to train your own models with custom data, instead of preparing a file indicating the target region, you should prepare a label file indicating the true labels of the binding status along the chromosome. In the DREAM-ENCODE challange, each bin falls into one of the three types: bound(B), unbound(U), or ambiguous(A), which is determined from the ChIP-Seq results. Bins overlapping with peaks and passing the Irreproducible Discovery Rate (IDR) check with a threshold of 5% are labeled as bound. Bins that overlap with peaks but fail to pass the reproducibility threshold are labeled as ambiguous. All other bins are labeled as unbound. We do not use any ambiguous bins during the training or validation process according to the common practice. Therefore, each bin in the genomic sequence will either be a positive site (bounded) or a negative site (unbounded). Example:
chr start stop A549 H1-hESC HeLa-S3 HepG2 IMR-90 K562 MCF-7
chr10 600 800 U U U U U U U
chr10 650 850 U U U U U U U
chr10 700 900 U U U U U U U
chr10 750 950 U U U U U U U
chr10 800 1000 U U U U U U U
For each TF, you will need one such label file. You should save these (gzipped) label files under data_dir/label/train/ and name them as [TF_name].train.labels.tsv.gz
For TFs that have abundant binding sites, training performance and speed could benefit from sampling positive sample regions from the ChIP-Seq peak regions during each epoch. To do so, you would need to prepare a peak annotation file. We have provided the peak annotations as example/hg19/peak_annotation.gz for the TF and cells we used. For your own customized input, you can use data/generate_peaks.py to generate those peak annotations. Usage:
Python generate_peaks.py [gencode_path] [genome_sizes_file] [bed_file] [outfile] [narrowPeak_path] [blacklist_file] [label_path] [label_peak_path] [genome_window_size] [flanking]
[gencode_path] [genome_sizes_file] [bed_file] [outfile] are the same as described in the "Genomic annotations" section.
narrowPeak_path is the path you store the ChIP-Seq narrowPeak files. In our experiment, these peaks are the reproducible peaks across pseudo-replicates that pass the 10% IDR threshold. The peaks are provided in the narrowPeak format. For each cell type and TF pair, the file should be named as ChIPseq.[train_cell].[tf_name].conservative.train.narrowPeak.gz'
blacklist_file
is the file you want to exclude during analysis and can be empty if you do not wish to miss any regions
label_path
is where the label files are stored. They should be located in data_dir/label/train/
label_peak_path
is where the output annotation files are stored. They should be located in data_dir/label/train_positive/
genome_window_size
is the size of each bin in your label files. In our experiment we set it to 200.
flanking
is the length of upstream and downstream region that around your training sample. In our experiment we set it to 401.
In additional to the input feature information mentioned in Prepare input features for training and prediction, you need to specify your target region for prediction. The region to predict should be a bed format file indicate the region you wish to predict along the chromosome. If you use our trained model, we recommend you consider use 200bp for each prediction since the models are trained using this format. Example:
chr1 600 800
chr1 650 850
chr1 700 900
chr1 750 950
chr1 800 1000