This is the reimplementation of Yuanfang's winning algorithm in the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge
background: ENCODE-DREAM
see also: Yuanfang Guan's 1st Place Solution and Original Code
Please contact (gyuanfan@umich.edu or hyangl@umich.edu) if you have any questions or suggestions.
ANCHOR has 4 major sections and 15 major steps.
-
- Prepare DNase-seq features (35 hours per cell line)
-
- Prepare DNA sequence and TF motif features (75 hours per TF)
-
- Prepare gene location features (8 hours; you only need to run this once)
-
- Make predictions ('fast' mode 9 hours or 'full' mode 36 hours per TF-cell line pair; it needs input features from section 1-3)
If you want to run the 4 major sections separately (e.g. in parallel), you can follow the examples below.
If you want to learn more about the details and run steps invidually, see below: Preprocessing Details
In this section, we preprocess DNase-seq data to perform quantile normalization across multiple cell lines and generate multiple features. For example,
python ANCHOR_dnase.py -cell H1-hESC
You can also modify the input and output directory:
python ANCHOR_dnase.py -i1 DIR_OF_BAM -i2 DIR_OF_BIGWIG -cell H1-hESC -o DIR_OF_FEATURES
For more information, run (python ANCHOR_dnase.py -h):
ANCHOR_dnase - preprocessing DNase-seq data
optional arguments:
-h, --help show this help message and exit
-i1 INPUT1, --input1 INPUT1
Directory of the input DNase-seq read alignment BAM files (default: ./data/dnase_aln/)
-i2 INPUT2, --input2 INPUT2
Directory of the input fold-enrichement signal coverage BigWig files (default: ./data/dnase_fold_coverage/)
-cell CELL_LINE, --cell_line CELL_LINE
The cell line name of the DNase-seq data
-o OUTPUT, --output OUTPUT
Directory of the output files (default: ./data/)
In this section, we scan TF motifs across human DNA geonome sequence to generate multiple features. For example,
python ANCHOR_sequence.py -tf TAF1
You can also modify the input and output directory:
python ANCHOR_sequence.py -tf TAF1 -motif DIR_OF_MOTIF -seq DIR_OF_SEQUENCE -o DIR_OF_FEATURES
For more information, run (python ANCHOR_sequence.py -h):
ANCHOR_sequence - prepare DNA sequence based features
optional arguments:
-h, --help show this help message and exit
-tf TF [TF ...], --tf TF [TF ...]
TFs under consideration (e.g. TAF1 MAX
-motif MOTIF_DIR, --motif_dir MOTIF_DIR
Directory of the sequence motifs (e.g.: ./data/motif/TAF1)
-seq SEQ_DIR, --seq_dir SEQ_DIR
Directory of the human genome sequences (e.g.: ./data/hg_genome/chr1)
-o OUTPUT, --output OUTPUT
Directory of the output files (default: ./data/)
In this section, we calculate the top 20 closest distances to a gene for each genomic coordinates (e.g. chr1 600 800). For example,
python ANCHOR_gencode.py
You can also modify the output file:
python ANCHOR_gencode.py -o DIR_OF_FEATURES
For more information, run (python ANCHOR_gencode.py -h):
ANCHOR_gencode - preprocessing the gencode data
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
The output file (default: ./data/top_20)
In this section, we make predictions using all the features generated in sections 1-3. First, unzip the pre-trained xgboost models:
cd model/anchor/
unzip F.zip
unzip G.zip
unzip H.zip
unzip I.zip
cd ../../
Then you can run ANCHOR predictions by:
python ANCHOR_prediction.py -tf TAF1 -cell H1-hESC
There are two modes (about the sequence feature):
(1) in the "fast" mode, only query TF's sequence features are used
(2) in the "full" mode, multiple TF's sequence features are used
The "full" mode has better performance but requires much more time. You can active full mode by:
python ANCHOR_prediction.py -tf TAF1 -cell H1-hESC -m full
For more information, run (python ANCHOR_prediction.py -h):
ANCHOR - prediction
optional arguments:
-h, --help show this help message and exit
-tf TF, --tf TF The TF name (e.g. TAF1)
-cell CELL_LINE, --cell_line CELL_LINE
The cell line name (e.g. H1-hESC)
-i INPUT_DIR, --input_dir INPUT_DIR
Directory of the features (default: ./data/)
-m MODE, --mode MODE The prediction mode (fast: only use single TF sequence motif features; full: use all 13 TF sequence motif features (warning: need more prediction time!)
The final predictions are saved here:
./prediction/anchor/final/
ANCHOR preprocessing has 15 major steps and requires a lot of time and space, since the original input data themselves are very large.
For experienced users, the code of each step is available and you may run them individually without the ANCHOR_XXX.py interface.
The description and runtime of each step is listed below:
step1. convert bam to txt file; require samtools; # 15-30 minutes perl cell line
perl ./preprocess/dnase/transform_bam_to_track.pl ./data/dnase_aln/ ./data/dnase_track/
step2: sum raw reads from all tech/bio replicates and separate them into 23 chromsomes; 15-20 hours perl cell line
perl ccreate_DNase_avg_track_by_chr.pl H1-hESC ./data/dnase_track/ ./data/dnase_track_avg/
step3: subsample 1/1000 and rank reads of all chrs for the multi-cell quantile normalization; 30-60 minutes perl cell line
perl ./preprocess/dnase/approximate_max_min_median.pl H1-hESC ./data/dnase_track_avg/ ./data/dnase_sort/
step4: resize to the same length as the reference; require cv2.resize; 1-2 minutes perl cell line
python ./preprocess/dnase/resize_dnase_sort.py ./data/dnase_sort/ ./data/dnase_sort_resize/
step5: quantile normalization to the reference cell line; 100-150 minutes perl cell line
the outputs of step2 and step4 are needed; a reference cell line from step4 is needed; by default, we used liver.txt with 3036304 lines as the reference cell line
perl ./preprocess/dnase/normalize_by_anchor.pl H1-hESC ./data/dnase_track_avg/ ./data/dnase_sort_resize/ ./data/dnase_track_avg_anchor/ ./data/ref/liver.txt
step6.1: generate the mean, max, min features; 150 minutes perl cell line
a reference file the genomic coordinates excluding blacklisted regions is needed; this file looks like this:
chr1 600 800
chr1 650 850
chr1 700 900
the blacklist can downloaded from here: https://sites.google.com/site/anshulkundaje/projects/blacklists
perl ./preprocess/dnase/create_DNase_mmm.pl H1-hESC ./data/dnase_track_avg_anchor/ ./data/anchor_bam_dnase/ ./data/anchor_bam_dnase_max_min/ ./data/ref/test_regions.blacklistfiltered.bed
step6.2: generate the delta- mean feature; 40 minutes for the DNase-seq data of 13 cell lines
prepare the anchor_bam_dnase from multiple cell line first - the step caculate the differences across cell lines
perl ./preprocess/dnase/create_DNase_diff.pl ./data/anchor_bam_dnase/ ./data/anchor_bam_dnase_diff/
step6.3: generate the delta- max, min features; 100 minutes for the DNase-seq data of 13 cell lines
prepare the anchor_bam_dnase_max_min from multiple cell line first - the step caculate the differences across cell lines
perl ./preprocess/dnase/create_DNase_max_min_diff.pl ./data/anchor_bam_dnase_max_min/ ./data/anchor_bam_dnase_max_min_diff/
step6.4: generate the 15 neighboring mean features; 30 minutes per cell line
perl ./preprocess/dnase/create_DNase_largespace.pl ./data/anchor_bam_dnase/ ./data/anchor_bam_dnase_largespace/
step6.5: generate the 30 neighboring max and min features; 60 minutes per cell line
perl ./preprocess/dnase/create_DNase_max_min_largespace.pl ./data/anchor_bam_dnase_max_min/ ./data/anchor_bam_dnase_max_min_largespace/
step6.6: generate the 15 neighboring delta-mean features; 30 minutes per cell line
perl ./preprocess/dnase/create_DNase_largespace.pl ./data/anchor_bam_dnase_diff/ ./data/anchor_bam_dnase_diff_largespace/
step6.7: generate the 30 neighboring delta-max and delta-min features; 60 minutes per cell line
perl ./preprocess/dnase/create_DNase_max_min_largespace.pl ./data/anchor_bam_dnase_max_min_diff/ ./data/anchor_bam_dnase_max_min_diff_largespace/
from here fold-enrichement frequency feature 🍉
step7: convert fold-enrichment bigWig to txt; 2 minutes per cell line
perl ./preprocess/dnase/transform_bigwig_to_txt.pl ./data/dnase_fold_coverage/ ./data/dnase_fold_coverage_big/
step8: count the number of signal occurance and ignore values to generate frequency features # 60 minutes per cell line
the reference file of the genomic coordinates excluding blacklisted regions is needed
perl ./preprocess/dnase/produce_frequency.pl H1-hESC ./data/dnase_fold_coverage_big/ ./data/frequency/ ./data/ref/test_regions.blacklistfiltered.bed
step9.1: generate the frequency feature by normalizing the counting number by ranking # 2 minutes per cell line
perl ./preprocess/dnase/produce_frequency_rank.pl ./data/frequency/ ./data/frequency_rank/
step9.2: generate the delta-frequency feature; 50 minutes for the DNase-seq data of 13 cell lines
perl ./preprocess/dnase/create_frequency_diff.pl ./data/frequency_rank/ ./data/frequency_rank_diff/
step9.3: generate the 15 neighboring frequency features; 50 minutes per cell line
perl ./preprocess/dnase/create_frequency_largespace.pl ./data/frequency_rank/ ./data/frequency_rank_largespace/
step9.4: generate the 15 neighboring delta- frequency features; 50 minutes per cell line
perl ./preprocess/dnase/create_frequency_largespace.pl ./data/frequency_rank_diff/ ./data/frequency_rank_diff_largespace/
step10.1: scan TF-motif PWMs against DNA sequences to obtain score for each position; 60-120 minutes per TF per chromosome (23 chrs!)
python ./preprocess/sequence/scan_motif_by_chr.py chr13 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_forward/
step10.2: select top 3 scores over forward 200bp sequences; 150 minutes per TF
perl ./preprocess/sequence/create_motif_top3_ru_forward.pl TAF1 ./data/seq_forward/ ./data/seq_forward_top3/ ./data/ref/test_regions.blacklistfiltered.bed
step11.1: scan TF-motif PWMs against reverse-complement DNA sequences to obtain score for each position; 60-120 minutes per TF per chromosome (23 chrs!)
python ./preprocess/sequence/scan_motif_by_chr_reverse.py chr13 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_reverse/
step11.2: select top 3 scores over reverse 200bp sequences; 150 minutes per TF
perl ./preprocess/sequence/create_motif_top3_ru_forward.pl TAF1 ./data/seq_reverse/ ./data/seq_reverse_top3/ ./data/ref/test_regions.blacklistfiltered.bed
step12: select the top 4 out of the 6 scores; 10 minues per TF; 10 minutes per TF
perl ./preprocess/sequence/find_max_forward_reverse_top_4.pl ./data/seq_forward_top3/ ./data/seq_reverse_top3/ ./data/seq_top4/
step13: rank the features to 0-1; 10 minutes per TF
perl ./preprocess/sequence/create_top4_rank_ru.pl ./data/seq_top4/ ./data/seq_top4_rank/
step14: generate the neighboring ranked features; 80 minutes per TF
perl ./preprocess/sequence/create_largespace_tomax_top4_rank.pl ./data/seq_top4_rank/ ./data/seq_top4_rank_largespace/
step15: for each genomic coordinate, calculate the 20 closest distances to a gene; 8 hours
(1) the gencode (e.g. gencode.v19.annotation.gtf) is required (downloaded from: https://www.gencodegenes.org/releases/19.html)
(2) the reference file the genomic coordinates excluding blacklisted regions is needed
(3) it returns a file 'top_20' - the top 20 closes distances to a gene of each coordinate under consideration
perl ./preprocess/gencode/create_gene_distance_top20_uniq.pl ./data/top_20 ./data/ref/gencode.v19.annotation.gtf ./data/ref/test_regions.blacklistfiltered.bed
Here is the pipeline for model training using the F model (single-motif model) as an example. To train other G/H/I models, simply run the corresponding "xxx_G.pl" scripts in step 3 and 4.
step0. prepare ChIP-seq labels (e.g. E2F1.train.labels.tsv) for model training and put them in ./data/chipseq/
The format of ChIP-seq labels (U/B stands for Unbound/Bound) is tab-separated as follows:
chr | start | stop | GM12878 | HeLa-S3 |
---|---|---|---|---|
chr10 | 600 | 800 | U | U |
chr10 | 650 | 850 | B | B |
chr10 | 700 | 900 | B | U |
chr10 | 750 | 950 | U | U |
step1. subsample labels (keep all Bound intervals and subsample the Unbound intervals to balance the data)
Of note, the default subsampling rate is 1/300, roughly corresponding to the percentage of Bound intervals. You can change the rate at line 41 "$rand=rand(300);" based on your data.
perl ./train/create_sample.pl E2F1
step2. prepare the corresponding features based on the subsampled labels
perl ./train/prepare_data_F.pl E2F1
step3. model training (models of 1000 training iterations and their corresponding AUPRC scores are saved in ./model/F_all/)
cd ./train/F
perl train_model_F.pl E2F1
cd ../../
step4. select the best model from the 1000 iterations based on AUPRC; final models are saved in ./model/F/
perl ./train/find_best_model_F.pl