Supervised peak detection benchmark data set
This repo contains code for creating a benchmark data set for predicting peaks in epigenomic data. It has been submitted to the UCI Machine Learning Repository and is now online as the chipseq data set.
From a machine learning perspective, the main interest/challenge is designing a loss function and algorithm that exploits the structure of the weak labels. For more details about the weak labels please read the Relevant Information below.
```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00439/peak-detection-data.tar.xz
tar xvf peak-detection-data.tar.xz
```

Download peak-detection-data.tar.xz, which is a 35GB xz-compressed tar file. After downloading and un-tar-ing it (perhaps using the commands above), you should have a data directory.
ChIP-seq experiments characterize protein modifications or binding at specific genomic locations in specific samples. The machine learning problem in these data is structured binary classification / changepoint detection.
Toby Dylan Hocking (toby.hocking@mail.mcgill.ca), McGill University
There are 4960 problem directories in total; the unique genomic problems range in size from about 220 kilobases to 116 megabases:

```r
> length(Sys.glob("data/*/samples/*/*/problems/*"))
[1] 4960
> data.table(namedCapture::str_match_named(unique(basename(Sys.glob("data/*/samples/*/*/problems/*"))), "(?<chrom>chr[^:]+):(?<chromStart>[0-9]+)-(?<chromEnd>[0-9]+)", list(chromStart=as.integer, chromEnd=as.integer)))[, quantile(chromEnd-chromStart)]
       0%       25%       50%       75%      100%
   220313   9057730  26144352  48590557 115591997
```
These data are significant because they are among the first to provide labels that formalize the genome-wide peak detection problem, which is a very important problem for biomedical / epigenomics researchers. These labels can be used to train and test supervised peak detection algorithms, as explained below.
The data are in problem directories such as
`data/<SET>/samples/<GROUP>/<SAMPLE>/problems/<PROBLEM>`
Each problem directory contains two files, labels.bed (weak labels) and coverage.bedGraph.gz (inputs).
Each coverage.bedGraph.gz file represents a vector of non-negative integer count data, one entry for each genomic position in a subset of the human genome hg19. For example
data/H3K9me3_TDH_BP/samples/tcell/ERS358697/problems/chr8:48135599-86500000/coverage.bedGraph.gz
represents a vector defined on all genomic positions from 48135600 to 86500000 on chr8 (for a particular tcell sample named ERS358697, in the H3K9me3_TDH_BP data set). To save disk space the vectors are saved using a run-length encoding; for example the first three lines of this file are
```
chr8	48135599	48135625	0
chr8	48135625	48135629	1
chr8	48135629	48135632	2
```
which means that the first 26 entries of the vector are 0, the next four entries are 1, and the following three entries are 2. Note that start positions are 0-based but end positions are 1-based, so the first line means the count is 0 at all positions from 48135600 to 48135625 (excluding the start position 48135599, for which we have no information).
The goal is to learn a function that takes the coverage.bedGraph.gz file as input, and outputs a binary classification for every genomic position. The positive class represents peaks (typically large counts) and the negative class represents background noise (typically small counts).
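For illustration, here is a minimal sketch of an unsupervised baseline that decodes the run-length encoding and predicts a peak wherever coverage exceeds a fixed threshold (the threshold value of 10 is an arbitrary assumption, not part of the benchmark):

```r
library(data.table)
## read one run-length encoded coverage file (fread reads .gz
## directly if the R.utils package is installed).
cov <- fread(
  "data/H3K9me3_TDH_BP/samples/tcell/ERS358697/problems/chr8:48135599-86500000/coverage.bedGraph.gz",
  col.names=c("chrom", "chromStart", "chromEnd", "count"))
## naive classification rule: peak wherever count > threshold.
cov[, peak := as.integer(count > 10)]
## convert back to run-length encoding by merging adjacent runs
## with the same predicted class.
cov[, run := cumsum(c(1, diff(peak) != 0))]
pred.peaks <- cov[, .(
  chromStart=chromStart[1], chromEnd=chromEnd[.N], peak=peak[1]
), by=run][peak == 1]
```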
Weak labels are given in labels.bed files, each of which indicates several regions of the genome with or without peaks. For example the file
data/H3K4me3_XJ_immune/samples/bcell/McGill0091/problems/chr1:30028082-103863906/labels.bed
contains the 6 labels below:
```
chr1	33111786	33114894	noPeaks
chr1	33114941	33116174	peakStart
chr1	33116183	33116620	peakEnd
chr1	33116633	33116755	noPeaks
chr1	33116834	33118135	peaks
chr1	33118161	33120163	noPeaks
```
The four labels are interpreted as follows:
- noPeaks: all of the predictions in this region should be negative / background noise. For example, the first line in the file above means that for a vector x_i of count data from i=30028083 to i=103863906, the desired function should predict negative / background noise, f(x_i)=0, from i=33111787 to i=33114894. If positive / peaks are predicted, f(x_i)=1, for any i in this region, that is counted as a false positive label.
- peakStart: there should be exactly one peak start predicted in this region. A peak start is defined as a position i such that a peak is predicted there, f(x_i)=1, but not at the previous position, f(x_{i-1})=0. The exact position is unspecified; any position is fine, as long as there is only one start in the region. Predicting exactly one peak start in this region results in a true positive (and true negative). More starts is a false positive, and fewer starts is a false negative. For example,
```
[ peakStart ]
0 0 0 1 1 1 1 -> correct
0 0 1 1 1 1 1 -> also correct
0 0 0 0 0 0 0 -> false negative (no peak starts)
0 0 1 0 1 1 1 -> false positive (two peak starts)
```
- peakEnd: there should be exactly one peak end predicted in this region. A peak end is defined as a position i such that a peak is predicted there, f(x_i)=1, but not at the next position, f(x_{i+1})=0. The exact position is unspecified; any position is fine, as long as there is only one end in the region. Predicting exactly one peak end in this region results in a true positive (and true negative). More ends is a false positive, and fewer ends is a false negative. For example,
```
[  peakEnd  ]
1 1 1 1 0 0 0 -> correct
1 1 1 1 1 0 0 -> also correct
0 0 0 0 0 0 0 -> false negative (no peak ends)
1 1 1 0 1 0 0 -> false positive (two peak ends)
```
- peaks: there should be at least one peak predicted somewhere in this region (anywhere is fine). Zero predicted peaks in this region is a false negative; at least one predicted peak somewhere in this region is a true positive.
For a particular set of predicted peaks f(x), the total number of incorrect labels (false positives + false negatives) can be computed as an evaluation metric (smaller is better). Typically the peak predictions are also stored using a run-length encoding; the error rates can be computed using the reference implementation in R package PeakError, https://github.com/tdhock/PeakError
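A hedged usage sketch of that reference implementation, assuming PeakErrorChrom() accepts data.frames of predicted peaks and labels with the columns shown (the installed package's exact interface may differ):

```r
library(PeakError)  # install from https://github.com/tdhock/PeakError
labels <- read.table(
  "data/H3K4me3_XJ_immune/samples/bcell/McGill0091/problems/chr1:30028082-103863906/labels.bed",
  col.names=c("chrom", "chromStart", "chromEnd", "annotation"))
## pred.peaks: a data.frame of predicted peaks with chromStart and
## chromEnd columns, e.g. from the thresholding sketch above.
label.errors <- PeakErrorChrom(pred.peaks, labels)
sum(label.errors$fp) + sum(label.errors$fn)  # total incorrect labels
```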
Receiver Operating Characteristic curves can be computed for a family of predicted peaks f_lambda(x), where lambda is some significance threshold, intercept parameter, etc. Compute the TPR and FPR as follows:
```
TPR = (total number of true positives)
      / (total number of labels that could have a true positive)
    = (number of peaks labels with at least one overlapping predicted peak
       + number of peakStart/peakEnd labels with at least one predicted start/end)
      / (number of peaks, peakStart, and peakEnd labels)

FPR = (total number of false positives)
      / (total number of labels that could have a false positive)
    = (number of peakStart/peakEnd labels with two or more predicted starts/ends
       + number of noPeaks labels with at least one overlapping predicted peak)
      / (number of peakStart, peakEnd, and noPeaks labels)
```
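A minimal sketch of that computation, assuming a hypothetical data.frame label.results with one row per label and columns annotation (noPeaks / peakStart / peakEnd / peaks), tp, and fp (0/1 indicators computed as described above):

```r
positive.types <- c("peaks", "peakStart", "peakEnd")
negative.types <- c("noPeaks", "peakStart", "peakEnd")
TPR <- with(label.results, {
  possible <- annotation %in% positive.types  # labels that could have a TP
  sum(tp[possible]) / sum(possible)
})
FPR <- with(label.results, {
  possible <- annotation %in% negative.types  # labels that could have a FP
  sum(fp[possible]) / sum(possible)
})
```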
Suggested fold ID numbers for four-fold cross-validation experiments can be found in data/*/folds.csv files. For example data/H3K36me3_TDH_other/folds.csv contains
```
problem,fold
chr16:8686921-32000000,1
chr16:60000-8636921,1
chr21:43005559-44632664,2
chr14:19050000-107289540,3
chr15:29209443-77800000,4
```
which means that problems chr16:8686921-32000000 and chr16:60000-8636921 should be considered fold ID 1, chr21:43005559-44632664 should be considered fold ID 2, etc. This means that for data set H3K36me3_TDH_other, the fold ID 2 consists of all data in data/H3K36me3_TDH_other/samples/*/*/problems/chr21:43005559-44632664 directories.
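A minimal sketch of assembling the train/test problem directories for one data set and one test fold (the data set name and fold ID are arbitrary examples):

```r
library(data.table)
set.dir <- "data/H3K36me3_TDH_other"
folds <- fread(file.path(set.dir, "folds.csv"))  # columns: problem,fold
test.fold <- 2
test.problems <- folds[fold == test.fold, problem]
## every sample has one directory per genomic problem.
prob.dirs <- Sys.glob(file.path(set.dir, "samples/*/*/problems/*"))
is.test <- basename(prob.dirs) %in% test.problems
train.dirs <- prob.dirs[!is.test]
test.dirs <- prob.dirs[is.test]
```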
There are several types of learning settings that could be used with these data. Here are four examples.
- Unsupervised learning. Train models only using the coverage.bedGraph.gz files. Only use the labels for evaluation (not for training model parameters).
- Supervised learning. Train models only using the coverage.bedGraph.gz and labels.bed files in the train set. Use the labels in the test set to evaluate prediction accuracy.
- Semi-supervised learning. Train models using the coverage.bedGraph.gz and labels.bed files in the train set. You can additionally use the coverage.bedGraph.gz files in the test set at training time. Use the labels in the test set to evaluate prediction accuracy.
- Multi-task learning. Many data sets come from different experiment types, and so have different peak patterns. For example H3K4me3_TDH_immune is an H3K4me3 histone modification (sharp peak pattern) and H3K36me3_TDH_immune is an H3K36me3 histone modification (broad peak pattern). Therefore models are not expected to generalize between data sets. However there is something common across data sets: in each data set, the peak / positive class is large values, whereas the noise / negative class is small values. Therefore multi-task learning may be interesting. To compare a multi-task learning model to a single-task learning model, use the suggested cross-validation fold IDs. For test fold ID 1, train both the multi-task and single-task learning models using all other folds, then make predictions on all data with fold ID 1. A sketch of one way to encode tasks follows this list.
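As promised above, a minimal sketch of the indicator-variable encoding idea (the feature values below are made up for illustration; labeled_problems_pred_MultiTaskIRCV.R, described below, uses this kind of construction):

```r
## one row per labeled problem, from two tasks / data sets.
features <- data.frame(
  log.bases=c(16.2, 17.5, 15.1),
  task=c("H3K4me3_TDH_immune", "H3K36me3_TDH_immune", "H3K4me3_TDH_immune"))
## one 0/1 column per task, so a linear model can learn a
## per-task offset in addition to shared feature weights.
task.indicators <- model.matrix(~ task - 1, features)
X <- cbind(log.bases=features$log.bases, task.indicators)
```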
Each attribute is a non-negative integer representing the number of DNA sequence reads that have aligned to that particular region of the genome. Larger values are more likely to be peaks / positive, smaller values are more likely to be noise / negative.
The labeling method and details on how to compute the number of incorrect labels are described in:
Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning. Toby Dylan Hocking, Patricia Goerner-Potvin, Andreanne Morin, Xiaojian Shao, Tomi Pastinen, Guillaume Bourque. Bioinformatics, Volume 33, Issue 4, 15 February 2017, Pages 491–499, https://doi.org/10.1093/bioinformatics/btw672
Please cite the Bioinformatics paper above.
The current state of the art for this type of problem is constrained optimal changepoint detection with learned penalty functions, as described in our ICML'15 paper. The changepoint detection method described in that paper is the Constrained Dynamic Programming Algorithm, which takes quadratic time and is therefore too slow for these large data sets (the largest coverage.bedGraph.gz file has 11,499,958 lines). A faster alternative is PeakSegPipeline::PeakSegFPOP_disk, which implements the log-linear time algorithm described in arXiv:1703.03352 (it computes the most likely peak positions for a given penalty parameter).

Typical unsupervised methods for choosing the penalty parameter (which do not use the labels) are theoretically-motivated penalties (AIC/BIC) or cross-validation. The current methods for learning the penalty function use pre-defined features as inputs (number of data points, mean, variance estimates, quantiles, etc.) and the predicted penalty as the output. This data set may be useful for exploring feature learning methods such as deep convolutional neural networks, which could relax the assumption of pre-defined features.

The interest of this data set is that it has a large number of labels (e.g. 15961 labels in the H3K27ac-H3K4me3_TDHAM_BP data set), which is an order of magnitude more than other benchmark data sets of this type (e.g. 3418 labels in the neuroblastoma data set). For more info about supervised changepoint detection see my useR2017 tutorial.
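For reference, a minimal sketch of computing a few such pre-defined features from one coverage file (the exact features in labeled_problems_features.csv, described below, may differ; the file path is a placeholder):

```r
library(data.table)
cov <- fread("path/to/coverage.bedGraph.gz",
  col.names=c("chrom", "chromStart", "chromEnd", "count"))
## the data are run-length encoded, so weight each row by its
## run length when computing moments.
weight <- cov[, chromEnd - chromStart]
n.bases <- sum(weight)                       # number of data points
mean.count <- sum(cov$count * weight) / n.bases
var.count <- sum((cov$count - mean.count)^2 * weight) / n.bases
feature.vec <- c(
  log.bases=log(n.bases),
  log.mean=log(mean.count + 1),
  log.var=log(var.count + 1))
```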
To compare against this baseline, the following files are included in this repository:
- labeled_problems_features.csv: pre-defined features to be used as inputs in the machine learning problem.
- labeled_problems_targets.csv: outputs for the machine learning problem: the target interval of log(penalty) values that achieves the minimum number of incorrect labels. The goal is to learn a function that inputs the vector of features and outputs a value in this interval. These outputs can be used to compute a simple prediction error metric (the number of incorrectly predicted intervals).
- labeled_problems_errors.csv: used to compute a more relevant prediction error metric, the number of incorrectly predicted labels (and ROC-AUC); a lookup sketch is given after this list. For each problem the table gives the number of false positives (fp), false negatives (fn), and total incorrect labels (errors) for intervals of log(penalty) values (min.log.penalty, max.log.penalty). For example the first row is
```
ATAC_JV_adipose/samples/AC1/MSC77/problems/chr10:18024675-38818835,0,-Inf,2.05953368773019,6,0,6
```
and should be interpreted in the following way:
- for the problem ATAC_JV_adipose/samples/AC1/MSC77/problems/chr10:18024675-38818835
- if your function predicts a value between -Inf and 2.05953368773019
- then the predicted peaks are given by running PeakSegFPOP with penalty=0
- which yields 6 fp, 0 fn, and 6 errors.
- labeled_problems_possible_errors.csv contains the total number of labels, for computing test error/accuracy rates and Receiver Operating Characteristic (ROC) curves. If there is no model/parameter with peaks in all positive labels, then the ROC curve does not have a point at FPR=TPR=1, so it is suggested to use linear extrapolation between that point and the point in the ROC space which corresponds to the model with the most predicted peaks. If there is a model with non-hierarchical peaks (e.g. there is a parameter lambda_1 which yields 1 peak, and there is a parameter lambda_2 which yields 2 peaks which are both different from the lambda_1 peak) then the ROC curve may be non-monotonic. In that case it is suggested to compute AUC using an algorithm that works for general polygons, for example geometry::polyarea in R.
- labeled_problems_AUC.R contains R code for computing K-fold CV test error/accuracy/AUC, given the folds defined in the data set (e.g. data/ATAC_JV_adipose/folds.csv). To use this script, first save predicted log(penalty) values in the labeled_problems_pred directory. Each model should be saved as a separate csv file, with two columns: prob.dir and pred.log.lambda. For example this repo has two unsupervised baselines: (1) labeled_problems_pred/AIC.csv is the constant AIC baseline, and (2) labeled_problems_pred/BIC.csv is the penalty=log(number of data points) baseline. Running the R script will create a labeled_problems_pred_error/MODEL.csv file with one line for every (data set, fold) combination. Columns include test error/accuracy/FPR/TPR and test AUC.
- labeled_problems_plot_test_accuracy.R is an R script that can be used to plot such test error metrics.
- labeled_problems_folds.csv is a copy of the folds defined in the data set. (included to be able to compute K-fold CV test error/accuracy/AUC metrics for baselines, even if the full data set is not available)
- labeled_problems_pred_IntervalRegressionCV.R computes L1-regularized linear model predictions, saving them to labeled_problems_pred/IntervalRegressionCV.csv
- labeled_problems_pred_BestConstant.R computes model predictions for the best constant penalty, based on the penalty error functions in the training data. Predictions saved to labeled_problems_pred/BestConstant.csv
- labeled_problems_pred_MultiTaskIRCV.R computes a linear multi-task learning model, by creating indicator variables for each task. Predictions saved to labeled_problems_pred/MultiTaskIRCV.csv
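As promised above, a minimal sketch of looking up label errors for predicted log(penalty) values via a non-equi join, assuming the csv files have header rows with the column names used in this README (prob.dir, min.log.penalty, max.log.penalty, errors); the boundary handling of the half-open intervals is a simplifying assumption:

```r
library(data.table)
errors <- fread("labeled_problems_errors.csv")
pred <- fread("labeled_problems_pred/BIC.csv")  # prob.dir, pred.log.lambda
## for each prediction, find the row of the error table whose
## (min.log.penalty, max.log.penalty] interval contains it.
pred.err <- errors[pred, on=.(
  prob.dir,
  min.log.penalty <= pred.log.lambda,
  max.log.penalty >= pred.log.lambda)]
pred.err[, sum(errors)]  # total incorrectly predicted labels
```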
The figure below shows prediction accuracy and AUC, using the designated four-fold cross-validation scheme. It shows room for improvement in penalty function learning algorithms:
- learned linear penalty functions (IntervalRegressionCV, MultiTaskIRCV) have test accuracy similar to the learned constant penalty (BestConstant).
- All models have similar test AUC.
Some ideas for future work:

- Deep convolutional network with spatial pyramid pooling, in order to train the model using inputs of variable sizes (different chrom subsets). https://arxiv.org/abs/1406.4729
- Typical neural network implementations, e.g. scikit-learn, support regression but not censored outputs. We could either (1) convert the target intervals into a real-valued output and use existing code, or (2) implement a loss function that exploits the structure of the censored outputs; a sketch of one such loss follows this list.
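A minimal sketch of one such censored-output loss, a squared hinge on the target interval as used in the interval regression literature (the function name is made up for illustration):

```r
## zero loss when the predicted log(penalty) falls inside the
## target interval [lo, hi]; quadratic growth outside.
interval.loss <- function(pred, lo, hi) {
  pmax(0, lo - pred)^2 + pmax(0, pred - hi)^2
}
interval.loss(3, 2.06, Inf)  # 0: prediction inside the target interval
interval.loss(1, 2.06, Inf)  # 1.1236: prediction below the interval
```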
figure-auc-improved-train-predictions.R makes a figure of the improved train predictions. auc.improved.R performs the computation using a gradient descent algorithm, and figure-auc-improved-interactive.R visualizes the results:
- improvement starting from the best predicted values, when labels get equal weight.
- improvement starting from the best predicted values, when classes get equal weight.
folds.R implements a randomized heuristic for assigning problems to folds such that there are approximately equal numbers of labels in each fold.
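A simplified illustration of that balancing idea (not the actual algorithm in folds.R): try many random assignments and keep the one whose per-fold label counts are most nearly equal.

```r
assign.folds <- function(labels.per.problem, n.folds=4, n.tries=1000) {
  best.fold <- NULL
  best.spread <- Inf
  for (try.i in 1:n.tries) {
    ## random balanced assignment of problems to folds.
    fold <- sample(rep(1:n.folds, length.out=length(labels.per.problem)))
    per.fold <- tapply(labels.per.problem, fold, sum)
    spread <- max(per.fold) - min(per.fold)
    if (spread < best.spread) {
      best.spread <- spread
      best.fold <- fold
    }
  }
  best.fold
}
## e.g. five problems with these label counts, assigned to 4 folds:
assign.folds(c(200, 150, 180, 120, 90), n.folds=4)
```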
download.R was used to download the count data (coverage.bedGraph.gz files) along with the labels.bed files (38337 labels total in 5581 problems), for a total of almost 40GB of data. Maybe distribute one file per data set?
```r
> mb[per.set, on=list(set)][order(labels)]
    megabytes                      set labels
 1:       554       H3K36me3_TDH_other    200
 2:       377      H3K36me3_TDH_ENCODE    338
 3:       375       H3K4me3_TDH_ENCODE    525
 4:       592       H3K27me3_RL_cancer    570
 5:       798         H3K27ac_TDH_some    627
 6:       906      H3K36me3_TDH_immune    630
 7:       296        H3K27me3_TDH_some    696
 8:      2407          CTCF_TDH_ENCODE   1378
 9:      3223           H3K4me1_TDH_BP   1584
10:      5871       H3K36me3_AM_immune   1743
11:      6407          ATAC_JV_adipose   3241
12:      3017       H3K4me3_PGP_immune   3780
13:      2902       H3K4me3_TDH_immune   3807
14:      5421 H3K27ac-H3K4me3_TDHAM_BP  15961
```