Supervised peak detection benchmark data set
This repo contains code for creating a benchmark data set for predicting peaks in epigenomic data. It has been submitted to the UCI Machine Learning Repository and is now online as the chipseq data set.
From a machine learning perspective, the main interest/challenge is designing a loss function and algorithm that exploits the structure of the weak labels. For more details about the weak labels please read the Relevant Information below.
```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00439/peak-detection-data.tar.xz
tar xvf peak-detection-data.tar.xz
```

Download peak-detection-data.tar.xz, which is a 35GB xz-compressed tar file. After downloading and un-tar-ing it (perhaps using the commands above), you should have a data directory.
ChIP-seq experiments characterize protein modifications or binding at specific genomic locations in specific samples. The machine learning problem in these data is structured binary classification / changepoint detection.
Toby Dylan Hocking (toby.hocking@mail.mcgill.ca), McGill University
There are 4960 problem directories in total; the unique genomic problems range in size from about 220 kilobases to 116 megabases:

```r
> length(Sys.glob("data/*/samples/*/*/problems/*"))
[1] 4960
> data.table(namedCapture::str_match_named(unique(basename(Sys.glob("data/*/samples/*/*/problems/*"))), "(?<chrom>chr[^:]+):(?<chromStart>[0-9]+)-(?<chromEnd>[0-9]+)", list(chromStart=as.integer, chromEnd=as.integer)))[, quantile(chromEnd-chromStart)]
       0%       25%       50%       75%      100%
   220313   9057730  26144352  48590557 115591997
```
These data are significant because they are among the first to provide labels that formalize the genome-wide peak detection problem, which is a very important problem for biomedical / epigenomics researchers. These labels can be used to train and test supervised peak detection algorithms, as explained below.
The data are in problem directories such as
`data/<SET>/samples/<GROUP>/<SAMPLE>/problems/<PROBLEM>`
Each problem directory contains two files, labels.bed (weak labels) and coverage.bedGraph.gz (inputs).
Each coverage.bedGraph.gz file represents a vector of non-negative integer count data, one entry for each genomic position in a subset of the human genome hg19. For example
data/H3K9me3_TDH_BP/samples/tcell/ERS358697/problems/chr8:48135599-86500000/coverage.bedGraph.gz
represents a vector defined on all genomic positions from 48135600 to 86500000 on chr8 (for a particular tcell sample named ERS358697, in the H3K9me3_TDH_BP data set). To save disk space the vectors are saved using a run-length encoding; for example the first three lines of this file are
```
chr8	48135599	48135625	0
chr8	48135625	48135629	1
chr8	48135629	48135632	2
```
which means that the first 26 entries of the vector are 0, the next four entries are 1, and the following three entries are 2. Note that start positions are 0-based but end positions are 1-based, so the first line means the count is 0 at all positions from 48135600 to 48135625 (excluding the start position 48135599, for which we have no information).
The goal is to learn a function that takes the coverage.bedGraph.gz file as input, and outputs a binary classification for every genomic position. The positive class represents peaks (typically large counts) and the negative class represents background noise (typically small counts).
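For illustration, here is a minimal sketch of an unsupervised baseline that decodes the run-length encoding and predicts a peak wherever coverage exceeds a fixed threshold (the threshold value of 10 is an arbitrary assumption, not part of the benchmark):

```r
library(data.table)
## read one run-length encoded coverage file (fread reads .gz
## directly if the R.utils package is installed).
cov <- fread(
  "data/H3K9me3_TDH_BP/samples/tcell/ERS358697/problems/chr8:48135599-86500000/coverage.bedGraph.gz",
  col.names=c("chrom", "chromStart", "chromEnd", "count"))
## naive classification rule: peak wherever count > threshold.
cov[, peak := as.integer(count > 10)]
## convert back to run-length encoding by merging adjacent runs
## with the same predicted class.
cov[, run := cumsum(c(1, diff(peak) != 0))]
pred.peaks <- cov[, .(
  chromStart=chromStart[1], chromEnd=chromEnd[.N], peak=peak[1]
), by=run][peak == 1]
```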
Weak labels are given in labels.bed files, each of which indicates several regions of the genome with or without peaks. For example the file
data/H3K4me3_XJ_immune/samples/bcell/McGill0091/problems/chr1:30028082-103863906/labels.bed
contains the 6 labels below:
```
chr1	33111786	33114894	noPeaks
chr1	33114941	33116174	peakStart
chr1	33116183	33116620	peakEnd
chr1	33116633	33116755	noPeaks
chr1	33116834	33118135	peaks
chr1	33118161	33120163	noPeaks
```
The four labels are interpreted as follows:
- noPeaks: all of the predictions in this region should be negative / background noise. For example, the first line in the file above means that for a vector x_i of count data from i=30028083 to i=103863906, the desired function should predict negative / background noise, f(x_i)=0, from i=33111787 to i=33114894. If positive / peaks are predicted, f(x_i)=1, for any i in this region, that is counted as a false positive label.
- peakStart: there should be exactly one peak start predicted in this region. A peak start is defined as a position i such that a peak is predicted there, f(x_i)=1, but not at the previous position, f(x_{i-1})=0. The exact position is unspecified; any position is fine, as long as there is only one start in the region. Predicting exactly one peak start in this region results in a true positive (and true negative). More starts is a false positive, and fewer starts is a false negative. For example,
```
[ peakStart ]
0 0 0 1 1 1 1 -> correct
0 0 1 1 1 1 1 -> also correct
0 0 0 0 0 0 0 -> false negative (no peak starts)
0 0 1 0 1 1 1 -> false positive (two peak starts)
```
- peakEnd: there should be exactly one peak end predicted in this region. A peak end is defined as a position i such that a peak is predicted there, f(x_i)=1, but not at the next position, f(x_{i+1})=0. The exact position is unspecified; any position is fine, as long as there is only one end in the region. Predicting exactly one peak end in this region results in a true positive (and true negative). More ends is a false positive, and fewer ends is a false negative. For example,
```
[  peakEnd  ]
1 1 1 1 0 0 0 -> correct
1 1 1 1 1 0 0 -> also correct
0 0 0 0 0 0 0 -> false negative (no peak ends)
1 1 1 0 1 0 0 -> false positive (two peak ends)
```
- peaks: there should be at least one peak predicted somewhere in this region (anywhere is fine). Zero predicted peaks in this region is a false negative; at least one predicted peak somewhere in this region is a true positive.
For a particular set of predicted peaks f(x), the total number of incorrect labels (false positives + false negatives) can be computed as an evaluation metric (smaller is better). Typically the peak predictions are also stored using a run-length encoding; the error rates can be computed using the reference implementation in R package PeakError, https://github.com/tdhock/PeakError
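A hedged usage sketch of that reference implementation, assuming PeakErrorChrom() accepts data.frames of predicted peaks and labels with the columns shown (the installed package's exact interface may differ):

```r
library(PeakError)  # install from https://github.com/tdhock/PeakError
labels <- read.table(
  "data/H3K4me3_XJ_immune/samples/bcell/McGill0091/problems/chr1:30028082-103863906/labels.bed",
  col.names=c("chrom", "chromStart", "chromEnd", "annotation"))
## pred.peaks: a data.frame of predicted peaks with chromStart and
## chromEnd columns, e.g. from the thresholding sketch above.
label.errors <- PeakErrorChrom(pred.peaks, labels)
sum(label.errors$fp) + sum(label.errors$fn)  # total incorrect labels
```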
Receiver Operating Characteristic curves can be computed for a family of predicted peaks f_lambda(x), where lambda is some significance threshold, intercept parameter, etc. Compute the TPR and FPR as follows:
```
TPR = (total number of true positives)
      / (total number of labels that could have a true positive)
    = (number of peaks labels with at least one overlapping predicted peak
       + number of peakStart/peakEnd labels with at least one predicted start/end)
      / (number of peaks, peakStart, and peakEnd labels)

FPR = (total number of false positives)
      / (total number of labels that could have a false positive)
    = (number of peakStart/peakEnd labels with two or more predicted starts/ends
       + number of noPeaks labels with at least one overlapping predicted peak)
      / (number of peakStart, peakEnd, and noPeaks labels)
```
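A minimal sketch of that computation, assuming a hypothetical data.frame label.results with one row per label and columns annotation (noPeaks / peakStart / peakEnd / peaks), tp, and fp (0/1 indicators computed as described above):

```r
positive.types <- c("peaks", "peakStart", "peakEnd")
negative.types <- c("noPeaks", "peakStart", "peakEnd")
TPR <- with(label.results, {
  possible <- annotation %in% positive.types  # labels that could have a TP
  sum(tp[possible]) / sum(possible)
})
FPR <- with(label.results, {
  possible <- annotation %in% negative.types  # labels that could have a FP
  sum(fp[possible]) / sum(possible)
})
```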
Suggested fold ID numbers for four-fold cross-validation experiments can be found in data/*/folds.csv files. For example data/H3K36me3_TDH_other/folds.csv contains
```
problem,fold
chr16:8686921-32000000,1
chr16:60000-8636921,1
chr21:43005559-44632664,2
chr14:19050000-107289540,3
chr15:29209443-77800000,4
```
which means that problems chr16:8686921-32000000 and chr16:60000-8636921 should be considered fold ID 1, chr21:43005559-44632664 should be considered fold ID 2, etc. This means that for data set H3K36me3_TDH_other, the fold ID 2 consists of all data in data/H3K36me3_TDH_other/samples/*/*/problems/chr21:43005559-44632664 directories.
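A minimal sketch of assembling the train/test problem directories for one data set and one test fold (the data set name and fold ID are arbitrary examples):

```r
library(data.table)
set.dir <- "data/H3K36me3_TDH_other"
folds <- fread(file.path(set.dir, "folds.csv"))  # columns: problem,fold
test.fold <- 2
test.problems <- folds[fold == test.fold, problem]
## every sample has one directory per genomic problem.
prob.dirs <- Sys.glob(file.path(set.dir, "samples/*/*/problems/*"))
is.test <- basename(prob.dirs) %in% test.problems
train.dirs <- prob.dirs[!is.test]
test.dirs <- prob.dirs[is.test]
```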
There are several types of learning settings that could be used with these data. Here are four examples.
- Unsupervised learning. Train models only using the coverage.bedGraph.gz files. Only use the labels for evaluation (not for training model parameters).
- Supervised learning. Train models only using the coverage.bedGraph.gz and labels.bed files in the train set. Use the labels in the test set to evaluate prediction accuracy.
- Semi-supervised learning. Train models using the coverage.bedGraph.gz and labels.bed files in the train set. You can additionally use the coverage.bedGraph.gz files in the test set at training time. Use the labels in the test set to evaluate prediction accuracy.
- Multi-task learning. Many data sets come from different experiment types, and so have different peak patterns. For example H3K4me3_TDH_immune is an H3K4me3 histone modification (sharp peak pattern) and H3K36me3_TDH_immune is an H3K36me3 histone modification (broad peak pattern). Therefore models are not expected to generalize between data sets. However there is something common across data sets: in each data set, the peak / positive class is large values, whereas the noise / negative class is small values. Therefore multi-task learning may be interesting. To compare a multi-task learning model to a single-task learning model, use the suggested cross-validation fold IDs. For test fold ID 1, train both the multi-task and single-task learning models using all other folds, then make predictions on all data with fold ID 1. A sketch of one way to encode tasks follows this list.
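As promised above, a minimal sketch of the indicator-variable encoding idea (the feature values below are made up for illustration; labeled_problems_pred_MultiTaskIRCV.R, described below, uses this kind of construction):

```r
## one row per labeled problem, from two tasks / data sets.
features <- data.frame(
  log.bases=c(16.2, 17.5, 15.1),
  task=c("H3K4me3_TDH_immune", "H3K36me3_TDH_immune", "H3K4me3_TDH_immune"))
## one 0/1 column per task, so a linear model can learn a
## per-task offset in addition to shared feature weights.
task.indicators <- model.matrix(~ task - 1, features)
X <- cbind(log.bases=features$log.bases, task.indicators)
```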
Each attribute is a non-negative integer representing the number of DNA sequence reads that have aligned to that particular region of the genome. Larger values are more likely to be peaks / positive, smaller values are more likely to be noise / negative.
The labeling method and details on how to compute the number of incorrect labels are described in:
Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning. Toby Dylan Hocking, Patricia Goerner-Potvin, Andreanne Morin, Xiaojian Shao, Tomi Pastinen, Guillaume Bourque. Bioinformatics, Volume 33, Issue 4, 15 February 2017, Pages 491–499, https://doi.org/10.1093/bioinformatics/btw672
Please cite the Bioinformatics paper above.
The current state of the art for this type of problem is constrained optimal changepoint detection with learned penalty functions, as described in our ICML'15 paper. The changepoint detection method described in that paper is the Constrained Dynamic Programming Algorithm, which takes quadratic time and is therefore too slow for these large data sets (the largest coverage.bedGraph.gz file has 11,499,958 lines). A faster alternative is PeakSegPipeline::PeakSegFPOP_disk, which implements the log-linear time algorithm described in arXiv:1703.03352 (it computes the most likely peak positions for a given penalty parameter).

Typical unsupervised methods for choosing the penalty parameter (which do not use the labels) are theoretically-motivated penalties (AIC/BIC) or cross-validation. The current methods for learning the penalty function use pre-defined features as inputs (number of data points, mean, variance estimates, quantiles, etc.) and the predicted penalty as the output. This data set may be useful for exploring feature learning methods such as deep convolutional neural networks, which could relax the assumption of pre-defined features.

The interest of this data set is that it has a large number of labels (e.g. 15961 labels in the H3K27ac-H3K4me3_TDHAM_BP data set), which is an order of magnitude more than other benchmark data sets of this type (e.g. 3418 labels in the neuroblastoma data set). For more info about supervised changepoint detection see my useR2017 tutorial.
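For reference, a minimal sketch of computing a few such pre-defined features from one coverage file (the exact features in labeled_problems_features.csv, described below, may differ; the file path is a placeholder):

```r
library(data.table)
cov <- fread("path/to/coverage.bedGraph.gz",
  col.names=c("chrom", "chromStart", "chromEnd", "count"))
## the data are run-length encoded, so weight each row by its
## run length when computing moments.
weight <- cov[, chromEnd - chromStart]
n.bases <- sum(weight)                       # number of data points
mean.count <- sum(cov$count * weight) / n.bases
var.count <- sum((cov$count - mean.count)^2 * weight) / n.bases
feature.vec <- c(
  log.bases=log(n.bases),
  log.mean=log(mean.count + 1),
  log.var=log(var.count + 1))
```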
To compare against this baseline, the following files are included in this repository:
- labeled_problems_features.csv: pre-defined features to be used as inputs in the machine learning problem.
- labeled_problems_targets.csv: outputs for the machine learning problem: the target interval of log(penalty) values that achieves the minimum number of incorrect labels. The goal is to learn a function that inputs the vector of features and outputs a value in this interval. These outputs can be used to compute a simple prediction error metric (the number of incorrectly predicted intervals).
- labeled_problems_errors.csv: used to compute a more relevant prediction error metric, the number of incorrectly predicted labels (and ROC-AUC); a lookup sketch is given after this list. For each problem the table gives the number of false positives (fp), false negatives (fn), and total incorrect labels (errors) for intervals of log(penalty) values (min.log.penalty, max.log.penalty). For example the first row is
```
ATAC_JV_adipose/samples/AC1/MSC77/problems/chr10:18024675-38818835,0,-Inf,2.05953368773019,6,0,6
```
and should be interpreted in the following way:
- for the problem ATAC_JV_adipose/samples/AC1/MSC77/problems/chr10:18024675-38818835
- if your function predicts a value between -Inf and 2.05953368773019
- then the predicted peaks are given by running PeakSegFPOP with penalty=0
- which yields 6 fp, 0 fn, and 6 errors.
- labeled_problems_possible_errors.csv contains the total number of labels, for computing test error/accuracy rates and Receiver Operating Characteristic (ROC) curves. If there is no model/parameter with peaks in all positive labels, then the ROC curve does not have a point at FPR=TPR=1, so it is suggested to use linear extrapolation between that point and the point in the ROC space which corresponds to the model with the most predicted peaks. If there is a model with non-hierarchical peaks (e.g. there is a parameter lambda_1 which yields 1 peak, and there is a parameter lambda_2 which yields 2 peaks which are both different from the lambda_1 peak) then the ROC curve may be non-monotonic. In that case it is suggested to compute AUC using an algorithm that works for general polygons, for example geometry::polyarea in R.
- labeled_problems_AUC.R contains R code for computing K-fold CV test error/accuracy/AUC, given the folds defined in the data set (e.g. data/ATAC_JV_adipose/folds.csv). To use this script, first save predicted log(penalty) values in the labeled_problems_pred directory. Each model should be saved as a separate csv file, with two columns: prob.dir and pred.log.lambda. For example this repo has two unsupervised baselines: (1) labeled_problems_pred/AIC.csv is the constant AIC baseline, and (2) labeled_problems_pred/BIC.csv is the penalty=log(number of data points) baseline. Running the R script will create a labeled_problems_pred_error/MODEL.csv file with one line for every (data set, fold) combination. Columns include test error/accuracy/FPR/TPR and test AUC.
- labeled_problems_plot_test_accuracy.R is an R script that can be used to plot such test error metrics.
- labeled_problems_folds.csv is a copy of the folds defined in the data set. (included to be able to compute K-fold CV test error/accuracy/AUC metrics for baselines, even if the full data set is not available)
- labeled_problems_pred_IntervalRegressionCV.R computes L1-regularized linear model predictions, saving them to labeled_problems_pred/IntervalRegressionCV.csv
- labeled_problems_pred_BestConstant.R computes model predictions for the best constant penalty, based on the penalty error functions in the training data. Predictions saved to labeled_problems_pred/BestConstant.csv
- labeled_problems_pred_MultiTaskIRCV.R computes a linear multi-task learning model, by creating indicator variables for each task. Predictions saved to labeled_problems_pred/MultiTaskIRCV.csv
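As promised above, a minimal sketch of looking up label errors for predicted log(penalty) values via a non-equi join, assuming the csv files have header rows with the column names used in this README (prob.dir, min.log.penalty, max.log.penalty, errors); the boundary handling of the half-open intervals is a simplifying assumption:

```r
library(data.table)
errors <- fread("labeled_problems_errors.csv")
pred <- fread("labeled_problems_pred/BIC.csv")  # prob.dir, pred.log.lambda
## for each prediction, find the row of the error table whose
## (min.log.penalty, max.log.penalty] interval contains it.
pred.err <- errors[pred, on=.(
  prob.dir,
  min.log.penalty <= pred.log.lambda,
  max.log.penalty >= pred.log.lambda)]
pred.err[, sum(errors)]  # total incorrectly predicted labels
```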
The figure below shows prediction accuracy and AUC, using the designated four-fold cross-validation scheme. It shows room for improvement in penalty function learning algorithms:
- learned linear penalty functions (IntervalRegressionCV, MultiTaskIRCV) have test accuracy similar to the learned constant penalty (BestConstant).
- All models have similar test AUC.
Some ideas for future work:

- Deep convolutional network with spatial pyramid pooling, in order to train the model using inputs of variable sizes (different chrom subsets). https://arxiv.org/abs/1406.4729
- Typical neural network implementations, e.g. scikit-learn, support regression but not censored outputs. We could either (1) convert the target intervals into a real-valued output and use existing code, or (2) implement a loss function that exploits the structure of the censored outputs; a sketch of one such loss follows this list.
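A minimal sketch of one such censored-output loss, a squared hinge on the target interval as used in the interval regression literature (the function name is made up for illustration):

```r
## zero loss when the predicted log(penalty) falls inside the
## target interval [lo, hi]; quadratic growth outside.
interval.loss <- function(pred, lo, hi) {
  pmax(0, lo - pred)^2 + pmax(0, pred - hi)^2
}
interval.loss(3, 2.06, Inf)  # 0: prediction inside the target interval
interval.loss(1, 2.06, Inf)  # 1.1236: prediction below the interval
```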
figure-auc-improved-train-predictions.R makes a figure of the improved train predictions. auc.improved.R performs the computation using a gradient descent algorithm, and figure-auc-improved-interactive.R visualizes the results:
- improvement starting from the best predicted values, when labels get equal weight.
- improvement starting from the best predicted values, when classes get equal weight.
folds.R implements a randomized heuristic for assigning problems to folds such that there are approximately equal numbers of labels in each fold.
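A simplified illustration of that balancing idea (not the actual algorithm in folds.R): try many random assignments and keep the one whose per-fold label counts are most nearly equal.

```r
assign.folds <- function(labels.per.problem, n.folds=4, n.tries=1000) {
  best.fold <- NULL
  best.spread <- Inf
  for (try.i in 1:n.tries) {
    ## random balanced assignment of problems to folds.
    fold <- sample(rep(1:n.folds, length.out=length(labels.per.problem)))
    per.fold <- tapply(labels.per.problem, fold, sum)
    spread <- max(per.fold) - min(per.fold)
    if (spread < best.spread) {
      best.spread <- spread
      best.fold <- fold
    }
  }
  best.fold
}
## e.g. five problems with these label counts, assigned to 4 folds:
assign.folds(c(200, 150, 180, 120, 90), n.folds=4)
```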
download.R was used to download the count data (coverage.bedGraph.gz files) along with the labels.bed files (38337 labels total in 5581 problems), for a total of almost 40GB of data. Maybe distribute one file per data set?
```r
> mb[per.set, on=list(set)][order(labels)]
    megabytes                      set labels
 1:       554       H3K36me3_TDH_other    200
 2:       377      H3K36me3_TDH_ENCODE    338
 3:       375       H3K4me3_TDH_ENCODE    525
 4:       592       H3K27me3_RL_cancer    570
 5:       798         H3K27ac_TDH_some    627
 6:       906      H3K36me3_TDH_immune    630
 7:       296        H3K27me3_TDH_some    696
 8:      2407          CTCF_TDH_ENCODE   1378
 9:      3223           H3K4me1_TDH_BP   1584
10:      5871       H3K36me3_AM_immune   1743
11:      6407          ATAC_JV_adipose   3241
12:      3017       H3K4me3_PGP_immune   3780
13:      2902       H3K4me3_TDH_immune   3807
14:      5421 H3K27ac-H3K4me3_TDHAM_BP  15961
```