Neuroblastoma data and other supervised penalty learning benchmarks
This repo provides xz-compressed text/csv files of several benchmark data sets involving censored regression for supervised penalty function learning for changepoint detection,
- which was first described in the context of DNA copy number profile data at ICML’13. If you use data/systematic or data/detailed please cite this paper:
@inproceedings{Rigaill2013,
title={Learning sparse penalties for change-point detection using max margin interval regression},
author={ Rigaill, Guillem and Hocking, Toby and Vert, Jean-Philippe and Bach, Francis},
booktitle={Proc. 30th ICML},
pages={172--180},
year={2013}
}
- we also use supervised penalty function learning for ChIP-seq peak detection, as described in our Bioinformatics’17 paper. If you use any of the other data sets please cite this paper:
@article{Hocking2017,
author = {Hocking, Toby Dylan and Goerner-Potvin, Patricia and Morin, Andreanne and Shao, Xiaojian and Pastinen, Tomi and Bourque, Guillaume},
title = "{Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning}",
journal = {Bioinformatics},
volume = {33},
number = {4},
pages = {491-499},
year = {2016},
month = {11},
issn = {1367-4803},
}
There are two neuroblastoma data sets (also available on the CRAN), and several ChIP-seq data sets (also available from the UCI repository).
Each data set has the following files, with the sequenceID column used to link them, and the following columns:
- profiles: raw/noisy sequence data in which to look for changepoints.
- position: position on chromosome (x axis in plot of sequence data)
- signal: noisy data (y axis on plot of sequence data)
- outputs: interval of log(penalty) values that achieves min label errors, i.e. all log(penalty) values in (min.log.lambda, max.log.lambda) result in changepoint models with min label errors. This is the output/Y value used in the censored regression problem.
- labels: regions which an expert has labeled as containing a known
number of changepoints.
- labelStart/labelEnd columns define the region.
- annotation: the expert-provided code word that implies the number of changes in the region (penaltyLearning::change.labels is used to convert to the following columns).
- max.changes/min.changes: the min/max number of changes that must occur in this region, according to the expert who created the label. Number of predicted changes < min.changes is a false negative (fn); max.changes < number of predicted changes is a false positive (fp).
- color: suggested color to use when plotting the label on top of the profile/sequence data.
- possible.fn/possible.fp: 1 if fn/fp possible in this label, 0 otherwise.
- errors/evaluation: a description of the number of incorrect labels, as a
function of log(penalty) values, can be used to evaluate prediction
accuracy in terms of number of incorrectly predicted labels, or area
under the ROC curve.
- n.segments: model size (n.segments = n.changes +1).
- possible.fp/fp/possible.fn/fn: possible and predicted false positive/negative labels for this model.
- labels/errors: total labels and number of incorrectly predicted labels (errors = fp + fn).
- min.log.lambda/max.log.lambda: interval of log(penalty) values for which model size = n.segments is selected.
- loss: total squared error
- inputs: numeric feature matrix, with one row for each labeled sequenceID, and one column for every feature that should/can be used to predict the log(penalty) value for that sequenceID.
For the neuroblastoma data there are actually two data sets, data/systematic and data/detailed, which correspond to two different labels for the same noisy data sequences.
The original neuroblastoma data set has 3418 labeled sequences. SequenceIDs are of the form profile_chrZ where profile is the patient ID number (profile.id column in the neuroblastoma data set) and Z = chromosome column in the neuroblastoma data set. Z is in the set {1, 2, 3, 4, 11, 17}, and there is a max of one label per sequence. Three interesting cross-validation experiments / prediction problems, from hard to easy:
- Hard: hold out an entire chromosome as a test set, e.g. chr17 is test set, others are train set, e.g. data/systematic/cv/chrom/folds.csv
- hold out several profile.id numbers as a test set, train on other profiles. Hard: define test set as all sequenceIDs with more/less than 1000 data points, and train set as all other sequenceIDs, e.g. data/systematic/cv/profileSize/folds.csv. Medium: randomly select a subset of profile.id numbers as a test set, train on other profiles, e.g. data/systematic/cv/profileID/folds.csv.
- Easy: randomly select sequenceIDs as a test set, train on other sequenceIDs, e.g. data/systematic/cv/sequenceID/folds.csv.
Data files were derived from
data(neuroblastoma, package="neuroblastoma")
data(neuroblastomaProcessed, package="penaltyLearning")
This data set is also derived from the neuroblastoma data set; it is called “detailed” because, unlike the systematic/original labels, some sequences have more than one label:
data.table::fread("xzcat data/detailed/labels.csv.xz")[, .(
labels=.N), by=sequenceID][, .(
sequences=.N), by=list(labels)]
> data.table::fread("xzcat data/detailed/labels.csv.xz")[, .( + labels=.N), by=sequenceID][, .( + sequences=.N), by=list(labels)] labels sequences 1: 1 3342 2: 2 196 3: 3 161 4: 4 19 5: 5 9 6: 6 2 7: 9 1 >
feature-learning-benchmark.R copies UCI chipseq data sets into this format, e.g.
- data/ATAC_JV_adipose/inputs.csv.xz
- data/ATAC_JV_adipose/outputs.csv.xz
- data/ATAC_JV_adipose/evaluation.csv.xz
- data/ATAC_JV_adipose/cv/equal_labels/folds.csv
Whereas the neuroblastoma data sets (data/detailed and data/systematic) are changepoint detection problems for DNA copy number profiles, these other data sets are ChIP-seq peak detection problems.
demo-folds.R shows that the sequenceIDs in baseline predictions.csv files match the sequenceIDs for the corresponding test fold.
Various data sets have 23 labeled chromosomes with between 11 and 56 labels per chrom.
signal.list.annotation.sets.R makes
> labels.wide
label.set pid.chr 1breakpoint 0breakpoints n.data
1: lymphoma.mkatayama 30001.1 0 1 3973
2: lymphoma.mkatayama 30001.10 0 2 1683
3: lymphoma.mkatayama 30001.11 0 1 2245
4: lymphoma.mkatayama 30001.12 0 1 2132
5: lymphoma.mkatayama 30001.13 0 1 1006
---
4468: medulloblastoma.tdh 20138.2 18 0 5937
4469: medulloblastoma.tdh 20165.2 20 0 5937
4470: neuroblastoma.chiba.tdh 20004.2 23 0 22215
4471: medulloblastoma.tdh 20165.9 25 0 3012
4472: neuroblastoma.dr.tdh 20104.2 26 0 153662
> ggplot()+geom_point(aes(position, logratio), data=data.table(signal.list[["20165.9"]]))
>
figure-2019-08-14-animint.R makes http://jan.ucc.nau.edu/~th798/viz/2019-08-15-GP-sample-selection/
data/detailed2012_11_30/regions.csv.gz
> r[type=="breakpoints", table(annotation)]
annotation
>0breakpoints 0breakpoints 1breakpoint
224 3591 1109
>
http://members.cbio.mines-paristech.fr/~thocking/neuroblastoma/signal.list.annotation.sets.RData
(objs <- load("signal.list.annotation.sets.RData"))
library(data.table)
size.vec <- sapply(signal.list, nrow)
seq.labels <- do.call(rbind, lapply(names(annotation.sets), function(label.set){
label.df <- annotation.sets[[label.set]]
data.table(label.df)[, data.table(
label.set,
labels=.N
), by=.(pid.chr=paste0(profile.id, ".", chromosome), annotation)]
}))
set.sizes <- seq.labels[, {
u.ids <- unique(pid.chr)
u.sizes <- size.vec[paste(u.ids)]
as.list(quantile(u.sizes, seq(0, 1, l=3)))
}, by=.(label.set)]
set.labels <- dcast(
seq.labels,
label.set ~ annotation,
value.var="labels",
fun.aggregate=sum)
set.sizes[set.labels, on=.(label.set)][order(`50%`)]
> set.sizes[set.labels, on=.(label.set)][order(`50%`)]
label.set 0% 50% 100% 1breakpoint 0breakpoints
1: neuroblastoma.bac 25 234 657 485 3157
2: lymphoma.tdh 536 2075 28006 44 52
3: medulloblastoma.tdh 849 3865 148782 1180 568
4: neuroblastoma.nimblegen 1948 4674 5937 7 181
5: neuroblastoma.chiba.tdh 1589 13306 22215 125 93
6: lymphoma.mkatayama 533 13502 34629 182 79
7: neuroblastoma.dr.tdh 24484 89524 153662 537 247
>
http://members.cbio.mines-paristech.fr/~thocking/neuroblastoma/slides-snp6.tgz
figure-2019-07-22.R makes
figure-2019-07-19.R makes
figure-max-auc.R creates http://members.cbio.mines-paristech.fr/~thocking/figure-max-auc/
figure-max-auc.R creates an interactive data viz that shows the AUC maximization/alignment problem,
accuracy.R computes accuracy.csv files e.g. data/H3K27ac_TDH_some/cv/equal_labels/testFolds/1/randomTrainOrderings/3/models/unreg_linear_2/accuracy.csv
evaluation.R creates data/systematic/evaluation.csv.xz from data/systematic/errors.csv.xz
Baseline predictions files created via baseline.predictions.R:
e.g. data/systematic/cv/sequenceID/testFolds/4/sampleSelectionGP_SE/5/models/unreg_linear_2/predictions.csv is a CSV data table with one row per test sequenceID and one column for each train set size.
detailed.R creates evaluations/inputs/outputs for detailed data set.
figure-random-gp-lin.R makes the following figures (lines for median, shaded bands for quartiles).
figure-random-linear-selection.R makes
TODOs:
- non-redundant features, data/systematic/nonredundant.csv computed via nonredundant.R
- order files for each pair selected at first.
- accuracy file, prediction file for bayesian model?
- write down legend for baseline models, the suffix integer is the
number of features used for prediction:
- baseline_0: features completely ignored, prediction is the best constant value for the train labels.
- unsup_BIC_1: labels completely ignored, prediction is always the BIC penalty = log(number of data points on the sequence).
- unreg_linear_1: labels used to infer slope/weight and intercept/bias in linear model with single feature (same feature as used in BIC penalty), log(penalty_i) = bias + weight * log(log(data_i)).
- unreg_linear_2: same as above but with an additional feature/weight for a variance estimate of the noisy seq data.
- L1reg_linear_117: log(penalty_i) = bias + w^T x_i, with 117 features/weights learned by minimizing a L1 regularized cost function.
figure-baseline.R makes
baseline.R computes baseline.csv accuracy for constant and L1-regularized linear model in random data ordering, several train set sizes. e.g. data/systematic/cv/chrom/testFolds/1/randomTrainOrderings/1/baseline.csv
randomOrderings.R creates 5 random orderings of the train data for each fold, saved in e.g. data/systematic/cv/chrom/testFolds/1/randomTrainOrderings/1/order.csv
cv.R which should creates folds.csv files with train/test splits, e.g. data/systematic/cv/chrom/folds.csv
neuroblastoma.R script creates xz-compressed text files data/*/*.xz from data sets in R packages.