GitHub - tdhock/neuroblastoma-data

Neuroblastoma data and other supervised penalty learning benchmarks

Citations

This repo provides xz-compressed text/csv files of several benchmark data sets involving censored regression for supervised penalty function learning for changepoint detection,

which was first described in the context of DNA copy number profile data at ICML’13. If you use data/systematic or data/detailed please cite this paper:

@inproceedings{Rigaill2013,
  title={Learning sparse penalties for change-point detection using max margin interval regression},
  author={ Rigaill, Guillem and Hocking, Toby and Vert, Jean-Philippe and Bach, Francis},
  booktitle={Proc. 30th ICML},
  pages={172--180},
  year={2013}
}

we also use supervised penalty function learning for ChIP-seq peak detection, as described in our Bioinformatics’17 paper. If you use any of the other data sets please cite this paper:

@article{Hocking2017,
    author = {Hocking, Toby Dylan and Goerner-Potvin, Patricia and Morin, Andreanne and Shao, Xiaojian and Pastinen, Tomi and Bourque, Guillaume},
    title = "{Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning}",
    journal = {Bioinformatics},
    volume = {33},
    number = {4},
    pages = {491-499},
    year = {2016},
    month = {11},
    issn = {1367-4803},
}

Standard format for each data set

There are two neuroblastoma data sets (also available on the CRAN), and several ChIP-seq data sets (also available from the UCI repository).

Each data set has the following files, with the sequenceID column used to link them, and the following columns:

profiles: raw/noisy sequence data in which to look for changepoints.
- position: position on chromosome (x axis in plot of sequence data)
- signal: noisy data (y axis on plot of sequence data)
outputs: interval of log(penalty) values that achieves min label errors, i.e. all log(penalty) values in (min.log.lambda, max.log.lambda) result in changepoint models with min label errors. This is the output/Y value used in the censored regression problem.
labels: regions which an expert has labeled as containing a known number of changepoints.
- labelStart/labelEnd columns define the region.
- annotation: the expert-provided code word that implies the number of changes in the region (penaltyLearning::change.labels is used to convert to the following columns).
- max.changes/min.changes: the min/max number of changes that must occur in this region, according to the expert who created the label. Number of predicted changes < min.changes is a false negative (fn); max.changes < number of predicted changes is a false positive (fp).
- color: suggested color to use when plotting the label on top of the profile/sequence data.
- possible.fn/possible.fp: 1 if fn/fp possible in this label, 0 otherwise.
errors/evaluation: a description of the number of incorrect labels, as a function of log(penalty) values, can be used to evaluate prediction accuracy in terms of number of incorrectly predicted labels, or area under the ROC curve.
- n.segments: model size (n.segments = n.changes +1).
- possible.fp/fp/possible.fn/fn: possible and predicted false positive/negative labels for this model.
- labels/errors: total labels and number of incorrectly predicted labels (errors = fp + fn).
- min.log.lambda/max.log.lambda: interval of log(penalty) values for which model size = n.segments is selected.
- loss: total squared error
inputs: numeric feature matrix, with one row for each labeled sequenceID, and one column for every feature that should/can be used to predict the log(penalty) value for that sequenceID.

Systematic / original labels

For the neuroblastoma data there are actually two data sets, data/systematic and data/detailed, which correspond to two different labels for the same noisy data sequences.

data/systematic/

The original neuroblastoma data set has 3418 labeled sequences. SequenceIDs are of the form profile_chrZ where profile is the patient ID number (profile.id column in the neuroblastoma data set) and Z = chromosome column in the neuroblastoma data set. Z is in the set {1, 2, 3, 4, 11, 17}, and there is a max of one label per sequence. Three interesting cross-validation experiments / prediction problems, from hard to easy:

Hard: hold out an entire chromosome as a test set, e.g. chr17 is test set, others are train set, e.g. data/systematic/cv/chrom/folds.csv
hold out several profile.id numbers as a test set, train on other profiles. Hard: define test set as all sequenceIDs with more/less than 1000 data points, and train set as all other sequenceIDs, e.g. data/systematic/cv/profileSize/folds.csv. Medium: randomly select a subset of profile.id numbers as a test set, train on other profiles, e.g. data/systematic/cv/profileID/folds.csv.
Easy: randomly select sequenceIDs as a test set, train on other sequenceIDs, e.g. data/systematic/cv/sequenceID/folds.csv.

Data files were derived from

data(neuroblastoma, package="neuroblastoma")
data(neuroblastomaProcessed, package="penaltyLearning")

Detailed

data/detailed/

This data set is also derived from the neuroblastoma data set; it is called “detailed” because, unlike the systematic/original labels, some sequences have more than one label:

data.table::fread("xzcat data/detailed/labels.csv.xz")[, .(
  labels=.N), by=sequenceID][, .(
    sequences=.N), by=list(labels)]

> data.table::fread("xzcat data/detailed/labels.csv.xz")[, .(
+   labels=.N), by=sequenceID][, .(
+     sequences=.N), by=list(labels)]
   labels sequences
1:      1      3342
2:      2       196
3:      3       161
4:      4        19
5:      5         9
6:      6         2
7:      9         1
>

ChIP-seq data sets

feature-learning-benchmark.R copies UCI chipseq data sets into this format, e.g.

Whereas the neuroblastoma data sets (data/detailed and data/systematic) are changepoint detection problems for DNA copy number profiles, these other data sets are ChIP-seq peak detection problems.

5 Nov 2019

demo-folds.R shows that the sequenceIDs in baseline predictions.csv files match the sequenceIDs for the corresponding test fold.

26 Sep 2019

Various data sets have 23 labeled chromosomes with between 11 and 56 labels per chrom.

10 Sep 2019

signal.list.annotation.sets.R makes

> labels.wide
                    label.set  pid.chr 1breakpoint 0breakpoints n.data
   1:      lymphoma.mkatayama  30001.1           0            1   3973
   2:      lymphoma.mkatayama 30001.10           0            2   1683
   3:      lymphoma.mkatayama 30001.11           0            1   2245
   4:      lymphoma.mkatayama 30001.12           0            1   2132
   5:      lymphoma.mkatayama 30001.13           0            1   1006
  ---                                                                 
4468:     medulloblastoma.tdh  20138.2          18            0   5937
4469:     medulloblastoma.tdh  20165.2          20            0   5937
4470: neuroblastoma.chiba.tdh  20004.2          23            0  22215
4471:     medulloblastoma.tdh  20165.9          25            0   3012
4472:    neuroblastoma.dr.tdh  20104.2          26            0 153662
> ggplot()+geom_point(aes(position, logratio), data=data.table(signal.list[["20165.9"]]))
>

15 Aug 2019

figure-2019-08-14-animint.R makes http://jan.ucc.nau.edu/~th798/viz/2019-08-15-GP-sample-selection/

7 Aug 2019

data/detailed2012_11_30/regions.csv.gz

> r[type=="breakpoints", table(annotation)]
annotation
>0breakpoints  0breakpoints   1breakpoint
          224          3591          1109
>

http://members.cbio.mines-paristech.fr/~thocking/neuroblastoma/signal.list.annotation.sets.RData

(objs <- load("signal.list.annotation.sets.RData"))
library(data.table)
size.vec <- sapply(signal.list, nrow)
seq.labels <- do.call(rbind, lapply(names(annotation.sets), function(label.set){
  label.df <- annotation.sets[[label.set]]
  data.table(label.df)[, data.table(
    label.set,
    labels=.N
  ), by=.(pid.chr=paste0(profile.id, ".", chromosome), annotation)]
}))
set.sizes <- seq.labels[, {
  u.ids <- unique(pid.chr)
  u.sizes <- size.vec[paste(u.ids)]
  as.list(quantile(u.sizes, seq(0, 1, l=3)))
}, by=.(label.set)]
set.labels <- dcast(
  seq.labels,
  label.set ~ annotation,
  value.var="labels",
  fun.aggregate=sum)
set.sizes[set.labels, on=.(label.set)][order(`50%`)]

> set.sizes[set.labels, on=.(label.set)][order(`50%`)]

                 label.set    0%   50%   100% 1breakpoint 0breakpoints
1:       neuroblastoma.bac    25   234    657         485         3157
2:            lymphoma.tdh   536  2075  28006          44           52
3:     medulloblastoma.tdh   849  3865 148782        1180          568
4: neuroblastoma.nimblegen  1948  4674   5937           7          181
5: neuroblastoma.chiba.tdh  1589 13306  22215         125           93
6:      lymphoma.mkatayama   533 13502  34629         182           79
7:    neuroblastoma.dr.tdh 24484 89524 153662         537          247
>

http://members.cbio.mines-paristech.fr/~thocking/neuroblastoma/slides-snp6.tgz

22 July 2019

figure-2019-07-22.R makes

19 July 2019

figure-2019-07-19.R makes

29 May 2019

figure-max-auc.R creates http://members.cbio.mines-paristech.fr/~thocking/figure-max-auc/

24 May 2019

figure-max-auc.R creates an interactive data viz that shows the AUC maximization/alignment problem,

accuracy.R computes accuracy.csv files e.g. data/H3K27ac_TDH_some/cv/equal_labels/testFolds/1/randomTrainOrderings/3/models/unreg_linear_2/accuracy.csv

evaluation.R creates data/systematic/evaluation.csv.xz from data/systematic/errors.csv.xz

23 May 2019

Baseline predictions files created via baseline.predictions.R:

e.g. data/systematic/cv/sequenceID/testFolds/4/sampleSelectionGP_SE/5/models/unreg_linear_2/predictions.csv is a CSV data table with one row per test sequenceID and one column for each train set size.

detailed.R creates evaluations/inputs/outputs for detailed data set.

14 May 2019

figure-random-gp-lin.R makes the following figures (lines for median, shaded bands for quartiles).

26 Apr 2019

figure-random-linear-selection.R makes

22 Apr 2019

TODOs:

non-redundant features, data/systematic/nonredundant.csv computed via nonredundant.R
order files for each pair selected at first.
accuracy file, prediction file for bayesian model?
write down legend for baseline models, the suffix integer is the number of features used for prediction:
- baseline_0: features completely ignored, prediction is the best constant value for the train labels.
- unsup_BIC_1: labels completely ignored, prediction is always the BIC penalty = log(number of data points on the sequence).
- unreg_linear_1: labels used to infer slope/weight and intercept/bias in linear model with single feature (same feature as used in BIC penalty), log(penalty_i) = bias + weight * log(log(data_i)).
- unreg_linear_2: same as above but with an additional feature/weight for a variance estimate of the noisy seq data.
- L1reg_linear_117: log(penalty_i) = bias + w^T x_i, with 117 features/weights learned by minimizing a L1 regularized cost function.

17 Apr 2019

figure-baseline.R makes

baseline.R computes baseline.csv accuracy for constant and L1-regularized linear model in random data ordering, several train set sizes. e.g. data/systematic/cv/chrom/testFolds/1/randomTrainOrderings/1/baseline.csv

randomOrderings.R creates 5 random orderings of the train data for each fold, saved in e.g. data/systematic/cv/chrom/testFolds/1/randomTrainOrderings/1/order.csv

16 Apr 2019

cv.R which should creates folds.csv files with train/test splits, e.g. data/systematic/cv/chrom/folds.csv

15 Apr 2019

neuroblastoma.R script creates xz-compressed text files data/*/*.xz from data sets in R packages.

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
data		data
data_loader		data_loader
.gitignore		.gitignore
Makefile		Makefile
README.org		README.org
accuracy.R		accuracy.R
add-all.sh		add-all.sh
add-predictions.sh		add-predictions.sh
baseline.R		baseline.R
baseline.predictions.R		baseline.predictions.R
cv.R		cv.R
demo-folds.R		demo-folds.R
detailed.R		detailed.R
evaluation.R		evaluation.R
feature-learning-benchmark.R		feature-learning-benchmark.R
figure-2019-07-19-accuracy.png		figure-2019-07-19-accuracy.png
figure-2019-07-19.R		figure-2019-07-19.R
figure-2019-07-22-L1reg.png		figure-2019-07-22-L1reg.png
figure-2019-07-22-all.png		figure-2019-07-22-all.png
figure-2019-07-22.R		figure-2019-07-22.R
figure-2019-07-22.png		figure-2019-07-22.png
figure-2019-08-14-animint.R		figure-2019-08-14-animint.R
figure-2019-08-14.R		figure-2019-08-14.R
figure-baseline-lines.png		figure-baseline-lines.png
figure-baseline.R		figure-baseline.R
figure-baseline.png		figure-baseline.png
figure-max-auc.R		figure-max-auc.R
figure-neuroblastomaProcessed-combinations-scatter.png		figure-neuroblastomaProcessed-combinations-scatter.png
figure-random-gp-lin-diff-mean.png		figure-random-gp-lin-diff-mean.png
figure-random-gp-lin-diff-median.png		figure-random-gp-lin-diff-median.png
figure-random-gp-lin-diff.png		figure-random-gp-lin-diff.png
figure-random-gp-lin-mean.png		figure-random-gp-lin-mean.png
figure-random-gp-lin-median.png		figure-random-gp-lin-median.png
figure-random-gp-lin.R		figure-random-gp-lin.R
figure-random-gp-lin.png		figure-random-gp-lin.png
figure-random-linear-selection.R		figure-random-linear-selection.R
figure-random-linear-selection.png		figure-random-linear-selection.png
iregnet.R		iregnet.R
neuroblastoma.R		neuroblastoma.R
nonredundant.R		nonredundant.R
packages.R		packages.R
randomOrderings.R		randomOrderings.R
signal.list.annotation.sets.R		signal.list.annotation.sets.R
slurm-afterok.tmpl		slurm-afterok.tmpl
xgboost.R		xgboost.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Citations

Standard format for each data set

Systematic / original labels

Detailed

ChIP-seq data sets

5 Nov 2019

26 Sep 2019

10 Sep 2019

15 Aug 2019

7 Aug 2019

22 July 2019

19 July 2019

29 May 2019

24 May 2019

23 May 2019

14 May 2019

26 Apr 2019

22 Apr 2019

17 Apr 2019

16 Apr 2019

15 Apr 2019

About

Releases

Packages

Contributors 3

Languages

tdhock/neuroblastoma-data

Folders and files

Latest commit

History

Repository files navigation

Citations

Standard format for each data set

Systematic / original labels

Detailed

ChIP-seq data sets

5 Nov 2019

26 Sep 2019

10 Sep 2019

15 Aug 2019

7 Aug 2019

22 July 2019

19 July 2019

29 May 2019

24 May 2019

23 May 2019

14 May 2019

26 Apr 2019

22 Apr 2019

17 Apr 2019

16 Apr 2019

15 Apr 2019

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages