For reproducibility, a Makefile is included. Typing the following commands should download/clone this repo and then re-create the figures from the paper.
git clone https://github.com/tdhock/changepoint-data-structure
cd changepoint-data-structure
rm -f *.rds *.csv
make
The resulting figures will be compiled into figures.pdf, whose corresponding source file is figures.tex.
- Figure 1: demonstration of iterations 2 and 3 of proposed algorithm.
- Figure 2: empirical number of iterations.
- Figure 3: left, empirical timings for linear/quadratic algorithms; right, empirical timings for linear/quadratic/grid search.
- Figure 4: empirical timings in different simulations.
- Figure 5: accuracy in 4-fold cross-validation experiments.
- Figure 6: timing comparison with CROPS and PDPA.
- Figure 7: Kmeans, PCA, Regression.
Note that each of the figure scripts above depends on csv/rds data files which are pre-computed/cached by other R scripts (see details below or in the Makefile). To redo all computations from scratch, you can delete these files via "rm *.csv *.rds" and then re-run "make".
Also note that there is a packages.R script which takes care of installing/attaching packages; it is run at the start of each of these R scripts.
Efficient data structure for computing Optimal Partitioning models.
figure-four-models.R and figure-three-iterations-new.R have some ideas for a figure for this.
The basic structure is a binary search tree, probably best implemented on top of a C++ STL map (std::map).
For related work, see http://wcipeg.com/wiki/Convex_hull_trick#Fully_dynamic_variant
Assume a data set of size d.
Assume we have a solver OP(lambda) -> segments, loss which runs in O(d log d) time.
Let N be the number of elements in the data structure.
Public Methods
- getModelForPenalty(lambda): find stored penalties l1<=lambda<l2 in O(log N). If l1==lambda then return the stored segments for l1 in O(1). If the stored models satisfy segments2=segments1-1, then no intermediate model exists, so return the answer from the stored models in O(1). Otherwise run OP(lambda) and store the result for lambda in O(1).
Private Methods
- insertModel(lambda, segments, loss): O(log N) – useful if we have a bunch of pre-computed models on disk. A sketch of both methods follows this list.
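A minimal C++ sketch of this interface, assuming a hypothetical Model struct (segment count plus loss) and an OP solver passed in as a callback; these names are illustrations, not the repo's actual API, and the segments2=segments1-1 case is resolved here by comparing penalized costs loss + lambda * segments.

#include <functional>
#include <iterator>
#include <map>

// Hypothetical record of one stored OP solution (not the repo's
// actual API): number of segments and total loss for one penalty.
struct Model {
  int segments;
  double loss;
};

// Sketch of the proposed structure on top of std::map, which keeps
// penalties sorted and gives O(log N) search/insert.
class PenaltyModelMap {
public:
  // The OP solver (assumed O(d log d) per call) is passed in as a
  // callback so the sketch stays self-contained.
  explicit PenaltyModelMap(std::function<Model(double)> solver)
      : OP(solver) {}

  // getModelForPenalty(lambda): O(log N) search for stored
  // penalties l1 <= lambda < l2, then O(1) decisions.
  Model getModelForPenalty(double lambda) {
    std::map<double, Model>::iterator it2 = stored.upper_bound(lambda); // l2
    if (it2 != stored.begin()) {
      std::map<double, Model>::iterator it1 = std::prev(it2); // l1 <= lambda
      if (it1->first == lambda) {
        return it1->second; // exact hit: stored segments for l1.
      }
      if (it2 != stored.end() &&
          it2->second.segments == it1->second.segments - 1) {
        // segments2 = segments1 - 1: no intermediate model exists,
        // so pick the stored model with smaller penalized cost
        // loss + lambda * segments (one reading of the O(1) rule).
        const Model &m1 = it1->second;
        const Model &m2 = it2->second;
        return (m1.loss + lambda * m1.segments <=
                m2.loss + lambda * m2.segments) ? m1 : m2;
      }
    }
    Model result = OP(lambda);   // otherwise run the solver,
    insertModel(lambda, result); // and store the result for lambda.
    return result;
  }

private:
  // insertModel: O(log N); also useful for loading models that were
  // pre-computed on disk.
  void insertModel(double lambda, const Model &model) {
    stored[lambda] = model;
  }

  std::function<Model(double)> OP;
  std::map<double, Model> stored;
};

Since std::map keeps its keys sorted in a balanced binary search tree, upper_bound finds the l1<=lambda<l2 neighbors in O(log N), matching the complexities listed above.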
Application: compute the full path on the benchmark data, and check whether the min error we get from the target interval algorithm is the same as the actual min error. Analyze computation time: using the data structure we should be log-linear rather than quadratic.
Application: compare the O(N^2) Segment Neighborhood algorithm (jointseg::Fpsn) with our algorithm. (1) We can do parallel evaluation with OP. (2) Only 73% of SN models are actually computable via OP. (3) The modelSelection code is linear time.
Parallel full path algorithm.
figure-crops-compare-data.R makes figure-crops-compare-data.rds
figure-crops-compare.R makes
??? these files are old…
figure-regression-simple-data.R computes figure-regression-simple-data.csv
figure-regression-simple.R makes
figure-pca-simple-data.R computes figure-pca-simple-data.csv
figure-pca-simple.R makes
figure-kmeans-simple-data.R makes figure-kmeans-simple-data.csv
figure-kmeans-simple.R makes
The Advanced R chapter on Rcpp shows the following example of an exported Rcpp function that returns a C++ std::map:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
std::map<double, int> tableC(NumericVector x) {
std::map<double, int> counts;
int n = x.size();
for (int i = 0; i < n; i++) {
counts[x[i]]++;
}
return counts;
}
but what is returned to R in this case? Presumably Rcpp's wrap() converts the std::map<double, int> into a named integer vector, with the map keys as the names, but this is worth verifying interactively.
Rcpp modules vignette explains how to expose a C++ class/methods to R (implemented internally using an external pointer to an instance of the class). Get started via
Rcpp::Rcpp.package.skeleton("testmod", module=TRUE)
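As a concrete reminder of the pattern, here is a minimal RCPP_MODULE sketch; the Counter class and counter_module names are made up for illustration and are not part of this repo.

#include <Rcpp.h>

// Toy class to expose to R; R manipulates it through an external
// pointer to an instance, as noted above.
class Counter {
public:
  Counter() : total(0) {}
  void add(double x) { total += x; }
  double get() { return total; }
private:
  double total;
};

// RCPP_MODULE generates the glue code that registers the class,
// its constructor, and its methods with R.
RCPP_MODULE(counter_module) {
  Rcpp::class_<Counter>("Counter")
    .constructor()
    .method("add", &Counter::add)
    .method("get", &Counter::get);
}

From R, once the module is loaded, usage should look roughly like ctr <- new(Counter); ctr$add(3); ctr$get().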
figure-three-iterations-new.R makes figure-three-iterations-new.tex. TODO: highlight sure/unsure regions and what is stored by the algorithm; see also figure-four-models.R.
- binseg.timing.R: simulations.
- fullpath.db.binseg.R: binseg on loss values from simulated and real data.
- figure-fullpath-db-binseg.R makes
figure-chipseq-cv.R makes
figure-fullpath-grid-timing.R makes
figure-fullpath-db-timing.R makes
figure-loss-small-evals.tex and figures.pdf
no.intermediates.selected.R exhibits a set of valid loss/complexity values for which no intermediates are selected – how many pops does this cause?
loss.small.R computes the full path of loss values for all 13,000+ neuroblastoma data sets with fewer than 1000 data points.
figure-loss-small.R visualizes the corresponding model selection functions. viz
figure-loss-small-data.R also shows the data sets and segmentation models. viz