* compute covariance of spots across cell type contributions during MCMC and accumulate statistics -> clustering of spots
* perform DE between sets of spots clustered according to MCMC-accumulated covariance and based on uncertainty propagated to spot-level
* improve the code for freezing the scaling factors, and the CLI to trigger it
* compute means etc. on chain statistics:
- point estimates: mean, standard deviation, percentiles
- interval estimates: equal-tail and highest posterior density (HPD) intervals
- posterior covariance matrix
- posterior correlation matrix
- deviance information criterion (DIC)
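A minimal sketch of such per-parameter chain summaries in plain C++ (illustrative only, not the project's types; the HPD interval is taken as the narrowest window containing 95% of the sorted draws; posterior covariance / correlation and DIC would need the joint draws and are not shown):
  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Point and interval summaries of one scalar parameter's MCMC chain.
  struct ChainSummary {
    double mean, sd, q05, q50, q95, hpd_lo, hpd_hi;
  };

  ChainSummary summarize(std::vector<double> chain) {
    const std::size_t n = chain.size();
    double mean = 0;
    for (double x : chain) mean += x;
    mean /= n;
    double var = 0;
    for (double x : chain) var += (x - mean) * (x - mean);
    var /= (n - 1);
    std::sort(chain.begin(), chain.end());
    auto quantile = [&](double p) { return chain[static_cast<std::size_t>(p * (n - 1))]; };
    // Empirical 95% HPD interval: the narrowest window covering 95% of the draws.
    const std::size_t k = static_cast<std::size_t>(std::ceil(0.95 * n));
    std::size_t best = 0;
    for (std::size_t i = 0; i + k <= n; ++i)
      if (chain[i + k - 1] - chain[i] < chain[best + k - 1] - chain[best]) best = i;
    return {mean,           std::sqrt(var), quantile(0.05), quantile(0.50),
            quantile(0.95), chain[best],    chain[best + k - 1]};
  }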
* determine and tune the acceptance rate of Metropolis-Hastings sampling
* write out predicted means and variances, based on
- the NB conditional posterior
- the NM conditional posterior
- the Poisson conditional posterior
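For reference, the standard moments under the usual parameterizations (textbook facts, not taken from the code; the code's convention for p may differ, and the NM case is omitted here):
  \text{Poisson}(\lambda):\quad \mathrm{E}[x] = \lambda, \qquad \mathrm{Var}[x] = \lambda
  \text{NB}(r, p):\quad \mathrm{E}[x] = \frac{r\,p}{1-p}, \qquad \mathrm{Var}[x] = \frac{r\,p}{(1-p)^2}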
* re-consider split and merge algorithm
* for the splitting / merging:
- independently also optimize the parameters without splitting / merging
* experiment with splitting a single factor explaining everything
* add flag for consistent contribution marginals and expectation marginals
* contribution initialization is unneeded if Gibbs sampling does that as a first step
* compute perplexity, i.e. validation
* automatically determine single experiments in analyze.r
* store all iterations' parameters;
* compute auto-correlation
* print out acceptance / rejection statistics for MH parts
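For this and the acceptance-rate item above, a trivial per-variable counter would do; a sketch (names illustrative, not the code's actual interface):
  #include <cstddef>
  #include <iostream>

  // Per-variable Metropolis-Hastings acceptance bookkeeping.
  struct MHStats {
    std::size_t proposed = 0, accepted = 0;
    void record(bool was_accepted) {
      ++proposed;
      if (was_accepted) ++accepted;
    }
    double rate() const { return proposed ? double(accepted) / proposed : 0.0; }
  };

  // e.g. after each sweep:
  //   std::cerr << "theta prior MH acceptance rate: " << stats.rate() << std::endl;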
* test likelihood optimization using ML
* consider hierarchies of factors
* trans-experiment factors:
- by using data in blocks corresponding to the experiments
- alternative: make sense of trans-experiment profiles post-analysis
* experiment-wise mix-in of factors
- like normal factors (with features and weights in the same matrices), but treated differently when sampling theta
* use faster method for sampling from the Poisson distribution
* bring back statistical summary printing for vectors
* templatize scaling variables
* evaluate log-normal Poisson rate modeling
* differential expression analysis
* check the Dirichlet posteriors
- done! The feature side is as in that publication; the mixing side is different but ultimately analogous; I haven't written this down yet
* analysis script:
- create barplots for expected reads explained, aside from expected spots explained;
- also break-down experiment-wise
- aside from basing on theta, also consider using the assigned contributions
> do this across iterations, for each iteration in decreasing order (because of non-identifiability due to exchangeability!), coloring the first, say, in red and the last in black, to see whether (and if so, for how long) the distribution changes
implement t1, t2, g1, g2, s1, s2 to set limits for sampling
* experiment-constant factors?
* simpler than that would be factors that are zero in the other samples and otherwise free
Hierarchical Dirichlet features
a two-level hierarchy
When sampling a novel type (i.e. one that isn't already present in the global hierarchy), sample a Dirichlet-multinomial as the Dirichlet prior for it, with a scaling similar to that of the initial factors
currently, the nHDP results in trees with a lot of weight on a single branch; it would be good to get more balanced trees instead!
It would be possible to sample multinomially for all counts of a gene in a spot
Kmeans: use a Gaussian mixture approach, rather than 'hard' K-means
Do bagged updates
Consider re-setting the mixing counts from time to time
Consider performing a Dirichlet process in which the observations are count vectors
implement the HDP
Drop template-based approach in favor of inheritance
* is it possible to benefit even more from sparsity? Consider non-zero values; this might make various things faster
less important:
* collect log-likelihoods and write to file
* note that some sums across factors, genes, and / or spots may yield numbers greater than 2^16; however, the count data type uses 32 bits, so this should be unproblematic
* use something like libmagic to figure out file types
* consider using GPU for sampling contributions
* experiment with activating ARMA_NO_DEBUG in types.hpp
* write probabilistic unit tests for sampling routines
* consider using HDF5 for IO
* test possible benefits of JIT compiling with fixed G, S, T
- the speed-up was marginal: in a manual test the runtime went from roughly 2:05 min to 2:00 min. Unfortunately, I forgot whether I used all genes or only the top 1000; some speed-ups only become apparent with more parameters!
* use profile-generate / profile-use gcc flags
* enforcing medians as alternative to means
* print out some sort of Bayesian summary
-> percentiles: 0%, 5%, 25%, 50%, 75%, 95%, 100%
-> mean and variance
If gamma mixture proportions are used then the spot scaling should probably be deactivated
Feature-side hierarchy:
basically, introduce a plate for gene, type, and experiment.
Let its structure be identical to that of the gene and type plate.
I.e. we then have \lambda_{gst} = \phi_{gt} \phi_{gte} \theta_{st} \sigma_{s}, maybe times \sigma_{e}
Terminology: "groups" are what I formerly called "experiments".
It might be nice to turn the experiment-specific hyper-parameters into random variables
This might be feasible because so many parameters share each such hyper-parameter.
Exponential distributions could be used, and estimation be performed by MH.
It might also be nice to have the same kind of hierarchical structure on the mixing side
That way it would be possible for the theta priors to share information across groups.
The same generalized beta prime / compound gamma distributions that are used for the 2nd prior of the mixing components could also be used to sample the 2nd prior of the mixing weights.
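A compound gamma draw can be realized as two chained gamma draws; a hypothetical helper (not the project's sampler), noting that std::gamma_distribution takes shape and scale, i.e. scale = 1/rate:
  #include <random>

  // Draw from a compound gamma distribution:
  //   rate ~ Gamma(shape0, rate0),   x | rate ~ Gamma(shape, rate).
  template <typename RNG>
  double sample_compound_gamma(double shape, double shape0, double rate0, RNG &rng) {
    std::gamma_distribution<double> prior(shape0, 1.0 / rate0);
    const double rate = prior(rng);
    std::gamma_distribution<double> conditional(shape, 1.0 / rate);
    return conditional(rng);
  }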
Resurrect:
* experiment-specific phi prior estimation
* experiment scaling
Determine how similar the "real" local gene expression profiles (the component-wise product of local and global phi matrices) really are to the global ones.
Consider using such a similarity measure to determine homogeneity amongst the "real" local profiles, identifying outliers and associating them with other global profiles.
Fix the likelihood calculation of phi's p prior
Write DGE analysis code
* compare all factors with each other within each experiment
* requires code to compute the distribution
* the question of the baselines arises
- several approaches exist:
> arg min sum log FC
> arg min Jensen-Shannon divergence
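For the second criterion, a plain sketch of the Jensen-Shannon divergence between two normalized profiles (not the project's implementation); the baseline would then be the candidate minimizing it:
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Jensen-Shannon divergence between two profiles p and q, each summing to 1.
  double jensen_shannon(const std::vector<double> &p, const std::vector<double> &q) {
    auto kl = [](const std::vector<double> &a, const std::vector<double> &b) {
      double d = 0;
      for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] > 0) d += a[i] * std::log(a[i] / b[i]);
      return d;
    };
    std::vector<double> m(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
    return 0.5 * kl(p, m) + 0.5 * kl(q, m);
  }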
Verify the implementation of the experiment-dependent baseline feature
i.e.: lambda_gst = phi_gt global_phi_gt experiment_phi_g theta_st sigma_s
Re-implement Dirichlet features and mixing weights
Modularize calling modes:
* only global
* only local
* both
In each case, for the local features, we are NOT learning the priors
Study mean vs CV plots
Make sure the subroutine call in Metropolis Hastings for the prior of theta is inlined!
Simplify template boilerplate code by contracting: form a template struct that carries the combined information of the two current template arguments, and have the model / experiment etc. template types be instantiated with that bespoke template struct.
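A sketch of the contraction (type names are illustrative, not the actual classes):
  #include <vector>

  // Bundle the two current template arguments into one "type" struct ...
  template <typename Feature, typename Mixing>
  struct ModelType {
    using feature_t = Feature;
    using mixing_t = Mixing;
  };

  // ... and let Experiment / Model take only that bundle.
  template <typename Type>
  struct Experiment {
    typename Type::feature_t features;
    typename Type::mixing_t weights;
  };

  template <typename Type>
  struct Model {
    std::vector<Experiment<Type>> experiments;
  };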
It might be good to test for each effective local factor, whether or not it would fit better to other global factors
Add a baseline mixing weight factor?
* what would be the benefit of that?
- it would improve likelihood by pushing shared local regulation into the baseline rather than into each of the factor mixing weights.
- apart from complicated implications on the mixing rate, this should not change much how the weighted theta matrices look
- one could certainly try dimensionality reduction on matrices that leave out the local baseline weights; but since each experiment's sample factor activity distribution around the baseline might differ due to shifting cell type composition entropy, it is unfortunately unclear how much this would benefit dimensionality reduction
Then, what about global mixing weights?
Global baseline feature
Ensure that global_theta_prior is off when using Dirichlet mixtures; similarly actually for all theta priors
Add a sampling mode for the end, in which the theta values are randomized, and then contributions and theta values are updated iteratively.
This would be done multiple times, and the resulting theta matrix reported.
Importantly, in doing so, everything feature-side, as well as the spot scaling variables, and global theta priors should not be touched.
When local theta priors are used, then it's debatable whether those should be fixed or also randomized & updated.
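A pseudocode sketch of this mode (ModelT stands in for the actual model class; the member functions named here are placeholders for the corresponding existing routines):
  #include <cstddef>

  template <typename ModelT>
  void finalize_theta(ModelT &model, std::size_t num_restarts, std::size_t num_iterations) {
    for (std::size_t restart = 0; restart < num_restarts; ++restart) {
      model.randomize_theta();          // only theta is re-initialized
      for (std::size_t it = 0; it < num_iterations; ++it) {
        model.sample_contributions();   // feature side, spot scaling and
        model.sample_theta();           // global theta priors remain untouched
      }
      model.report_theta(restart);      // record this restart's theta matrix
    }
  }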
Feature hierarchy: tree
Perhaps such tree-shaped feature hierarchies might live in global scope, and one would keep the hierarchical experiment-local feature structure
Global baseline feature?
Sets of features with different priors; ones in which genes can be switched off, and other with broad support
Reconsider order of sampling steps in Experiment.
Could have:
0. baseline prior
1. baseline
2. theta prior
3. theta
4. phi prior
5. phi
Alternatively:
0. theta prior
1. theta
2. baseline prior
3. baseline
4. phi prior
5. phi
Up to now baseline was last, and by moving it forward perhaps its dynamics might be accelerated.
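One way to make the ordering explicit and easy to permute (enumerator names mirror the steps above and are illustrative):
  #include <vector>

  enum class Step { baseline_prior, baseline, theta_prior, theta, phi_prior, phi };

  // Current ordering; switching to the alternative is just a rearrangement.
  const std::vector<Step> sampling_order = {Step::baseline_prior, Step::baseline,
                                            Step::theta_prior,    Step::theta,
                                            Step::phi_prior,      Step::phi};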
Reconsider using expected values of contributions instead of a Gibbs sample
Reset experiments
Field models
* difference of log gammas of neighbors as pairwise energy term
* maybe kernel-based weighted sums over the neighborhood's observations and explanations are exactly the way to bring this into a gamma-based formalism
Write out vectors in the same format as matrices - i.e. give a column name
Add global feature updates to experiment resetting
Vectorize the sampling routines to simplify parallelization logic and thus to even out parallel workloads
Try out modified posteriors for baseline features. Pretend there are no local features... -> branch greedy_baseline
It might be beneficial even in single-experiment mode to have local feature baselines.
So perhaps it might be preferable to deactivate the global features for single-experiment mode.
Try out a momentum-enhanced Gibbs sampler.
This would integrate the difference vectors of the previous few Gibbs steps.
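A rough sketch of the idea (illustrative types, not the sampler's interface; note that adding momentum like this would no longer be an exact Gibbs sampler, it is an exploratory heuristic):
  #include <cstddef>
  #include <deque>
  #include <vector>

  // Keep the difference vectors of the last few Gibbs updates and add a damped
  // fraction of them to the newest draw.
  struct Momentum {
    std::deque<std::vector<double>> history;  // recent difference vectors
    std::size_t window = 3;                   // how many past steps to integrate
    double damping = 0.5;                     // fraction of the momentum applied

    std::vector<double> apply(const std::vector<double> &previous,
                              const std::vector<double> &gibbs_draw) {
      std::vector<double> diff(gibbs_draw.size());
      for (std::size_t i = 0; i < diff.size(); ++i)
        diff[i] = gibbs_draw[i] - previous[i];
      history.push_back(diff);
      if (history.size() > window) history.pop_front();
      std::vector<double> next = gibbs_draw;
      for (const auto &d : history)
        for (std::size_t i = 0; i < next.size(); ++i) next[i] += damping * d[i];
      return next;
    }
  };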
Re-introduce experiment scaling (integrate with spot scaling).
Put a prior on the characteristic length-scale and sample from its posterior if possible - otherwise do Metropolis-Hastings.
Use one shared, not gene-dependent feature prior.
Try setting higher prior values to the local weights than to the baseline.
Make loading more memory efficient: don't constantly re-allocate.
Perhaps first do a scan for the names, and then load them all in place?
Also: multi-thread!
A GUI gimmick showing current activities
-> this could also generate plots
Reactivate Dirichlet mixtures (need to fix the field posterior reliance on that type)
CLI capabilities to specify shared coordinate systems.
E.g. based on spot names a:23x22x0
Alternatively, or in combination with that: path prefixes, e.g. a:path.tsv, potentially even a:+0.5:path.tsv to specify both a coordinate system and a z-offset
As a bonus, when multiple samples are specified, one could use an automatic z-offset.
Furthermore, linear transforms might also be a nice thing
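A hypothetical parse of such an argument (nothing of this exists in the code yet; the syntax follows the examples above, i.e. coordsys[:zoffset]:path):
  #include <string>

  struct PathSpec {
    std::string coord_system;  // e.g. "a"
    double z_offset = 0;       // e.g. +0.5
    std::string path;          // e.g. "path.tsv"
  };

  // Parse "a:path.tsv" or "a:+0.5:path.tsv"; a plain path without prefix is kept as-is.
  PathSpec parse_path_spec(const std::string &arg) {
    PathSpec spec;
    auto first = arg.find(':');
    if (first == std::string::npos) {
      spec.path = arg;
      return spec;
    }
    spec.coord_system = arg.substr(0, first);
    auto rest = arg.substr(first + 1);
    auto second = rest.find(':');
    if (second == std::string::npos) {
      spec.path = rest;
    } else {
      spec.z_offset = std::stod(rest.substr(0, second));
      spec.path = rest.substr(second + 1);
    }
    return spec;
  }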
Predict field on a cube.
Investigate the likelihood riddle further.
Certainly, the missing field contribution is partly to blame.
But this issue is older than the fields.
* just checked the implementation of log_gamma and all its calls - and the implementation and handling of the 2nd argument seem to be all correct
* need to check the calls of gamma_distribution, too
* another thing: the p prior (the 2nd argument) in our case is mostly a negative odds... I'll need to check that it is used correctly in sampling and likelihood calculation
* does it have to do with how that p prior is used?
The global feature log likelihood decreases because the values become larger
Merging and splitting could be well compatible with underlying factors.
E.g. in the multiple binary contrast setting, one might consider flipping the membership of factors
Need a tool to construct and visualize iso surfaces
Why does usage of fields apparently entail that weak factors stay away from zero? See below.
The fact that the feature and mixing realizations have different sample function types indicates that the two might best be served by different classes, though possibly derived from the same base type
The problem with the decreasing global feature likelihood seems to be mostly solved. (There appears to be some marginal, slow decrease of the global feature likelihood, but it seems to coincide with an equally slow increase of the data likelihood; overall the joint log posterior is perhaps still decreasing, but not by much - all this with 20 factors on one data set with the top 1000 genes.)
Seems the problem was really caused by the wrong initialization of contributions_gene_type and contributions_spot_type; as there are unequal numbers of spots and genes, they get unequal treatment in the marginalization for the Gibbs sampling of the other.
What remains to be resolved is the problem of the non-vanishing tail of the factor activity distribution, see the factor activity barplot.
This should be analyzed with
multiScoop --localthetapriors -t 20 -v --top 1000 --sample contributions,phi,theta,theta_prior
or something like that.
(-> feature hierarchy does not seem to have this problem - not true?)
The number of genes used is definitely of importance! For the top 1000 genes there is only a linear decrease, but it seems that if more genes are used then there is also a much reduced baseline.
However, the fields do make a difference... without fields on all genes the tail does vanish; using fields on all genes it doesn't
Also, the results seem to be of rather bad quality with
multiScoop -t 20 -v --top 1000 --sample contributions,phi,theta
check whether this has to do with initialization
-> also seems to be the case for feature hierarchy
Both of the above problems (non-vanishing tails and odd quality) seem to persist both with the CLI switch --localthetapriors and when also sampling gene priors.
Generally, problems should be analyzed in a minimal setting, i.e. with
multiScoop -t 20 --localthetapriors -v --top 1000 --sample contributions,phi,theta
Try inheriting the partial models from Matrix
Profile once in a while
Put in warning / fix always putting samples in a joint coordinate system
For flipping factors, consider the following objective functions:
1. sum of Kullback-Leibler divergence between global and local profile
2. likelihood-based; consider the number of gene-type attributed reads locally and globally
In any case, some kind of procedure needs to be derived, as each factor can realistically be compared with each other factor in O(T^2), but we cannot test all assignments.
1. Greedy: do the T^2 comparisons, find the closest pair, link them, then proceed (see the sketch below)
2. Only consider one factor per iteration
a. consider exchange with one other factor
b. consider exchange with all other factors
It seems subtracting the local attributions from the global ones and re-attributing the possible link-candidate would favor shattering.
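A sketch of the greedy variant with the KL objective (plain vectors instead of the project's matrix types; profiles are assumed normalized to sum to 1, and the numbers of local and global factors are assumed equal):
  #include <cmath>
  #include <cstddef>
  #include <limits>
  #include <vector>

  using Profile = std::vector<double>;  // one factor's normalized expression profile

  double kl(const Profile &p, const Profile &q) {
    double d = 0;
    for (std::size_t g = 0; g < p.size(); ++g)
      if (p[g] > 0) d += p[g] * std::log(p[g] / q[g]);
    return d;
  }

  // Greedily link each local factor to the closest still-unlinked global factor.
  std::vector<std::size_t> greedy_link(const std::vector<Profile> &local,
                                       const std::vector<Profile> &global) {
    const std::size_t T = local.size();
    std::vector<std::size_t> link(T);
    std::vector<bool> local_done(T, false), global_done(T, false);
    for (std::size_t step = 0; step < T; ++step) {
      double best = std::numeric_limits<double>::infinity();
      std::size_t bl = 0, bg = 0;
      for (std::size_t l = 0; l < T; ++l)
        for (std::size_t g = 0; g < T; ++g)
          if (!local_done[l] && !global_done[g]) {
            const double d = kl(local[l], global[g]);
            if (d < best) { best = d; bl = l; bg = g; }
          }
      link[bl] = bg;
      local_done[bl] = true;
      global_done[bg] = true;
    }
    return link;
  }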
Two avenues to think ahead:
* factor factorization
* local factor exchange
Two things to urgently do:
1. fields
(1.5 the above factor exchange - if it works it might polish up the last of the BC data glitches - although potentially they're real; then again perhaps not, since - while repeatedly observed - they still appear in odd spatial patterns across the slices).
2. filtering
Evaluate likelihood after contributions have been updated, not just at the end of the iteration since by that time the lambda_gst are outdated and don't fit with the updated spot scaling and feature baseline.
Add a guard mechanism against trying to calculate likelihood when the lambdas are out-of-date. Also relevant after loading.
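A sketch of such a guard (member names are placeholders):
  #include <cassert>

  // Any update that invalidates the cached lambda_gst marks them stale; the
  // likelihood routine then refuses to run until they have been recomputed.
  struct LambdaGuard {
    bool lambdas_up_to_date = false;

    void mark_stale() { lambdas_up_to_date = false; }  // after sampling / loading
    void mark_fresh() { lambdas_up_to_date = true; }   // after recomputing lambda

    void check() const {
      assert(lambdas_up_to_date && "lambda_gst out of date; recompute before likelihood");
    }
  };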
Consider giving the contained classes references to their containing class.
Write command-based CLI, in which the default command is "run". Options are rerun (based on shell call file), continue (possibly on new data), dge, predict.
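A minimal dispatch sketch for this (the handler functions are placeholder stubs for the respective modes):
  #include <functional>
  #include <map>
  #include <string>

  // Placeholder entry points; the real tool would call the respective modes here.
  int run(int, char **) { return 0; }      // default command
  int rerun(int, char **) { return 0; }    // re-issue the stored shell call
  int resume(int, char **) { return 0; }   // "continue" on the CLI ("continue" is a C++ keyword)
  int dge(int, char **) { return 0; }
  int predict(int, char **) { return 0; }

  int dispatch(int argc, char **argv) {
    const std::map<std::string, std::function<int(int, char **)>> commands = {
        {"run", run}, {"rerun", rerun}, {"continue", resume},
        {"dge", dge}, {"predict", predict}};
    if (argc < 2) return run(argc, argv);              // no command given: default to "run"
    auto it = commands.find(argv[1]);
    if (it == commands.end()) return run(argc, argv);  // treat argv[1] as an option to "run"
    return it->second(argc - 1, argv + 1);
  }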
Split & merge
Why do the fields not exhibit sparsity?
How do we determine P(x|x*) for Metropolis-Hastings steps with subsequent Gibbs sampling?