A general system to store particle metadata #64
Comments
Thank you so much for thinking about this, Aaron. I think this is a great start. One thing I've been thinking about is how different particles might share metadata. For example, imagine that A, B, and C are all different particles in the population; we would like only A and B to share the guide tree, while everyone else keeps their own. I'd like a little clarification on Approach 1: just to make sure, is the considerable memory overhead you describe the cost of storing the pointers themselves, rather than the cached values?
In Approach 1, every particle (or possibly also node or other object) would hold a collection of shared_ptrs to cached values, and yes, this would incur substantial overhead in terms of heap space to store these pointers; many particles might point to the same cached value. So while this approach will actively delete cached values when they become obsolete, it's rather expensive.

In your example for the topology, it seems that you want to store metadata keyed on the current forest topology. One way to do this (maybe not the most efficient) would be to use a vector of binary split encodings as the hash key, e.g. a vector< vector< bool > >; the splits would have to be sorted. Fundamentally there needs to be some way to create a hash key out of the forest topology -- another way could involve a function that does a forest traversal to compute a hash key.
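A minimal sketch of the split-encoding idea, assuming one bit per leaf for each split and a boost-style hash combine; the names (Split, ForestEncoding, hash_forest) are illustrative rather than anything in the codebase:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Split = std::vector<bool>;            // one bit per leaf: which side of the edge
using ForestEncoding = std::vector<Split>;  // all splits present in the forest

// Hash the forest topology by sorting its split encodings and combining
// the per-split hashes (std::hash is specialized for std::vector<bool>).
std::size_t hash_forest(ForestEncoding splits)
{
    std::sort(splits.begin(), splits.end());
    std::size_t seed = 0;
    std::hash<Split> hash_split;
    for (const Split& s : splits)
        seed ^= hash_split(s) + 0x9e3779b9 + (seed << 6) + (seed >> 2);  // hash_combine-style mix
    return seed;
}
```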
This is all totally cool. I do think that we need to make a diagram or something (in the wiki) to keep track of the design decisions. I'll have a go at trying to enumerate them:

Anything else? To what extent particles are immutable would seem to impact a number of these decisions.
@koadman -- I don't know if my little brain can keep track of anything other than option 1. @cmccoy and I just had a pow-wow about this, and here are a couple of things we thought about.
I think that we might get a lot of mileage out of actually sketching out some use-cases.
Thanks for raising these discussion points guys. I'm now thinking the following things:
Not to interrupt the chain of discussion here, but would it be possible to keep a separate metadata map for each generation? And 1000 maps isn't all that many...
Connor and I just had a discussion about whether it's necessary to store metadata for every generation. I think "maybe not." Here are some use cases we can think of:

Can you think of others?
My take on metadata is that it is data about the model state. Topology guides, ML estimates of branch length, distance guides, and log likelihood are metadata. Model parameters (e.g. rate), on the other hand, are part of the model state itself and not metadata. Under this definition, any bits of data that help us calculate likelihoods and/or proposals efficiently would be metadata; anything that goes into our formal definition of the likelihood of the data given a model is part of the model. I would prefer to have model components be explicitly named variables in the code.

To me it seems like how deeply in the history the metadata gets stored really depends on the type of metadata. For cached likelihoods we probably want these for the entire active history. For proposal distributions we only need the current generation.
Agreed (Connor helped me understand). Responding to Connor's note about 1000 per generation: that's also rather optimistic from the particle diversity side of things, because these lineages do tend to coalesce. Storing some particle metadata deeply and others not wouldn't seem too hard for option 1.
Connor has now implemented topology-guided proposals, which motivates further discussion on this thread. Should I just say "go!" or should we schedule a Skype session?
Metadata about particles is useful and currently includes things such as a cached log likelihood, but could also include a cached ML distance estimate or other information. Currently metadata is stored outside the particle, e.g. in OnlineCalculator, and this requires the class containing the metadata to be notified of particle deletions so stale cache data can be cleared. At present this is done by storing, inside the particle, a reference to the class that needs to be notified; this approach does not scale to an arbitrary number of metadata values and container classes.
Approach 1:
Store metadata inside the particle itself. An interface to fetch particular bits of metadata needs to be devised. This could be as simple as an unordered_map going from some key type (string? a compiler-mangled class name?) to a base class shared_ptr. The class creating metadata would maintain a set of weak_ptr's to the metadata objects and before accessing a particular piece of metadata would check for metadata validity using weak_ptr::expired(). This allows the metadata to be deleted when particles are deleted without needing to actively notify the class creating metadata.
One disadvantage of this approach is that creating per-particle caches will incur considerable memory overhead.
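A minimal sketch of Approach 1, under assumed names (MetadataBase, Particle::metadata, and LikelihoodCalculator are illustrative, not the project's actual classes): the particle owns its metadata through shared_ptrs, and the creating class keeps only weak_ptrs that it can test with expired().

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct MetadataBase { virtual ~MetadataBase() = default; };
struct CachedLogLike : MetadataBase { double value = 0.0; };

struct Particle {
    // ... model state ...
    std::unordered_map<std::string, std::shared_ptr<MetadataBase>> metadata;
};

class LikelihoodCalculator {
public:
    double log_like(Particle& p) {
        auto it = p.metadata.find("log_like");
        if (it != p.metadata.end())
            return static_cast<CachedLogLike&>(*it->second).value;

        auto cached = std::make_shared<CachedLogLike>();
        cached->value = compute(p);        // the expensive step
        p.metadata["log_like"] = cached;   // the particle owns the cache entry
        created_.push_back(cached);        // the creator keeps only a weak reference
        return cached->value;
    }

    // Drop bookkeeping for metadata whose owning particles have been deleted.
    void prune_expired() {
        created_.erase(
            std::remove_if(created_.begin(), created_.end(),
                           [](const std::weak_ptr<MetadataBase>& w) { return w.expired(); }),
            created_.end());
    }

private:
    double compute(const Particle&) const { return 0.0; }   // placeholder likelihood
    std::vector<std::weak_ptr<MetadataBase>> created_;
};
```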
Approach 2:
Create a global metadata store via a static instance. This would be an unordered_multimap from particle pointer to metadata. e.g.
unordered_multimap< particle*, pair< string, metadata_base* > >
This solves the stale cache problem by allowing a single global metadata store to be notified at the time of particle/node/etc. deletion. This approach may have more problems with multithreading and concurrency than the first approach.
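A minimal sketch of Approach 2, with illustrative names and shared_ptr ownership substituted for the raw metadata_base* above; the key point is the single static store plus an explicit notification hook at particle deletion time.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct Particle;
struct MetadataBase { virtual ~MetadataBase() = default; };

class MetadataStore {
public:
    static MetadataStore& instance() {
        static MetadataStore store;   // the single global store
        return store;
    }

    void put(const Particle* p, const std::string& key,
             std::shared_ptr<MetadataBase> value) {
        entries_.emplace(p, std::make_pair(key, std::move(value)));
    }

    std::shared_ptr<MetadataBase> get(const Particle* p, const std::string& key) const {
        auto range = entries_.equal_range(p);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.first == key) return it->second.second;
        return nullptr;
    }

    // Whatever deletes a particle must call this so stale entries are dropped.
    void on_particle_deleted(const Particle* p) { entries_.erase(p); }

private:
    std::unordered_multimap<const Particle*,
                            std::pair<std::string, std::shared_ptr<MetadataBase>>> entries_;
};
```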
Approach 3:
Maintain metadata inside the generating class as it is currently done, associating the metadata with a weak_ptr to the object on which the metadata is stored. Before accessing the metadata, check for expiration of the weak_ptr. This approach is currently slightly frustrating because hash functions are not defined for std::weak_ptr, apparently for no reason other than lack of time on the part of the C++ standards committee:
http://stackoverflow.com/questions/4750504/why-was-stdhash-not-defined-for-stdweak-ptr-in-c0x
so the hash key could be the memory address.
This approach has the advantage of providing a lot of flexibility in designing metadata storage, but the disadvantage that caches might grow quite large because they are not actively trimmed back to surviving particles. Size could be managed with one of the classic MFU-approximation cache strategies, and a generic implementation could be made for arbitrary data/metadata.
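A minimal sketch of Approach 3 under assumed names: the generating class keeps its own cache keyed on the object's raw address (since std::hash has no weak_ptr specialization), with a weak_ptr stored alongside the value so staleness can be detected via expired().

```cpp
#include <memory>
#include <unordered_map>

struct Particle;

// Cache owned by the metadata-generating class, keyed on the particle's address.
template <typename Value>
class ExternalCache {
public:
    void put(const std::shared_ptr<Particle>& p, Value v) {
        cache_[p.get()] = Entry{p, std::move(v)};
    }

    // Returns nullptr when there is no entry or the particle has since been deleted.
    const Value* get(const std::shared_ptr<Particle>& p) {
        auto it = cache_.find(p.get());
        if (it == cache_.end()) return nullptr;
        if (it->second.owner.expired()) {   // stale entry: owning particle is gone
            cache_.erase(it);
            return nullptr;
        }
        return &it->second.value;
    }

private:
    struct Entry {
        std::weak_ptr<Particle> owner;   // used only to detect deletion
        Value value;
    };
    std::unordered_map<const Particle*, Entry> cache_;
};
```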
The immediately motivating use case is that when node merges are proposed non-uniformly, e.g. using a distance-guided approach, the same pair gets proposed many times at each generation. Calculating the ML distance for the pair is a rather expensive operation (e.g. 10 iterations of optimization) but results in much higher particle log likelihood. A cache would allow the ML distance to be calculated once for the node pair and saved, rather than recomputed dozens or hundreds of times.
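As a sketch of that use case (assumed names; optimize_ml_distance stands in for whatever routine actually does the iterative optimization), the cache can simply memoize on the unordered node pair:

```cpp
#include <map>
#include <utility>

struct Node;

// Placeholder for the expensive iterative ML distance optimization.
double optimize_ml_distance(const Node*, const Node*) { return 0.0; }

class MLDistanceCache {
public:
    double distance(const Node* a, const Node* b) {
        if (b < a) std::swap(a, b);                 // make the key order-independent
        auto it = cache_.find({a, b});
        if (it != cache_.end()) return it->second;  // reuse the earlier optimization
        double d = optimize_ml_distance(a, b);
        cache_.emplace(std::make_pair(a, b), d);
        return d;
    }
private:
    std::map<std::pair<const Node*, const Node*>, double> cache_;
};
```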