A general system to store particle metadata #64
Comments
Thank you so much for thinking about this, Aaron. I think this is a great start. One thing I've been thinking about is how different particles might share metadata. For example, imagine that A, B, and C are all different particles in the population; we would like only A and B to share the guide tree, while everyone else keeps their own. I'd like a little clarification on Approach 1: just to make sure, is the considerable memory overhead you describe the cost of storing the pointers themselves, rather than the cached values?
In Approach 1, every particle (or possibly also node or other object) would hold a collection of shared_ptrs to cached values, and yes, this would incur substantial overhead in terms of heap space to store these pointers; many particles might point to the same cached value. So while this approach will actively delete cached values when they become obsolete, it's rather expensive.

In your example for the topology, it seems that you want to store metadata keyed on the current forest topology. One way to do this (maybe not the most efficient) would be to use a vector of binary split encodings as the hash key, e.g. a vector< vector< bool > >; the splits would have to be sorted. Fundamentally there needs to be some way to create a hash key out of the forest topology -- another way could involve a function that does a forest traversal to compute a hash key.
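A minimal sketch of the split-encoding idea, assuming one bit per leaf for each split and a boost-style hash combine; the names (Split, ForestEncoding, hash_forest) are illustrative rather than anything in the codebase:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Split = std::vector<bool>;            // one bit per leaf: which side of the edge
using ForestEncoding = std::vector<Split>;  // all splits present in the forest

// Hash the forest topology by sorting its split encodings and combining
// the per-split hashes (std::hash is specialized for std::vector<bool>).
std::size_t hash_forest(ForestEncoding splits)
{
    std::sort(splits.begin(), splits.end());
    std::size_t seed = 0;
    std::hash<Split> hash_split;
    for (const Split& s : splits)
        seed ^= hash_split(s) + 0x9e3779b9 + (seed << 6) + (seed >> 2);  // hash_combine-style mix
    return seed;
}
```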
This is all totally cool. I do think that we need to make a diagram or something (in the wiki) to keep track of the design decisions. I'll have a go at trying to enumerate them:

Anything else? To what extent particles are immutable would seem to impact a number of these decisions.
@koadman -- I don't know if my little brain can keep track of anything other than option 1. @cmccoy and I just had a pow-wow about this, and here are a couple of things we thought about.
I think that we might get a lot of mileage out of actually sketching out some use-cases.
Thanks for raising these discussion points guys. I'm now thinking the following things:
Not to interrupt the chain of discussion here, but would it be possible to keep a separate metadata map for each generation? And 1000 maps isn't all that many...
Connor and I just had a discussion about whether it's necessary to store metadata for every generation. I think "maybe not." Here are some use cases we can think of:

Can you think of others?
My take on metadata is that it is data about the model state. Topology guides, ML estimates of branch length, distance guides, and log likelihood are metadata. Model parameters (e.g. rate), on the other hand, are part of the model state itself and not metadata. Under this definition, any bits of data that help us calculate likelihoods and/or proposals efficiently would be metadata; anything that goes into our formal definition of the likelihood of the data given a model is part of the model. I would prefer to have model components be explicitly named variables in the code.

To me it seems like how deeply in the history the metadata gets stored really depends on the type of metadata. For cached likelihoods we probably want these for the entire active history. For proposal distributions we only need the current generation.
Agreed (Connor helped me understand). Responding to Connor's note about 1000 per generation: that's also rather optimistic from the particle diversity side of things, because these lineages do tend to coalesce. Storing some particle metadata deeply and others not wouldn't seem too hard for option 1.
Connor has now implemented topology-guided proposals, which motivates further discussion on this thread. Should I just say "go!" or should we schedule a Skype session?
Metadata about particles is useful and currently includes things such as a cached log likelihood, but could also include a cached ML distance estimate or other information. Currently metadata is stored outside the particle, e.g. in OnlineCalculator, and this requires the class containing the metadata to be notified of particle deletions so stale cache data can be cleared. At present this is done by storing, inside the particle, a reference to the class that needs to be notified; this approach does not scale to an arbitrary number of metadata values and container classes.
Approach 1:
Store metadata inside the particle itself. An interface to fetch particular bits of metadata needs to be devised. This could be as simple as an unordered_map going from some key type (string? a compiler-mangled class name?) to a base class shared_ptr. The class creating metadata would maintain a set of weak_ptr's to the metadata objects and before accessing a particular piece of metadata would check for metadata validity using weak_ptr::expired(). This allows the metadata to be deleted when particles are deleted without needing to actively notify the class creating metadata.
One disadvantage of this approach is that creating per-particle caches will incur considerable memory overhead.
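A minimal sketch of Approach 1, under assumed names (MetadataBase, Particle::metadata, and LikelihoodCalculator are illustrative, not the project's actual classes): the particle owns its metadata through shared_ptrs, and the creating class keeps only weak_ptrs that it can test with expired().

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct MetadataBase { virtual ~MetadataBase() = default; };
struct CachedLogLike : MetadataBase { double value = 0.0; };

struct Particle {
    // ... model state ...
    std::unordered_map<std::string, std::shared_ptr<MetadataBase>> metadata;
};

class LikelihoodCalculator {
public:
    double log_like(Particle& p) {
        auto it = p.metadata.find("log_like");
        if (it != p.metadata.end())
            return static_cast<CachedLogLike&>(*it->second).value;

        auto cached = std::make_shared<CachedLogLike>();
        cached->value = compute(p);        // the expensive step
        p.metadata["log_like"] = cached;   // the particle owns the cache entry
        created_.push_back(cached);        // the creator keeps only a weak reference
        return cached->value;
    }

    // Drop bookkeeping for metadata whose owning particles have been deleted.
    void prune_expired() {
        created_.erase(
            std::remove_if(created_.begin(), created_.end(),
                           [](const std::weak_ptr<MetadataBase>& w) { return w.expired(); }),
            created_.end());
    }

private:
    double compute(const Particle&) const { return 0.0; }   // placeholder likelihood
    std::vector<std::weak_ptr<MetadataBase>> created_;
};
```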
Approach 2:
Create a global metadata store via a static instance. This would be an unordered_multimap from particle pointer to metadata. e.g.
unordered_multimap< particle*, pair< string, metadata_base* > >
This solves the stale cache problem by allowing a single global metadata store to be notified at the time of particle/node/etc. deletion. This approach may have more problems with multithreading and concurrency than the first approach.
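A minimal sketch of Approach 2, with illustrative names and shared_ptr ownership substituted for the raw metadata_base* above; the key point is the single static store plus an explicit notification hook at particle deletion time.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct Particle;
struct MetadataBase { virtual ~MetadataBase() = default; };

class MetadataStore {
public:
    static MetadataStore& instance() {
        static MetadataStore store;   // the single global store
        return store;
    }

    void put(const Particle* p, const std::string& key,
             std::shared_ptr<MetadataBase> value) {
        entries_.emplace(p, std::make_pair(key, std::move(value)));
    }

    std::shared_ptr<MetadataBase> get(const Particle* p, const std::string& key) const {
        auto range = entries_.equal_range(p);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.first == key) return it->second.second;
        return nullptr;
    }

    // Whatever deletes a particle must call this so stale entries are dropped.
    void on_particle_deleted(const Particle* p) { entries_.erase(p); }

private:
    std::unordered_multimap<const Particle*,
                            std::pair<std::string, std::shared_ptr<MetadataBase>>> entries_;
};
```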
Approach 3:
Maintain metadata inside the generating class as it is currently done, associating the metadata with a weak_ptr to the object on which the metadata is stored. Before accessing the metadata, check for expiration of the weak_ptr. This approach is currently slightly frustrating because hash functions are not defined for std::weak_ptr, apparently for no reason other than lack of time on the part of the C++ standards committee:
http://stackoverflow.com/questions/4750504/why-was-stdhash-not-defined-for-stdweak-ptr-in-c0x
so the hash key could be the memory address.
This approach has the advantage of providing a lot of flexibility in designing metadata storage, but the disadvantage that caches might grow quite large because they are not actively trimmed back to surviving particles. Size could be managed with one of the classic MFU-approximation cache strategies, and a generic implementation could be made for arbitrary data/metadata.
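A minimal sketch of Approach 3 under assumed names: the generating class keeps its own cache keyed on the object's raw address (since std::hash has no weak_ptr specialization), with a weak_ptr stored alongside the value so staleness can be detected via expired().

```cpp
#include <memory>
#include <unordered_map>

struct Particle;

// Cache owned by the metadata-generating class, keyed on the particle's address.
template <typename Value>
class ExternalCache {
public:
    void put(const std::shared_ptr<Particle>& p, Value v) {
        cache_[p.get()] = Entry{p, std::move(v)};
    }

    // Returns nullptr when there is no entry or the particle has since been deleted.
    const Value* get(const std::shared_ptr<Particle>& p) {
        auto it = cache_.find(p.get());
        if (it == cache_.end()) return nullptr;
        if (it->second.owner.expired()) {   // stale entry: owning particle is gone
            cache_.erase(it);
            return nullptr;
        }
        return &it->second.value;
    }

private:
    struct Entry {
        std::weak_ptr<Particle> owner;   // used only to detect deletion
        Value value;
    };
    std::unordered_map<const Particle*, Entry> cache_;
};
```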
The immediately motivating use case is that when node merges are proposed non-uniformly, e.g. using a distance-guided approach, the same pair gets proposed many times at each generation. Calculating the ML distance for the pair is a rather expensive operation (e.g. 10 iterations of optimization) but results in much higher particle log likelihood. A cache would allow the ML distance to be calculated once for the node pair and saved, rather than recomputed dozens or hundreds of times.
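As a sketch of that use case (assumed names; optimize_ml_distance stands in for whatever routine actually does the iterative optimization), the cache can simply memoize on the unordered node pair:

```cpp
#include <map>
#include <utility>

struct Node;

// Placeholder for the expensive iterative ML distance optimization.
double optimize_ml_distance(const Node*, const Node*) { return 0.0; }

class MLDistanceCache {
public:
    double distance(const Node* a, const Node* b) {
        if (b < a) std::swap(a, b);                 // make the key order-independent
        auto it = cache_.find({a, b});
        if (it != cache_.end()) return it->second;  // reuse the earlier optimization
        double d = optimize_ml_distance(a, b);
        cache_.emplace(std::make_pair(a, b), d);
        return d;
    }
private:
    std::map<std::pair<const Node*, const Node*>, double> cache_;
};
```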