Extendable dag #42

ognian- · 2022-12-02T14:03:43Z

Adding the ability to extend an existing DAG with additional features. Also includes various improvements in the DAG storage-view design.

…cleanup

willdumm

This all looks great, to the extent that I can ingest it all at once!

I just want to check in again that you have a vision for how this new design will lend itself to an implementation of trimming via DAG views, and especially how such views will restrict the child/parent nodes accessible from each node in the trimmed DAG. I think you've said 'yes' to this question in our meetings.
For ease of code reading, could you explain (here or somewhere in docstrings) your naming conventions for template parameters (e.g. Tag, CRTP, etc.)? It all seems sensible, but having a reference would be helpful (and if there already is one, sorry I missed it!)
This is a stupid question, but I gather that deduplicate.hpp provides a way to add a node/edge feature which is a pointer to some unique storage, for example for efficient storage of node compact genomes. Could you walk us through an example of how this is used, for an arbitrary feature?
Will/does merging use this ^ deduplicate infrastructure?

ognian- · 2022-12-03T08:32:53Z

Thanks! Here are some general changes before addressing your questions:

FeatureReader and FeatureWriter are replaced by FeatureConstView and FeatureMutableView.
There is now ExtraFeatureStorage, which is meant for features on nodes/edges, and is an additional DAG-global storage for the feature. It's currently only used for Deduplicate's unique storage.
Fundamental storage and operations are now implemented as features too, rather than embedded into the views. That's Neighbors for nodes, Endpoints for edges, and Connections for the DAG.

And answers respectively:

Masking out some of the edges is what will be used for trimming and sampling views. Depending on how the view is expected to be used, there are multiple options with varying space/time trade-offs. In any case only the data needed for the topology itself is stored, all heavy annotations like EdgeMutations/CompactGenome etc. does not require extra space (not even a pointer). The most compact (but slow) way is to store an ordered vector of edge ids, and use binary search to check if they participate in the subset. With this approach the node and edge ids are no longer contiguous, but that shouldn't be an issue. The approach that matches the performance of a regular DAG is to also store Neighbors objects, so the only memory saving benefit is from the annotations. We also have room to experiment with fine tuning in between the two extremes, like bloom filter, perfect hash functions, etc.
Sorry about parameter names, I've planned to name them better, but decided to leave it for a future iteration, together with improving documentation.

DS - DAG Storage
DV - DAG View
NC - Nodes container
EC - Edges container
F - Feature
Fs - Features pack
Tag - Usually equal to Feature, except for the case of Deduplicate<Feature>. It is used for making Deduplicate behave as it's parameter in some places, and differently in others.
CRTP - Curiously recurring template pattern - a useful type of compile-time polymorphism idiom. This lets the base class cast itself to the derived class and access data and functions that are expected to be present on deriving classes.

Deduplicate<Feature> can be used on nodes/edges (not on the DAG itself) for any existing Feature. It stores the unique feature data in a concurrent_ordered_set<Feature>, and on each node/edge it stores a const Feature* pointer. Stored features must be read-only, in order to preserve hashing and equality, and the underlying container ensures pointer stability when adding/removing items. Currently it is an append-only implementation, meaning that if nodes/edges are removed and a unique feature is no longer pointed to, it won't be evicted. Reference counting could be added to solve this. Deduplicate removes all FeatureMutableView functions, and allows only entirely replacing the assigned Feature by operator=(Feature&&) on the node/view. It would be a nice addition to allow Deduplicate to use external storage, so multiple DAGs can share the same pool of unique annotations.
I did have merging use Deduplicate seamlessly for CompactGenome and LeafSet in an earlier design iteration, and will bring it back when adding CladeUnion to LeafSet for issue Counting the number of trees in which each node (or clade) takes part #30.

willdumm · 2022-12-05T16:14:49Z

Thanks Ognian! This all looks great to me

ognian- added 3 commits December 1, 2022 20:46

Added capability to extend a DAG trough a view; Tests passing; needs …

23abf48

…cleanup

Split implementation

23054d8

Final cleanup; ready for PR

d4dbf43

ognian- requested a review from willdumm December 2, 2022 14:03

willdumm approved these changes Dec 2, 2022

View reviewed changes

willdumm and others added 3 commits December 5, 2022 10:19

Add test for MinWeightSample

4eee3c4

Add test file for MinWeightSample

1d186b8

Fix Node/EdgeId comparison operators

90eea0a

ognian- merged commit f556f8f into main Dec 5, 2022

ognian- deleted the extendable-dag branch December 5, 2022 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extendable dag #42

Extendable dag #42

ognian- commented Dec 2, 2022

willdumm left a comment

ognian- commented Dec 3, 2022

willdumm commented Dec 5, 2022

Extendable dag #42

Extendable dag #42

Conversation

ognian- commented Dec 2, 2022

willdumm left a comment

Choose a reason for hiding this comment

ognian- commented Dec 3, 2022

willdumm commented Dec 5, 2022