Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extendable dag #42

Merged
merged 6 commits into from
Dec 5, 2022
Merged

Extendable dag #42

merged 6 commits into from
Dec 5, 2022

Conversation

ognian-
Copy link
Contributor

@ognian- ognian- commented Dec 2, 2022

Adding the ability to extend an existing DAG with additional features. Also includes various improvements in the DAG storage-view design.

@ognian- ognian- requested a review from willdumm December 2, 2022 14:03
Copy link
Contributor

@willdumm willdumm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great, to the extent that I can ingest it all at once!

  • I just want to check in again that you have a vision for how this new design will lend itself to an implementation of trimming via DAG views, and especially how such views will restrict the child/parent nodes accessible from each node in the trimmed DAG. I think you've said 'yes' to this question in our meetings.
  • For ease of code reading, could you explain (here or somewhere in docstrings) your naming conventions for template parameters (e.g. Tag, CRTP, etc.)? It all seems sensible, but having a reference would be helpful (and if there already is one, sorry I missed it!)
  • This is a stupid question, but I gather that deduplicate.hpp provides a way to add a node/edge feature which is a pointer to some unique storage, for example for efficient storage of node compact genomes. Could you walk us through an example of how this is used, for an arbitrary feature?
  • Will/does merging use this ^ deduplicate infrastructure?

@ognian-
Copy link
Contributor Author

ognian- commented Dec 3, 2022

Thanks! Here are some general changes before addressing your questions:

  • FeatureReader and FeatureWriter are replaced by FeatureConstView and FeatureMutableView.
  • There is now ExtraFeatureStorage, which is meant for features on nodes/edges, and is an additional DAG-global storage for the feature. It's currently only used for Deduplicate's unique storage.
  • Fundamental storage and operations are now implemented as features too, rather than embedded into the views. That's Neighbors for nodes, Endpoints for edges, and Connections for the DAG.

And answers respectively:

  1. Masking out some of the edges is what will be used for trimming and sampling views. Depending on how the view is expected to be used, there are multiple options with varying space/time trade-offs. In any case only the data needed for the topology itself is stored, all heavy annotations like EdgeMutations/CompactGenome etc. does not require extra space (not even a pointer). The most compact (but slow) way is to store an ordered vector of edge ids, and use binary search to check if they participate in the subset. With this approach the node and edge ids are no longer contiguous, but that shouldn't be an issue. The approach that matches the performance of a regular DAG is to also store Neighbors objects, so the only memory saving benefit is from the annotations. We also have room to experiment with fine tuning in between the two extremes, like bloom filter, perfect hash functions, etc.

  2. Sorry about parameter names, I've planned to name them better, but decided to leave it for a future iteration, together with improving documentation.

DS - DAG Storage
DV - DAG View
NC - Nodes container
EC - Edges container
F - Feature
Fs - Features pack
Tag - Usually equal to Feature, except for the case of Deduplicate<Feature>. It is used for making Deduplicate behave as it's parameter in some places, and differently in others.
CRTP - Curiously recurring template pattern - a useful type of compile-time polymorphism idiom. This lets the base class cast itself to the derived class and access data and functions that are expected to be present on deriving classes.

  1. Deduplicate<Feature> can be used on nodes/edges (not on the DAG itself) for any existing Feature. It stores the unique feature data in a concurrent_ordered_set<Feature>, and on each node/edge it stores a const Feature* pointer. Stored features must be read-only, in order to preserve hashing and equality, and the underlying container ensures pointer stability when adding/removing items. Currently it is an append-only implementation, meaning that if nodes/edges are removed and a unique feature is no longer pointed to, it won't be evicted. Reference counting could be added to solve this. Deduplicate removes all FeatureMutableView functions, and allows only entirely replacing the assigned Feature by operator=(Feature&&) on the node/view. It would be a nice addition to allow Deduplicate to use external storage, so multiple DAGs can share the same pool of unique annotations.

  2. I did have merging use Deduplicate seamlessly for CompactGenome and LeafSet in an earlier design iteration, and will bring it back when adding CladeUnion to LeafSet for issue Counting the number of trees in which each node (or clade) takes part #30.

@willdumm
Copy link
Contributor

willdumm commented Dec 5, 2022

Thanks Ognian! This all looks great to me

@ognian- ognian- merged commit f556f8f into main Dec 5, 2022
@ognian- ognian- deleted the extendable-dag branch December 5, 2022 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants