Agepr #2

ramirezfranciscof · 2019-12-17T15:59:31Z

This PR incorporates two new tools for graph traversal:

AGE: this is a general-purpose tool for graph traversal. It treats AiiDA nodes and groups as 'graph nodes' of an 'expanded graph', and generalizes the exploration of that graph by following certain 'rules'. These 'rules' are defined using generic querybuilder instances: given an initial set of nodes, the queries are applied successively, each on top of the results of the previous one, and the whole cycle is repeated the desired number of times (which can be 'until no new nodes are added').
traverse_graph: this is a simplified interface that uses the AGE to search for AiiDA nodes and links with a reduced set of customizable rules. Unlike the AGE, which retains essentially the same versatility as the querybuilder and thus allows for very intricate traversals, traverse_graph only lets you specify which types of links may be traversed and in which direction.

The function traverse_graph (which uses AGE as its search engine) is now used by the delete, the export and the graph visualization procedures. The changes made are basically:

The delete_nodes function now uses the traverser instead of doing its own search (not much of a performance gain).
The retrieve_linked_nodes function (export) now uses the traverser instead of doing its own search (previously every node added to the set was queried separately for ancestors and descendants; now the whole set of newly found nodes is queried at once for more nodes).
The graph class methods recurse_descendants and recurse_ancestors used to work by calling add_incoming and add_outgoing many times. Now these are all independent and they all call the traverser (previously nodes were loaded to get their pks; now these are obtained directly from the query projection).

ramirezfranciscof · 2019-12-17T16:01:25Z

@ltalirz I've mentioned the actual functions changed for each feature, so you can test those directly if you prefer.

ltalirz · 2019-12-17T16:45:26Z

Thanks @ramirezfranciscof

Test set: Exporting 67k groups with 5 nodes each (which expands to 1.4M nodes with provenance)

"Old implementation"[1]: ~450 minutes
AGE implementation: 6 minutes

Congratulations, this is a 75x speedup!

[1] This is on a modified version of the progress bar PR, where I've already removed slow-down from the progress bar.
It took 5h 15min to retrieve the linked nodes for 70% of the nodes, then I canceled. I extrapolated from there for the total time.

ltalirz · 2019-12-17T18:05:11Z

I think this makes cleaning up and merging this PR a priority.

ramirezfranciscof · 2019-12-17T18:50:16Z

> Test set: Exporting 67k groups with 5 nodes each.
>
> "Old implementation"[1]: ~450 minutes
> AGE implementation: 6 minutes
>
> Congratulations, this is a 75x speedup!

Congrats AGEteam! @lekah @zooks97 Tomorrow I will discuss with @ltalirz and @sphuber which would be the best way to clean up the history for better reviewing/merging into develop.

ltalirz · 2019-12-18T10:30:00Z

@ramirezfranciscof Just to mention that I'm in Sion today, so no need to wait for me

The AiiDA Graph Explorer (AGE) is a general-purpose tool to perform graph traversal of AiiDA graphs. It considers AiiDA nodes and groups (eventually even computers and users) as 'graph nodes' of an 'expanded graph', and generalizes the exploration of said graph.

The 'rules' that indicate how to traverse this graph are configured by using generic querybuilder instances (i.e. with information about the connections but without specific initial nodes/groups and without any projections). The initial set of nodes/groups is provided directly to the rule, which then performs successive applications of the query, each on top of the results of the previous one. This cycle is repeated a specified number of times, which can be set to 'until no new nodes are found'.

The current implementation works with the following (public) classes:

* Basket: generic container class that can store sets of nodes, groups, node-node edges (aiida links) and group-node edges. These are the objects that the rule-objects receive and return.
* UpdateRule: initialized with a querybuilder instance (and optionally a max number of iterations and the option to track edges), it can then be run with an initial set of nodes to obtain the result of the accumulated traversal procedure described by the iterations of the query.
* ReplaceRule: same as the update rule, except that at the end of the procedure the returned basket contains not the accumulation of the traversal steps but only the nodes obtained during the last step. This rule is not compatible with the 'until no new nodes are found' end-of-iteration criterion.
* RuleSequence: this can concatenate the application of different rules (it basically works like an UpdateRule that iterates over a chain of rules instead of a single querybuilder instance).
* RuleSaveWalkers and RuleSetWalkers: rules that can be provided in a chain of rules given to a RuleSequence to save a given state of the current basket (Save) that can later be used to overwrite the content of said working basket (Set). This is useful in the case where one might need to do two operations 'in parallel' (i.e. on the same set of nodes) instead of doing the second on the results of the first one.

Co-Authored-By: ramirezfranciscof <ramirezfranciscof@users.noreply.github.com>
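
The difference between the accumulating (update) and last-step-only (replace) behaviours described above can be illustrated with a minimal sketch. This is a toy model in plain Python and not the AGE code: the graph `GRAPH`, the `step` function standing in for one application of a querybuilder instance, and the helpers `run_update` and `run_replace` are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical toy graph (an assumption): node -> set of direct successors.
GRAPH: Dict[int, Set[int]] = {1: {2}, 2: {3}, 3: {4}, 4: set()}


def step(nodes: Set[int]) -> Set[int]:
    """Stand-in for one application of the query: all direct successors."""
    found: Set[int] = set()
    for node in nodes:
        found |= GRAPH.get(node, set())
    return found


def run_update(start: Set[int],
               query: Callable[[Set[int]], Set[int]],
               max_iterations: Optional[int] = None) -> Set[int]:
    """Accumulate every node found. None means 'until no new nodes are found'."""
    accumulated = set(start)
    walkers = set(start)
    iterations = 0
    while walkers and (max_iterations is None or iterations < max_iterations):
        # Only nodes not seen before keep walking in the next iteration.
        walkers = query(walkers) - accumulated
        accumulated |= walkers
        iterations += 1
    return accumulated


def run_replace(start: Set[int],
                query: Callable[[Set[int]], Set[int]],
                max_iterations: Optional[int]) -> Set[int]:
    """Return only the nodes obtained in the last step (needs a finite count)."""
    if max_iterations is None:
        raise ValueError("replace semantics need a finite number of iterations")
    walkers = set(start)
    for _ in range(max_iterations):
        walkers = query(walkers)
    return walkers


all_reached = run_update({1}, step)          # runs until nothing new is found
two_steps_update = run_update({1}, step, 2)  # start node plus two steps
two_steps_replace = run_replace({1}, step, 2)  # only the second step's nodes
```

In this toy model, a replace-style run that continued 'until no new nodes are found' would always end on an empty last step, which is one plausible reading of why that stopping criterion is excluded for ReplaceRule.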

The function traverse_graph works as a simplified interface to interact with the AGE that also removes the need to manually handle the basket and the querybuilder instance:

* The price to pay for hiding the basket is that this function can only be used with sets of nodes and links (so, no groups).
* The price to pay for hiding the querybuilder is that complex traversal rules can no longer be specified: the user simply defines which links can be traversed, and this criterion is applied in every iteration (so one can't, in a single call, search only for all called calc nodes of the called work nodes of an initial workflow node, as one will also obtain the calc nodes directly called by that initial workflow).

The interface receives a set of starting nodes' pks and the full set of link traversal rules (the definition of those can be found in the module aiida.common.links), and optionally the maximum number of iterations (which by default is None, meaning 'until no new nodes are found') and a boolean that indicates whether the links (edges) should be returned.

Co-Authored-By: Leonid Kahle <leonid.kahle@epfl.ch>
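
The limitation described above (the same link criterion being applied in every iteration) can be seen in a small self-contained sketch. It does not call traverse_graph and does not use its signature: `LINKS`, the link labels and `toy_traverse` are hypothetical stand-ins, and the labels are plain illustrative strings rather than the definitions in aiida.common.links.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical provenance (an assumption): (source, link label, target).
LINKS: List[Tuple[str, str, str]] = [
    ("W0", "call_work", "W1"),  # initial workflow calls another workflow
    ("W0", "call_calc", "C0"),  # calc called directly by the initial workflow
    ("W1", "call_calc", "C1"),  # calc called by the called workflow
]


def toy_traverse(start: Set[str],
                 allowed: Set[str],
                 max_iterations: Optional[int] = None) -> Set[str]:
    """Follow only the allowed link labels, forwards, in every iteration."""
    found = set(start)
    frontier = set(start)
    iterations = 0
    while frontier and (max_iterations is None or iterations < max_iterations):
        reached = {target for (source, label, target) in LINKS
                   if source in frontier and label in allowed}
        frontier = reached - found  # keep only the nodes not seen before
        found |= frontier
        iterations += 1
    return found


# Goal: the calcs called by the workflows that W0 calls (only C1).
# Both link labels must be allowed to get there, so C0 is collected as well.
result = toy_traverse({"W0"}, {"call_work", "call_calc"})
one_iteration = toy_traverse({"W0"}, {"call_work", "call_calc"}, max_iterations=1)
```

Here `result` contains `C0` in addition to `C1`, because allowing the calc-call link at all means it is also followed from the starting workflow. Restricting the first step to one link type and the second step to another is the kind of traversal that needs the full rule machinery rather than a single fixed criterion.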

The node deletion function now uses the traverse_graph function (with AGE as the main engine) to collect the extra nodes that are needed to keep a consistent provenance. The procedure is not very different from the one that was initially implemented, so no significant performance improvement is expected, but this is an important first step to homogenize graph traversal throughout the whole code.

The export function now uses the traverse_graph function (with AGE as the main engine) to collect the extra nodes that are needed to keep a consistent provenance. This is performed, more specifically, by the 'retrieve_linked_nodes' function. Whereas previously a different query was performed for each new node added in the previous query step, this new implementation should do a single new query for all the nodes that were added in the previous query step. So these changes are not only important as a first step to homogenize graph traversal throughout the whole code: an improvement in the export procedure is expected as well.
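
The expected gain from querying the whole set of newly found nodes at once, rather than one node at a time, can be sketched by counting queries on a toy graph. This is an illustration under simplifying assumptions and not the export code: `CountingBackend`, `collect_per_node` and `collect_batched` are hypothetical names, each call to `successors_of` stands in for one database query, and only one traversal direction is followed.

```python
from typing import Dict, Set

# Hypothetical toy graph (an assumption): node -> set of direct successors.
GRAPH: Dict[str, Set[str]] = {
    "a": {"b", "c", "d"},
    "b": {"e", "f"},
    "c": {"g"},
    "d": {"h"},
    "e": set(), "f": set(), "g": set(), "h": set(),
}


class CountingBackend:
    """Stand-in for the database: counts how many queries are issued."""

    def __init__(self, graph: Dict[str, Set[str]]) -> None:
        self.graph = graph
        self.calls = 0

    def successors_of(self, nodes: Set[str]) -> Set[str]:
        self.calls += 1  # one call models one query
        found: Set[str] = set()
        for node in nodes:
            found |= self.graph.get(node, set())
        return found


def collect_per_node(backend: CountingBackend, start: Set[str]) -> Set[str]:
    """One query for every node that gets added to the set."""
    found = set(start)
    todo = list(start)
    while todo:
        node = todo.pop()
        for new in backend.successors_of({node}) - found:
            found.add(new)
            todo.append(new)
    return found


def collect_batched(backend: CountingBackend, start: Set[str]) -> Set[str]:
    """One query per iteration, covering all nodes added in the previous one."""
    found = set(start)
    frontier = set(start)
    while frontier:
        frontier = backend.successors_of(frontier) - found
        found |= frontier
    return found


per_node_backend = CountingBackend(GRAPH)
batched_backend = CountingBackend(GRAPH)
per_node_result = collect_per_node(per_node_backend, {"a"})
batched_result = collect_batched(batched_backend, {"a"})
```

Both functions return the same set of nodes. In this toy graph the per-node version issues one query per node reached, while the batched version issues one query per traversal depth plus a final one that finds nothing new.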

The graph visualization feature now uses the traverse_graph function (with AGE as the main engine) to collect the requested nodes to be visualized. This was implemented in the methods of the graph class: previously, `recurse_descendants` and `recurse_ancestors` used to work by calling `add_incoming` and `add_outgoing` many times, which in turn have to load nodes during the procedure. Now these are all independent and they all call the traverse_graph function, so the information is obtained directly from the query projections and no nodes are loaded. So these changes are not only important as a first step to homogenize graph traversal throughout the whole code: an improvement in the visualization procedure is expected as well.

lekah and others added 3 commits December 18, 2019 11:35

ramirezfranciscof force-pushed the agepr branch from 260257b to af06731 Compare December 18, 2019 12:07

lekah and others added 2 commits December 18, 2019 14:46

ramirezfranciscof force-pushed the agepr branch from af06731 to 2334972 Compare December 18, 2019 13:49

ramirezfranciscof closed this Dec 18, 2019
