Agepr #2

Closed
wants to merge 5 commits into from


ramirezfranciscof
Owner

This PR incorporates two new tools for graph traversal:

  1. AGE: a general-purpose tool for graph traversal. It treats AiiDA nodes and groups alike as 'graph nodes' of an 'expanded graph', and generalizes the exploration of that graph by following certain 'rules'. These 'rules' are defined with generic querybuilder instances: given an initial set of nodes, the queries are applied successively, each on top of the results of the previous one, and the whole cycle is repeated the desired number of times (which can be 'until no new nodes are added').

  2. traverse_graph: a simplified interface that uses the AGE to search for AiiDA nodes and links with a reduced set of customizable rules. Unlike the AGE, which retains essentially the full versatility of the querybuilder and thus allows very intricate traversals, traverse_graph only lets you specify which types of links may be traversed and in which direction.
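To make the idea of 'rules applied in cycles' concrete, here is a minimal sketch in plain Python. It is not the AGE code: plain dicts stand in for querybuilder rules, and the names `apply_rule` and `explore` are hypothetical, chosen only for this illustration.

```python
# Toy illustration only: each 'rule' is a dict mapping an entity to the
# entities it leads to. In the AGE the rules are querybuilder instances.

def apply_rule(rule, entities):
    """Apply one rule to a whole set of entities at once."""
    found = set()
    for entity in entities:
        found.update(rule.get(entity, ()))
    return found


def explore(rules, initial, max_iterations=None):
    """Chain the rules (each on the results of the previous one) and repeat
    the cycle until max_iterations is reached or nothing new is found."""
    visited = set(initial)
    frontier = set(initial)
    iteration = 0
    while frontier and (max_iterations is None or iteration < max_iterations):
        current = frontier
        found = set()
        for rule in rules:
            current = apply_rule(rule, current)
            found |= current
        frontier = found - visited  # only genuinely new entities go on
        visited |= frontier
        iteration += 1
    return visited


# Nodes and groups are both just 'graph nodes' of the expanded graph.
node_to_group = {'n1': ['gA'], 'n2': ['gA']}
group_to_node = {'gA': ['n1', 'n2']}
node_to_output = {'n2': ['n3']}

everything = explore([node_to_group, group_to_node, node_to_output], {'n1'})
print(sorted(everything))
```

Starting from `n1`, the chain first finds its group, then the other members of that group, then their outputs; the second cycle finds nothing new and the loop stops.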

The function traverse_graph (which uses AGE as its search engine) is now used by the delete, the export and the graph visualization procedures. The changes made are basically:

  1. The delete_nodes function now uses the traverser instead of doing its own search (not much of a performance gain expected here).

  2. The retrieve_linked_nodes function (export) now uses the traverser instead of doing its own search (previously every node added to the set was queried separately for ancestors and descendants; now the whole set of newly found nodes is queried at once for more nodes).

  3. The graph class methods recurse_descendants and recurse_ancestors used to work by calling add_incoming and add_outgoing many times. Now these are all independent and they all call the traverser (previously nodes were loaded to get their pks; now these are obtained directly from the query projection).
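The difference between querying every node separately and querying the whole set of new nodes at once can be sketched with a toy store that counts queries. This is an illustration under stated assumptions, not the export code: `CountingStore`, `collect_per_node` and `collect_batched` are hypothetical names, and a dict lookup stands in for a database query.

```python
class CountingStore:
    """Toy stand-in for the database that counts how many queries are made."""

    def __init__(self, links):
        self.links = links  # dict: node -> set of linked nodes
        self.queries = 0

    def linked(self, nodes):
        """One 'query': everything linked to any node in `nodes`."""
        self.queries += 1
        found = set()
        for node in nodes:
            found |= self.links.get(node, set())
        return found


def collect_per_node(store, start):
    """Old-style search: one query for every node that gets added."""
    visited, pending = set(start), list(start)
    while pending:
        node = pending.pop()
        for other in store.linked({node}) - visited:
            visited.add(other)
            pending.append(other)
    return visited


def collect_batched(store, start):
    """Traverser-style search: one query per step, for the whole new set."""
    visited, frontier = set(start), set(start)
    while frontier:
        frontier = store.linked(frontier) - visited
        visited |= frontier
    return visited


# Small tree: node 0 links to 1..5, and each of those links to two leaves.
links = {0: {1, 2, 3, 4, 5}}
for i in range(1, 6):
    links[i] = {10 * i, 10 * i + 1}

per_node_store, batched_store = CountingStore(links), CountingStore(links)
assert collect_per_node(per_node_store, {0}) == collect_batched(batched_store, {0})
print(per_node_store.queries, batched_store.queries)
```

Both functions return the same 16 nodes, but the per-node version issues one query per node while the batched version issues one per traversal step (including the final step that finds nothing new).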

@ramirezfranciscof
Owner Author

@ltalirz I've mentioned the actual functions changed for each feature, so you can test those directly if you prefer.

@ltalirz

ltalirz commented Dec 17, 2019

Thanks @ramirezfranciscof

Test set: Exporting 67k groups with 5 nodes each (which expands to 1.4M nodes with provenance)

"Old implementation"[1]: ~450 minutes
AGE implementation: 6 minutes

Congratulations, this is a 75x speedup!

[1] This is on a modified version of the progress bar PR, where I've already removed slow-down from the progress bar.
It took 5h 15min to retrieve the linked nodes for 70% of the nodes, then I canceled. I extrapolated from there for the total time.
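For anyone checking the numbers, the extrapolation and the speedup follow from the figures quoted above; the variable names are only for this illustration.

```python
elapsed_minutes = 5 * 60 + 15   # 5h 15min measured before cancelling
percent_done = 70               # share of nodes processed at that point

# Linear extrapolation to 100% of the nodes.
old_total_minutes = elapsed_minutes * 100 / percent_done
age_minutes = 6

speedup = old_total_minutes / age_minutes
print(old_total_minutes, speedup)
```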

@ltalirz

ltalirz commented Dec 17, 2019

I think this makes cleaning up and merging this PR a priority.

@ramirezfranciscof
Owner Author

> Test set: Exporting 67k groups with 5 nodes each.
>
> "Old implementation"[1]: ~450 minutes
> AGE implementation: 6 minutes
>
> Congratulations, this is a 75x speedup!

Congrats AGEteam! @lekah @zooks97 Tomorrow I will discuss with @ltalirz and @sphuber which would be the best way to clean up the history for better reviewing/merging into develop.

@ltalirz

ltalirz commented Dec 18, 2019

@ramirezfranciscof Just to mention that I'm in Sion today, so no need to wait for me

lekah and others added 3 commits December 18, 2019 11:35
The AiiDA Graph Explorer (AGE) is a general purpose tool to perform
graph traversal of AiiDA graphs. It considers AiiDA nodes and groups
(eventually even computers and users) as if they were both 'graph nodes'
of an 'expanded graph', and generalizes the exploration of said graph.
The 'rules' that indicate how to traverse this graph are configured by
using generic querybuilder instances (i.e. with information about the
connections but without specific initial nodes/groups and without any
projections). The initial set of nodes/groups is provided directly to
the rule, which then will perform successive applications of the query,
each on top of the results of the previous one. This cycle is repeated
a specified number of times, which can also be set to 'until no new
nodes are found'.

The current implementation works with the following (public) classes:

* Basket: generic container class that can store sets of nodes, groups,
node-node edges (aiida links) and group-node edges. These are the
objects that the rule-objects receive and return.

* UpdateRule: initialized with a querybuilder instance (and optionally a
max number of iterations and the option to track edges), it can then be
run with an initial set of nodes to obtain the result of the accumulated
traversal procedure described by the iterations of the query.

* ReplaceRule: same as the update rule, except that at the end of the
procedure the returned basket contains not the accumulation of the
traversal steps but only the nodes obtained during the last step. This
rule is not compatible with the 'until no new nodes are found' stopping
criterion.

* RuleSequence: this can concatenate the application of different rules
(it basically works like an UpdateRule that iterates over a chain of
rules instead of a single querybuilder instance).

* RuleSaveWalkers and RuleSetWalkers: rules that can be provided in a
chain of rules given to a RuleSequence to save a given state of the
current basket (Save) that can later be used to overwrite the content
of said working basket (Set). This is useful in the case where one might
need to do two operations 'in parallel' (i.e. on the same set of nodes)
instead of doing the second on the results of the first one.
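The difference between the update, replace and save/set behaviours listed above can be sketched on plain sets. These toy functions are hypothetical and do not reproduce the API of the classes; a dict of links stands in for the querybuilder instance and a set stands in for the basket.

```python
def step(links, nodes):
    """One application of the 'query': everything directly linked from nodes."""
    found = set()
    for node in nodes:
        found |= links.get(node, set())
    return found


def toy_update(links, start, max_iterations=None):
    """Update behaviour: accumulate everything found in every step."""
    basket, frontier, done = set(start), set(start), 0
    while frontier and (max_iterations is None or done < max_iterations):
        frontier = step(links, frontier) - basket
        basket |= frontier
        done += 1
    return basket


def toy_replace(links, start, max_iterations):
    """Replace behaviour: keep only what the last step produced."""
    if max_iterations is None:
        # Assumption of this sketch: simply reject 'until no new nodes'.
        raise ValueError('replace needs a finite number of iterations')
    basket = set(start)
    for _ in range(max_iterations):
        basket = step(links, basket)
    return basket


def toy_parallel(links_a, links_b, start):
    """Save/set behaviour: run two operations on the same saved set."""
    saved = set(start)               # 'save' the current walkers
    result = set(start)
    result |= step(links_a, saved)   # first operation
    result |= step(links_b, saved)   # 'set' the walkers back, second operation
    return result


chain = {'w': {'c1'}, 'c1': {'d1'}}
print(sorted(toy_update(chain, {'w'})))      # whole chain, accumulated
print(sorted(toy_replace(chain, {'w'}, 2)))  # only the result of step 2
print(sorted(toy_parallel({'n': {'out'}}, {'n': {'in'}}, {'n'})))
```

In the last call both operations start from `n`; without saving and restoring the set, the second operation would run on the results of the first instead.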

Co-Authored-By: ramirezfranciscof <ramirezfranciscof@users.noreply.github.com>
The function traverse_graph works as a simplified interface to interact
with the AGE that also removes the need to manually handle the basket
and the querybuilder instance:

 * The price to pay for hiding the basket is that this function can only be
   used with sets of nodes and links (so, no groups).

 * The price to pay for hiding the querybuilder is that complex traversal
   rules can no longer be specified: the user simply defines which links
   can be traversed, and this criterion is applied in every iteration (so
   one can't, in a single call, search only for the calc nodes called by
   the work nodes that an initial workflow node calls, as one will also
   obtain the calc nodes directly called by that initial workflow).

The interface receives a set of starting nodes' pks and the full set of
link traversal rules (the definition of those can be found in the module
aiida.common.links), and optionally the number of max iterations (which
by default is None, which means 'until no new nodes are found') and a
boolean that indicates if the links (edges) should be returned.
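The behaviour described above, including the limitation on complex rules, can be sketched with a toy traverser. This is not the signature of traverse_graph: `toy_traverse` and its parameters `forward` and `backward` are hypothetical, edges are plain `(source, target, link_type)` tuples, and the link-type strings are used only as labels.

```python
def toy_traverse(edges, starting, forward=(), backward=(),
                 max_iterations=None, get_links=False):
    """Follow the allowed link types in the allowed directions, applying
    the same criterion in every iteration."""
    visited, frontier = set(starting), set(starting)
    links, done = set(), 0
    while frontier and (max_iterations is None or done < max_iterations):
        found = set()
        for source, target, link_type in edges:
            if link_type in forward and source in frontier:
                found.add(target)
                links.add((source, target, link_type))
            if link_type in backward and target in frontier:
                found.add(source)
                links.add((source, target, link_type))
        frontier = found - visited
        visited |= frontier
        done += 1
    result = {'nodes': visited}
    if get_links:
        result['links'] = links
    return result


# W0 calls a work node W1 and a calc node C0; W1 calls a calc node C1.
edges = [
    ('W0', 'W1', 'call_work'),
    ('W0', 'C0', 'call_calc'),
    ('W1', 'C1', 'call_calc'),
]

full = toy_traverse(edges, {'W0'}, forward=('call_work', 'call_calc'))
print(sorted(full['nodes']))
```

Because the same link criterion applies in every iteration, there is no way to reach `C1` from `W0` in a single call without also collecting `C0`, which is the limitation noted in the second bullet.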

Co-Authored-By: Leonid Kahle <leonid.kahle@epfl.ch>
The node deletion function now uses the traverse_graph function (with
AGE as main engine) to collect the extra nodes that are needed to keep a
consistent provenance. The procedure is not very different from the one
that was initially implemented so no significant performance improvement
is expected, but this is an important first step to homogenize graph
traversal throughout the whole code.
lekah and others added 2 commits December 18, 2019 14:46
The export function now uses the traverse_graph function (with AGE as
the main engine) to collect the extra nodes that are needed to keep a
consistent provenance. This is performed, more specifically, by the
'retrieve_linked_nodes' function. Whereas previously a different query
was performed for each new node added in the previous query step, this
new implementation should do a single new query for all the nodes that
were added in the previous query step. So these changes are not only
important as a first step to homogenize graph traversal throughout the
whole code: an improvement in the export procedure is expected as well.
The graph visualization feature now uses the traverse_graph function
(with AGE as the main engine) to collect the requested nodes to be
visualized. This was implemented in the methods of the graph class:
previously, `recurse_descendants` and `recurse_ancestors` used to
work by calling `add_incoming` and `add_outgoing` many times, which
in turn had to load nodes during the procedure. Now these are all
independent and they all call the traverse_graph function, so the
information is obtained directly from the query projections and no
nodes are loaded. So these changes are not only important as a first
step to homogenize graph traversal throughout the whole code: an
improvement in the visualization procedure is expected as well.