Implement very memory-efficient clique building #143

Open
gaurav opened this issue Jun 5, 2023 · 3 comments

gaurav commented Jun 5, 2023

@jeffhhk has a very memory-efficient divide-and-conquer clique building algorithm in his head. If we incorporate it into Babel, we might not need 500G of memory to run it, which would be nice.


gaurav commented Jun 7, 2023

Since Babel intermediate files are in TSV format, @balhoff suggests that Souffle might be a good way to do this. This might also make it easy to add more Boomer-like rules if we want to do that in the future.

@jeffhhk used Chez Scheme (https://cisco.github.io/ChezScheme/) and ChatGPT to come up with a divide-and-conquer clique-building algorithm, as promised. It works roughly like this (a Python sketch follows the list):

  1. Allocate a fixed-size chunk of memory for cliqueing. You can adjust its size based on what your computer supports.
  2. Set up the memory so that each index points to itself (x[0] = 0, x[1] = 1, etc.).
  3. Do a coarse hash of the identifiers in every pair: use a hash function (Python has one built in) modulo the size of the memory chunk to convert each identifier string into an index.
  4. Implement an algorithm that modifies the memory chunk so that it maintains cliques of hashes as linked chains. For example, to add the pair (4, 1) to the existing clique 4->2->3->5, change it to 4->1->2->3->5. There are four cases to consider: neither index is in a clique yet, only the first is, only the second is, or both already are (merge the two cliques, or do nothing if they are the same one).
  5. Write every pair of identifiers to a file, with an additional column that identifies the clique. A clique can be identified by arbitrarily picking the smallest index in that clique.
  6. Use /usr/bin/sort to sort this file by its first column. Because the hashing is coarse, distinct cliques can collide on the same indices, so the sorted file consists of a series of "supercliques" one after the other, each identified by the clique ID in the first column; most if not all of them will need to be split.
  7. Read the sorted file, but this time use a fine hashing algorithm -- e.g. skip the modulo, or just use an ordinary hashtable keyed by the full identifier strings. This takes a lot more memory per clique, BUT every true clique is a subclique of the current superclique, so we can use as much memory as needed to process one superclique, then throw the hashtable away and move on to the next.
  8. This gives you a series of cliques, each with its set of identifiers.
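To make the steps concrete, here is a rough, illustrative Python sketch of how the two passes might fit together. It is not @jeffhhk's actual code: the file names, TSV layout, and chunk size are invented, and the same-clique check in the coarse pass is a deliberately naive walk around the chain (which is exactly the part a union-find structure would speed up).

```python
import subprocess

# Hypothetical file names and sizes, for illustration only.
CHUNK_SIZE = 1_000_000                      # step 1: fixed memory budget
PAIRS_FILE = "pairs.tsv"                    # input: id1 <TAB> id2 per line
COARSE_FILE = "coarse.tsv"                  # output: clique_id <TAB> id1 <TAB> id2
SORTED_FILE = "coarse.sorted.tsv"

succ = list(range(CHUNK_SIZE))              # step 2: each index points to itself

def coarse_index(identifier):
    # step 3: coarse hash -- collisions are fine, they only make supercliques bigger
    return hash(identifier) % CHUNK_SIZE

def same_clique(a, b):
    # Naive linear walk around a's chain; union-find would avoid this.
    i = a
    while True:
        if i == b:
            return True
        i = succ[i]
        if i == a:
            return False

def add_pair(a, b):
    # step 4: splice the two chains together unless a and b already share one.
    # The four cases (both new, only a known, only b known, both known)
    # all reduce to this one successor swap.
    if not same_clique(a, b):
        succ[a], succ[b] = succ[b], succ[a]

# Coarse pass over all identifier pairs.
with open(PAIRS_FILE) as pairs:
    for line in pairs:
        id1, id2 = line.rstrip("\n").split("\t")
        add_pair(coarse_index(id1), coarse_index(id2))

# Label each index with the smallest index in its clique (the clique ID),
# then rewrite the pairs with that label and sort on it (steps 5-6).
label = [None] * CHUNK_SIZE
for start in range(CHUNK_SIZE):
    if label[start] is None:
        members = [start]
        i = succ[start]
        while i != start:
            members.append(i)
            i = succ[i]
        smallest = min(members)
        for i in members:
            label[i] = smallest

with open(PAIRS_FILE) as pairs, open(COARSE_FILE, "w") as out:
    for line in pairs:
        id1, id2 = line.rstrip("\n").split("\t")
        out.write(f"{label[coarse_index(id1)]}\t{id1}\t{id2}\n")

subprocess.run(["sort", "-t", "\t", "-k1,1n", "-o", SORTED_FILE, COARSE_FILE],
               check=True)

# Fine pass (steps 7-8): exact grouping within one superclique at a time,
# using an ordinary dict keyed by the full identifier strings.
def emit_cliques(edges):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for id1, id2 in edges:
        parent[find(id1)] = find(id2)
    cliques = {}
    for node in list(parent):
        cliques.setdefault(find(node), set()).add(node)
    for members in cliques.values():
        print("\t".join(sorted(members)))

current, edges = None, []
with open(SORTED_FILE) as sorted_pairs:
    for line in sorted_pairs:
        key, id1, id2 = line.rstrip("\n").split("\t")
        if key != current and edges:
            emit_cliques(edges)            # per-superclique dict is discarded here
            edges = []
        current = key
        edges.append((id1, id2))
if edges:
    emit_cliques(edges)
```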


jeffhhk commented Jun 7, 2023

Well, I wrote some code in Chez Scheme and translated it to Python with the help of ChatGPT.

Gaurav informed me that the Babel build takes place on a 500GB server primarily because of clique building.

Here is a demo of building cliques on 200M edges in 1GB in python:

https://gist.github.com/jeffhhk/42bc2ea4c263eb4958220522201975ae

Currently, the fine cliques are printed on stdout, which is more useful for debugging than for running whole files through it.

The demo prints each clique as a list of nodes, but it sounds like there is some policy in the Node Normalizer that needs additional context to make its decisions. @cbizon told me verbally something to the effect that those policy decisions are currently scattered throughout clique building.

It seems to me that it should be possible to A) decorate each edge with metadata on the way in, B) run a policy-free algorithm like the one demonstrated, and C) emit the edges and their metadata belonging to each clique. On that basis, it seems reasonable to conjecture that whatever policy decisions are needed can be ported to postprocessing of that output, after the RAM-intensive computation has been completed.
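As a rough sketch of A)-C), with hypothetical column names: the grouping step only ever needs the two identifier columns, and every other column is metadata that rides along and is re-emitted next to the clique ID for later postprocessing.

```python
import csv
import sys

def tag_edges(rows, clique_of):
    """Pass metadata columns through untouched; clique_of maps an identifier
    to its clique representative (e.g. the output of the passes sketched above)."""
    for subject, obj, *metadata in rows:
        yield [clique_of(subject), subject, obj, *metadata]

if __name__ == "__main__":
    # Hypothetical input on stdin: subject <TAB> object <TAB> predicate <TAB> source ...
    reader = csv.reader(sys.stdin, delimiter="\t")
    writer = csv.writer(sys.stdout, delimiter="\t")
    # Identity mapping stands in for the real clique lookup in this sketch.
    for row in tag_edges(reader, clique_of=lambda identifier: identifier):
        writer.writerow(row)
```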

I hope that this kind of change can make it easier to run a Babel build and contribute improvements to enrich its information.

gaurav added this to the Babel December Release milestone Jun 8, 2023

jeffhhk commented Sep 12, 2023

Changing the coarse (first) pass of this algorithm to be based on union-find would take the same space and be faster. For your data size, I would guess about 4x faster. Union-find would also benefit the fine pass but not as much.
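Roughly, that change would replace the successor chains and their linear same-clique walk with a parent array plus path compression; keeping the smaller index as the root means find() directly yields the clique ID used in the first output column, so no separate labeling walk is needed. A minimal sketch, reusing the hypothetical CHUNK_SIZE from the earlier example:

```python
# Hypothetical union-find replacement for the coarse pass:
# same fixed memory budget, but near-constant-time merges.
CHUNK_SIZE = 1_000_000
parent = list(range(CHUNK_SIZE))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]       # path compression (halving)
        i = parent[i]
    return i

def add_pair(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Keep the smaller index as the root, so find() of either endpoint
        # gives the clique ID to write in the first column.
        parent[max(ra, rb)] = min(ra, rb)
```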
