Implement very memory-efficient clique building #143
Since Babel intermediate files are in TSV format, @balhoff suggests that Souffle might be a good way to do this. This might also make it easy to add more Boomer-like rules if we want to do that in the future. @jeffhhk used Chez Scheme (https://cisco.github.io/ChezScheme/) and ChatGPT to come up with a divide-and-conquer clique-building algorithm, as promised. It works roughly like this:
Well, I wrote some code in Chez Scheme and translated it to Python with the help of ChatGPT. Gaurav informed me that the Babel build takes place on a 500GB server primarily because of clique building. Here is a demo of building cliques on 200M edges in 1GB in Python:
Currently, the fine cliques are printed on stdout, which is more useful for debugging than for running whole files through it. The demo prints the cliques as a list of nodes, but it sounds like there is some policy in the Node Normalizer that needs additional context to make policy decisions. @cbizon told me verbally something to the extent that those policy decisions are currently scattered throughout clique building. It seems to me that it should be possible to A) decorate each edge with metadata on the way in, B) run a policy-free algorithm like the one demonstrated, and C) emit the edges and their metadata belonging to each clique. On that base, it seems reasonable to conjecture that whatever policy decisions are needed can be ported to postprocessing of that output, after the RAM-intensive computation has been completed. I hope that this kind of change can make it easier to run a Babel build and contribute improvements to enrich its information.
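To make the A/B/C split concrete, here is a minimal in-memory sketch (function and field names are hypothetical, and this is not the demo code; it leaves out the divide-and-conquer machinery that keeps the actual demo within 1GB). Step A is assumed to have already attached metadata to each edge; step B computes components with no policy decisions; step C emits each clique's edges with their metadata for later policy postprocessing:

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Step B: compute cliques (connected components) mechanically,
    with no policy decisions. `edges` is a list of
    (subject, object, metadata) tuples; metadata is ignored here."""
    adj = defaultdict(list)
    for s, o, _ in edges:
        adj[s].append(o)
        adj[o].append(s)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components

def emit_clique_edges(edges, components):
    """Step C: group each clique's edges together with the metadata
    attached in step A, so a separate postprocessing step can apply
    whatever policy it needs."""
    index = {node: i for i, comp in enumerate(components) for node in comp}
    by_clique = defaultdict(list)
    for s, o, meta in edges:
        by_clique[index[s]].append((s, o, meta))
    return by_clique
```

The point of the sketch is only where the policy boundary sits: everything before `emit_clique_edges` returns is mechanical graph work, and everything policy-related consumes its output.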
Changing the coarse (first) pass of this algorithm to be based on union-find would take the same space and be faster. For your data size, I would guess about 4x faster. Union-find would also benefit the fine pass, but not as much.
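For reference, here is a minimal union-find (disjoint-set) sketch with path compression and union by size. This is a generic textbook structure, not code from the demo, but it shows why the coarse pass gets cheaper: one `union` per edge streamed from the TSV, near-constant amortized cost per operation, and only a parent pointer and a size per node:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen nodes lazily so edges can stream in.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # union by size
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```

Streaming each TSV edge through `uf.union(subject, object)` and then grouping nodes by `uf.find(node)` would reproduce the coarse components without recursion and with minimal per-node state.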
@jeffhhk has a very memory-efficient divide-and-conquer clique-building algorithm in his head. If we incorporate it into Babel, we might not need 500GB of memory to run it, which would be nice.