Implement very memory-efficient clique building #143

Open
gaurav opened this issue Jun 5, 2023 · 3 comments

gaurav commented Jun 5, 2023

@jeffhhk has a very memory-efficient divide-and-conquer clique building algorithm in his head. If we incorporate it into Babel, we might not need 500G of memory to run it, which would be nice.


gaurav commented Jun 7, 2023

Since Babel intermediate files are in TSV format, @balhoff suggests that Souffle might be a good way to do this. This might also make it easy to add more Boomer-like rules if we want to do that in the future.

@jeffhhk used Chez Scheme (https://cisco.github.io/ChezScheme/) and ChatGPT to come up with a divide-and-conquer clique-building algorithm, as promised. It works roughly like this (a Python sketch follows the list):

  1. Allocate a fixed-size chunk of memory for cliqueing. You can adjust its size based on what your computer supports.
  2. Set up the memory so that each index points to itself (x[0] = 0, x[1] = 1, etc.).
  3. Do a coarse hash of the identifiers in every pair: use a hash function (Python has one built in) modulo the size of the memory chunk to convert each identifier string into an index.
  4. Implement an algorithm that modifies the memory chunk so that it maintains cliques of hashes as linked chains. For example, to add the pair (4, 1) to the existing clique 4->2->3->5, change it to 4->1->2->3->5. There are four cases to consider: neither index is in a clique yet, only the first is, only the second is, or both already are (merge the two cliques, or do nothing if they are the same one).
  5. Write every pair of identifiers to a file, with an additional column that identifies the clique. A clique can be identified by arbitrarily picking the smallest index in that clique.
  6. Use /usr/bin/sort to sort this file by its first column. Because the hashing is coarse, distinct cliques can collide on the same indices, so the sorted file consists of a series of "supercliques" one after the other, each identified by the clique ID in the first column; most if not all of them will need to be split.
  7. Read the sorted file, but this time use a fine hashing algorithm -- e.g. skip the modulo, or just use an ordinary hashtable keyed by the full identifier strings. This takes a lot more memory per clique, BUT every true clique is a subclique of the current superclique, so we can use as much memory as needed to process one superclique, then throw the hashtable away and move on to the next.
  8. This gives you a series of cliques, each with its set of identifiers.
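To make the steps concrete, here is a rough, illustrative Python sketch of how the two passes might fit together. It is not @jeffhhk's actual code: the file names, TSV layout, and chunk size are invented, and the same-clique check in the coarse pass is a deliberately naive walk around the chain (which is exactly the part a union-find structure would speed up).

```python
import subprocess

# Hypothetical file names and sizes, for illustration only.
CHUNK_SIZE = 1_000_000                      # step 1: fixed memory budget
PAIRS_FILE = "pairs.tsv"                    # input: id1 <TAB> id2 per line
COARSE_FILE = "coarse.tsv"                  # output: clique_id <TAB> id1 <TAB> id2
SORTED_FILE = "coarse.sorted.tsv"

succ = list(range(CHUNK_SIZE))              # step 2: each index points to itself

def coarse_index(identifier):
    # step 3: coarse hash -- collisions are fine, they only make supercliques bigger
    return hash(identifier) % CHUNK_SIZE

def same_clique(a, b):
    # Naive linear walk around a's chain; union-find would avoid this.
    i = a
    while True:
        if i == b:
            return True
        i = succ[i]
        if i == a:
            return False

def add_pair(a, b):
    # step 4: splice the two chains together unless a and b already share one.
    # The four cases (both new, only a known, only b known, both known)
    # all reduce to this one successor swap.
    if not same_clique(a, b):
        succ[a], succ[b] = succ[b], succ[a]

# Coarse pass over all identifier pairs.
with open(PAIRS_FILE) as pairs:
    for line in pairs:
        id1, id2 = line.rstrip("\n").split("\t")
        add_pair(coarse_index(id1), coarse_index(id2))

# Label each index with the smallest index in its clique (the clique ID),
# then rewrite the pairs with that label and sort on it (steps 5-6).
label = [None] * CHUNK_SIZE
for start in range(CHUNK_SIZE):
    if label[start] is None:
        members = [start]
        i = succ[start]
        while i != start:
            members.append(i)
            i = succ[i]
        smallest = min(members)
        for i in members:
            label[i] = smallest

with open(PAIRS_FILE) as pairs, open(COARSE_FILE, "w") as out:
    for line in pairs:
        id1, id2 = line.rstrip("\n").split("\t")
        out.write(f"{label[coarse_index(id1)]}\t{id1}\t{id2}\n")

subprocess.run(["sort", "-t", "\t", "-k1,1n", "-o", SORTED_FILE, COARSE_FILE],
               check=True)

# Fine pass (steps 7-8): exact grouping within one superclique at a time,
# using an ordinary dict keyed by the full identifier strings.
def emit_cliques(edges):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for id1, id2 in edges:
        parent[find(id1)] = find(id2)
    cliques = {}
    for node in list(parent):
        cliques.setdefault(find(node), set()).add(node)
    for members in cliques.values():
        print("\t".join(sorted(members)))

current, edges = None, []
with open(SORTED_FILE) as sorted_pairs:
    for line in sorted_pairs:
        key, id1, id2 = line.rstrip("\n").split("\t")
        if key != current and edges:
            emit_cliques(edges)            # per-superclique dict is discarded here
            edges = []
        current = key
        edges.append((id1, id2))
if edges:
    emit_cliques(edges)
```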


jeffhhk commented Jun 7, 2023

Well, I wrote some code in Chez Scheme and translated it to Python with the help of ChatGPT.

Gaurav informed me that the Babel build takes place on a 500GB server primarily because of clique building.

Here is a demo of building cliques on 200M edges in 1GB in python:

https://gist.github.com/jeffhhk/42bc2ea4c263eb4958220522201975ae

Currently, the fine cliques are printed on stdout, which is more useful for debugging than for running whole files through it.

The demo prints each clique as a list of nodes, but it sounds like there is some policy in the Node Normalizer that needs additional context to make its decisions. @cbizon told me verbally something to the effect that those policy decisions are currently scattered throughout clique building.

It seems to me that it should be possible to A) decorate each edge with metadata on the way in, B) run a policy-free algorithm like the one demonstrated, and C) emit the edges and their metadata belonging to each clique. On that basis, it seems reasonable to conjecture that whatever policy decisions are needed can be ported to postprocessing of that output, after the RAM-intensive computation has been completed.
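As a rough sketch of A)-C), with hypothetical column names: the grouping step only ever needs the two identifier columns, and every other column is metadata that rides along and is re-emitted next to the clique ID for later postprocessing.

```python
import csv
import sys

def tag_edges(rows, clique_of):
    """Pass metadata columns through untouched; clique_of maps an identifier
    to its clique representative (e.g. the output of the passes sketched above)."""
    for subject, obj, *metadata in rows:
        yield [clique_of(subject), subject, obj, *metadata]

if __name__ == "__main__":
    # Hypothetical input on stdin: subject <TAB> object <TAB> predicate <TAB> source ...
    reader = csv.reader(sys.stdin, delimiter="\t")
    writer = csv.writer(sys.stdout, delimiter="\t")
    # Identity mapping stands in for the real clique lookup in this sketch.
    for row in tag_edges(reader, clique_of=lambda identifier: identifier):
        writer.writerow(row)
```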

I hope that this kind of change can make it easier to run a Babel build and contribute improvements to enrich its information.

gaurav added this to the Babel December Release milestone Jun 8, 2023

jeffhhk commented Sep 12, 2023

Changing the coarse (first) pass of this algorithm to be based on union-find would take the same space and be faster. For your data size, I would guess about 4x faster. Union-find would also benefit the fine pass but not as much.
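Roughly, that change would replace the successor chains and their linear same-clique walk with a parent array plus path compression; keeping the smaller index as the root means find() directly yields the clique ID used in the first output column, so no separate labeling walk is needed. A minimal sketch, reusing the hypothetical CHUNK_SIZE from the earlier example:

```python
# Hypothetical union-find replacement for the coarse pass:
# same fixed memory budget, but near-constant-time merges.
CHUNK_SIZE = 1_000_000
parent = list(range(CHUNK_SIZE))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]       # path compression (halving)
        i = parent[i]
    return i

def add_pair(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Keep the smaller index as the root, so find() of either endpoint
        # gives the clique ID to write in the first column.
        parent[max(ra, rb)] = min(ra, rb)
```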
