
Adaptive compare_batch_size for Resolve Operator #1

Closed

shreyashankar opened this issue Sep 17, 2024 · 3 comments
Labels
efficiency (Making docetl operations run faster) · good first research issue (Good for newcomers who want to get involved in research)

Comments

@shreyashankar
Collaborator

Current Implementation

Our ResolveOperation class uses a Union-Find (aka Disjoint Set Union) algorithm for grouping similar items efficiently. Here's how it works:

We've got two main data structures:

  1. clusters: A list of sets, each representing a cluster of items.
  2. cluster_map: A dictionary that maps each item to its current cluster representative.

The key functions are:

  • find_cluster(item): Finds the representative of an item's cluster.
  • merge_clusters(item1, item2): When two items match, this merges their clusters. It finds the representatives of both items' clusters, then folds the smaller cluster into the larger one for efficiency (union by size), and updates cluster_map to reflect the merge.
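For concreteness, here's a minimal, self-contained sketch of these two structures and functions. This is illustrative only, not the actual ResolveOperation code; it assumes items are identified by integer indices:

```python
n_items = 5  # example dataset size (assumption for illustration)

# One cluster per item to start with.
clusters = [{i} for i in range(n_items)]
# Maps each item to its current cluster representative.
cluster_map = {i: i for i in range(n_items)}

def find_cluster(item):
    """Follow cluster_map until reaching an item that is its own representative."""
    while cluster_map[item] != item:
        item = cluster_map[item]
    return item

def merge_clusters(item1, item2):
    """Merge the clusters of two matching items (union by size)."""
    root1, root2 = find_cluster(item1), find_cluster(item2)
    if root1 == root2:
        return  # already in the same cluster
    # Fold the smaller cluster into the larger one.
    if len(clusters[root1]) < len(clusters[root2]):
        root1, root2 = root2, root1
    clusters[root1] |= clusters[root2]
    clusters[root2] = set()       # emptied; skipped when collecting results
    cluster_map[root2] = root1    # redirect the old representative
```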

The resolution process goes like this:

  1. Each item starts in its own cluster.
  2. We generate all possible item pairs to compare.
  3. We process these pairs in batches (controlled by compare_batch_size, default is 100).
  4. For each batch:
    a. We use an LLM to do pairwise comparisons and see if items match.
    b. For each matching pair, we call merge_clusters to combine their clusters.
  5. We keep doing this until we've compared all pairs.
  6. Finally, we collect all non-empty clusters as the result.

This approach lets us do efficient, incremental clustering as we find matches, without rebuilding the whole cluster structure after each match. Processing comparisons in batches means we can parallelize LLM calls, which helps with overall performance.
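Putting the pieces together, a simplified version of that batched loop might look like the following, building on the Union-Find sketch above. Here llm_compare is a hypothetical stand-in for the real LLM comparison call:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def llm_compare(a, b) -> bool:
    """Hypothetical placeholder: in practice, prompt an LLM with both items."""
    raise NotImplementedError

def resolve(items, compare_batch_size=100):
    # Step 1 happened above: every item starts in its own cluster.
    pairs = list(combinations(range(len(items)), 2))        # step 2: all pairs
    for start in range(0, len(pairs), compare_batch_size):  # step 3: batches
        batch = pairs[start:start + compare_batch_size]
        # Step 4a: run the batch's LLM comparisons concurrently.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            matches = list(pool.map(lambda p: llm_compare(items[p[0]], items[p[1]]), batch))
        # Step 4b: merge clusters for each matching pair.
        for (i, j), is_match in zip(batch, matches):
            if is_match:
                merge_clusters(i, j)
    return [c for c in clusters if c]                       # step 6: non-empty clusters
```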

Problem

The fixed compare_batch_size we're using now can lead to some performance issues, especially with large datasets. Here's what can happen:

  1. If the batch size is too small, we under-use parallelism: the comparisons run as many small, sequential rounds of LLM calls, which slows the whole operation down.
  2. If it's too large, we might overwhelm system memory or hit API rate limits.
  3. Our one-size-fits-all approach doesn't adapt to different dataset sizes or system capabilities.

This lack of flexibility can make execution times unnecessarily slow for large datasets, or lead to inefficient resource use for smaller ones.

Proposed Enhancement

We should automatically configure compare_batch_size based on the number of pairwise comparisons, but only if the user hasn't specified it themselves. This would help optimize performance for datasets of all sizes.
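As one possible starting point, here's a sketch of such a function. The square-root scaling heuristic, the bound values, and the pick_batch_size helper are all assumptions, not a prescribed design; it also sketches the logging mentioned in the tasks below:

```python
import logging
import math

logger = logging.getLogger(__name__)

# Assumed guardrails; see "Additional Considerations" below.
MIN_BATCH_SIZE = 10
MAX_BATCH_SIZE = 1000

def auto_compare_batch_size(num_comparisons):
    """Scale the batch size with the workload, clamped to sane bounds."""
    proposed = int(math.sqrt(num_comparisons)) if num_comparisons > 0 else MIN_BATCH_SIZE
    return max(MIN_BATCH_SIZE, min(proposed, MAX_BATCH_SIZE))

def pick_batch_size(config, num_comparisons):
    """Honor a user-specified compare_batch_size; otherwise auto-configure."""
    if "compare_batch_size" in config:
        return config["compare_batch_size"]
    batch_size = auto_compare_batch_size(num_comparisons)
    logger.info(
        "Auto-selected compare_batch_size=%d for %d comparisons",
        batch_size, num_comparisons,
    )
    return batch_size
```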

Tasks

  1. Implement a function to calculate an appropriate compare_batch_size based on the number of pairwise comparisons. We need to consider factors like:
    • Total number of comparisons
    • Available system resources (e.g., CPU cores, memory)
    • Typical LLM response times
  2. Modify the execute method to use this function when compare_batch_size isn't user-specified.
  3. Add appropriate logging to let users know what batch size we've automatically selected.
  4. Update our documentation to explain this new adaptive behavior.
  5. Implement unit tests to verify that the automatic configuration works as expected.
  6. (Optional) Consider adding a configuration option to enable/disable this automatic sizing.
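To make the testing task (5) concrete, here are a few hypothetical pytest-style checks against the auto_compare_batch_size and pick_batch_size sketches above:

```python
def test_respects_bounds():
    # Tiny and huge workloads should hit the floor and ceiling respectively.
    assert auto_compare_batch_size(0) == MIN_BATCH_SIZE
    assert auto_compare_batch_size(10**10) == MAX_BATCH_SIZE

def test_scales_with_workload():
    # More comparisons should never yield a smaller batch size.
    sizes = [auto_compare_batch_size(n) for n in (10**3, 10**4, 10**5)]
    assert sizes == sorted(sizes)

def test_user_override_wins():
    # Auto-sizing must not override an explicit user setting.
    assert pick_batch_size({"compare_batch_size": 42}, 10**6) == 42
```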

Expected Outcome

  • Better performance for large datasets without needing manual tuning.
  • More efficient resource utilization across different hardware setups.
  • Maintained or improved efficiency for smaller datasets.

Additional Considerations

  • We need to make sure the automatic configuration doesn't negatively impact smaller datasets.
  • We should think about setting upper and lower bounds for the batch size to prevent extreme values.
  • It'd be good to evaluate how this affects total execution time and resource usage across various dataset sizes.
@shreyashankar added the efficiency and good first research issue labels on Sep 17, 2024
@sushruth2003
Contributor

Is this available to work on?

@shreyashankar
Collaborator Author

Yes!

@shreyashankar
Collaborator Author

#128 closes this
