
Adaptive compare_batch_size for Resolve Operator #1

Closed

shreyashankar opened this issue Sep 17, 2024 · 3 comments
Labels
efficiency (Making docetl operations run faster) · good first research issue (Good for newcomers who want to get involved in research)

Comments

@shreyashankar
Collaborator

Current Implementation

Our ResolveOperation class uses a Union-Find (aka Disjoint Set Union) algorithm for grouping similar items efficiently. Here's how it works:

We've got two main data structures:

  1. clusters: A list of sets, each representing a cluster of items.
  2. cluster_map: A dictionary that maps each item to its current cluster representative.

The key functions are:

  • find_cluster(item): Finds the representative of an item's cluster.
  • merge_clusters(item1, item2): When two items match, this merges their clusters. It finds the representatives of both items' clusters, then folds the smaller cluster into the larger one for efficiency (union by size), and updates cluster_map to reflect the merge.
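For concreteness, here's a minimal, self-contained sketch of these two structures and functions. This is illustrative only, not the actual ResolveOperation code; it assumes items are identified by integer indices:

```python
n_items = 5  # example dataset size (assumption for illustration)

# One cluster per item to start with.
clusters = [{i} for i in range(n_items)]
# Maps each item to its current cluster representative.
cluster_map = {i: i for i in range(n_items)}

def find_cluster(item):
    """Follow cluster_map until reaching an item that is its own representative."""
    while cluster_map[item] != item:
        item = cluster_map[item]
    return item

def merge_clusters(item1, item2):
    """Merge the clusters of two matching items (union by size)."""
    root1, root2 = find_cluster(item1), find_cluster(item2)
    if root1 == root2:
        return  # already in the same cluster
    # Fold the smaller cluster into the larger one.
    if len(clusters[root1]) < len(clusters[root2]):
        root1, root2 = root2, root1
    clusters[root1] |= clusters[root2]
    clusters[root2] = set()       # emptied; skipped when collecting results
    cluster_map[root2] = root1    # redirect the old representative
```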

The resolution process goes like this:

  1. Each item starts in its own cluster.
  2. We generate all possible item pairs to compare.
  3. We process these pairs in batches (controlled by compare_batch_size, default is 100).
  4. For each batch:
    a. We use an LLM to do pairwise comparisons and see if items match.
    b. For each matching pair, we call merge_clusters to combine their clusters.
  5. We keep doing this until we've compared all pairs.
  6. Finally, we collect all non-empty clusters as the result.

This approach lets us do efficient, incremental clustering as we find matches, without rebuilding the whole cluster structure after each match. Processing comparisons in batches means we can parallelize LLM calls, which helps with overall performance.
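Putting the pieces together, a simplified version of that batched loop might look like the following, building on the Union-Find sketch above. Here llm_compare is a hypothetical stand-in for the real LLM comparison call:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def llm_compare(a, b) -> bool:
    """Hypothetical placeholder: in practice, prompt an LLM with both items."""
    raise NotImplementedError

def resolve(items, compare_batch_size=100):
    # Step 1 happened above: every item starts in its own cluster.
    pairs = list(combinations(range(len(items)), 2))        # step 2: all pairs
    for start in range(0, len(pairs), compare_batch_size):  # step 3: batches
        batch = pairs[start:start + compare_batch_size]
        # Step 4a: run the batch's LLM comparisons concurrently.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            matches = list(pool.map(lambda p: llm_compare(items[p[0]], items[p[1]]), batch))
        # Step 4b: merge clusters for each matching pair.
        for (i, j), is_match in zip(batch, matches):
            if is_match:
                merge_clusters(i, j)
    return [c for c in clusters if c]                       # step 6: non-empty clusters
```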

Problem

The fixed compare_batch_size we're using now can lead to some performance issues, especially with large datasets. Here's what can happen:

  1. If the batch size is too small, we under-use parallelism: the comparisons run as many small, sequential rounds of LLM calls, which slows the whole operation down.
  2. If it's too large, we might overwhelm system memory or hit API rate limits.
  3. Our one-size-fits-all approach doesn't adapt to different dataset sizes or system capabilities.

This lack of flexibility can make execution times unnecessarily slow for large datasets, or lead to inefficient resource use for smaller ones.

Proposed Enhancement

We should automatically configure compare_batch_size based on the number of pairwise comparisons, but only if the user hasn't specified it themselves. This would help optimize performance for datasets of all sizes.
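As one possible starting point, here's a sketch of such a function. The square-root scaling heuristic, the bound values, and the pick_batch_size helper are all assumptions, not a prescribed design; it also sketches the logging mentioned in the tasks below:

```python
import logging
import math

logger = logging.getLogger(__name__)

# Assumed guardrails; see "Additional Considerations" below.
MIN_BATCH_SIZE = 10
MAX_BATCH_SIZE = 1000

def auto_compare_batch_size(num_comparisons):
    """Scale the batch size with the workload, clamped to sane bounds."""
    proposed = int(math.sqrt(num_comparisons)) if num_comparisons > 0 else MIN_BATCH_SIZE
    return max(MIN_BATCH_SIZE, min(proposed, MAX_BATCH_SIZE))

def pick_batch_size(config, num_comparisons):
    """Honor a user-specified compare_batch_size; otherwise auto-configure."""
    if "compare_batch_size" in config:
        return config["compare_batch_size"]
    batch_size = auto_compare_batch_size(num_comparisons)
    logger.info(
        "Auto-selected compare_batch_size=%d for %d comparisons",
        batch_size, num_comparisons,
    )
    return batch_size
```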

Tasks

  1. Implement a function to calculate an appropriate compare_batch_size based on the number of pairwise comparisons. We need to consider factors like:
    • Total number of comparisons
    • Available system resources (e.g., CPU cores, memory)
    • Typical LLM response times
  2. Modify the execute method to use this function when compare_batch_size isn't user-specified.
  3. Add appropriate logging to let users know what batch size we've automatically selected.
  4. Update our documentation to explain this new adaptive behavior.
  5. Implement unit tests to verify that the automatic configuration works as expected.
  6. (Optional) Consider adding a configuration option to enable/disable this automatic sizing.
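To make the testing task (5) concrete, here are a few hypothetical pytest-style checks against the auto_compare_batch_size and pick_batch_size sketches above:

```python
def test_respects_bounds():
    # Tiny and huge workloads should hit the floor and ceiling respectively.
    assert auto_compare_batch_size(0) == MIN_BATCH_SIZE
    assert auto_compare_batch_size(10**10) == MAX_BATCH_SIZE

def test_scales_with_workload():
    # More comparisons should never yield a smaller batch size.
    sizes = [auto_compare_batch_size(n) for n in (10**3, 10**4, 10**5)]
    assert sizes == sorted(sizes)

def test_user_override_wins():
    # Auto-sizing must not override an explicit user setting.
    assert pick_batch_size({"compare_batch_size": 42}, 10**6) == 42
```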

Expected Outcome

  • Better performance for large datasets without needing manual tuning.
  • More efficient resource utilization across different hardware setups.
  • Maintained or improved efficiency for smaller datasets.

Additional Considerations

  • We need to make sure the automatic configuration doesn't negatively impact smaller datasets.
  • We should think about setting upper and lower bounds for the batch size to prevent extreme values.
  • It'd be good to evaluate how this affects total execution time and resource usage across various dataset sizes.
@shreyashankar added the efficiency and good first research issue labels on Sep 17, 2024
@sushruth2003
Contributor

Is this available to work on?

@shreyashankar
Collaborator Author

Yes!

@shreyashankar
Collaborator Author

#128 closes this
