Adaptive compare_batch_size for Resolve Operator #1
Labels: efficiency (making docetl operations run faster), good first research issue (good for newcomers who want to get involved in research)
Current Implementation
Our `ResolveOperation` class uses a Union-Find (aka Disjoint Set Union) algorithm for grouping similar items efficiently. Here's how it works:

We've got two main data structures:
- `clusters`: A list of sets, each representing a cluster of items.
- `cluster_map`: A dictionary that maps each item to its current cluster representative.

The key functions are:
- `find_cluster(item)`: Finds the representative of an item's cluster.
- `merge_clusters(item1, item2)`: When two items match, this merges their clusters. It finds the reps for both items' clusters, then combines the smaller cluster into the larger one for efficiency. The `cluster_map` gets updated to reflect this merge.

The resolution process goes like this:
- Comparisons are processed in batches (the size is set by `compare_batch_size`, default is 100).
  a. We use an LLM to do pairwise comparisons and see if items match.
  b. For each matching pair, we call `merge_clusters` to combine their clusters.

This approach lets us do efficient, incremental clustering as we find matches, without rebuilding the whole cluster structure after each match. Processing comparisons in batches means we can parallelize LLM calls, which helps with overall performance.
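For readers new to the approach, here is a minimal, self-contained sketch of the clustering loop described above. It is not the actual `ResolveOperation` code: the `resolve` function, the `is_match` callback (standing in for the LLM comparison), the use of item indices as cluster keys, and comparing every possible pair are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def resolve(items, is_match, compare_batch_size=100):
    """Group items into clusters using batched pairwise comparisons (sketch)."""
    n = len(items)
    # Each index starts in its own cluster and is its own representative.
    clusters = [{i} for i in range(n)]
    cluster_map = {i: i for i in range(n)}

    def find_cluster(item):
        # Follow representative links until we reach a root.
        while cluster_map[item] != item:
            item = cluster_map[item]
        return item

    def merge_clusters(item1, item2):
        root1, root2 = find_cluster(item1), find_cluster(item2)
        if root1 == root2:
            return  # already in the same cluster
        # Fold the smaller cluster into the larger one.
        if len(clusters[root1]) < len(clusters[root2]):
            root1, root2 = root2, root1
        for member in clusters[root2]:
            cluster_map[member] = root1
        clusters[root1] |= clusters[root2]
        clusters[root2] = set()

    # Assumption: every pair is compared (no blocking or pre-filtering).
    pairs = list(combinations(range(n), 2))
    with ThreadPoolExecutor() as pool:
        for start in range(0, len(pairs), compare_batch_size):
            batch = pairs[start:start + compare_batch_size]
            # Comparisons within one batch run in parallel; is_match stands in
            # for the LLM call.
            results = list(
                pool.map(lambda p: is_match(items[p[0]], items[p[1]]), batch)
            )
            for (i, j), matched in zip(batch, results):
                if matched:
                    merge_clusters(i, j)

    return [[items[i] for i in sorted(c)] for c in clusters if c]


names = ["apple", "Apple", "banana", "BANANA", "cherry"]
print(resolve(names, lambda a, b: a.lower() == b.lower(), compare_batch_size=3))
# [['apple', 'Apple'], ['banana', 'BANANA'], ['cherry']]
```

Merges happen only after a batch's comparisons return, so the batch size controls how much work runs in parallel before the cluster structure is updated.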
Problem
The fixed `compare_batch_size` we're using now can lead to performance issues, especially with large datasets. This lack of flexibility can make execution times unnecessarily slow for large datasets, or lead to inefficient resource use for smaller ones.
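A quick back-of-the-envelope calculation shows the scale involved. The sketch below assumes every pair of items is compared, so `n` items yield `n * (n - 1) / 2` comparisons; the helper name `batch_stats` is hypothetical.

```python
import math


def batch_stats(num_items, compare_batch_size=100):
    """Return (number of pairwise comparisons, number of batches)."""
    num_pairs = num_items * (num_items - 1) // 2
    num_batches = math.ceil(num_pairs / compare_batch_size)
    return num_pairs, num_batches


for n in (10, 1000, 5000):
    print(n, batch_stats(n))
# 10 (45, 1)
# 1000 (499500, 4995)
# 5000 (12497500, 124975)
```

With the default of 100, ten items do not even fill one batch, while 5,000 items need roughly 125,000 batches processed one after another.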
Proposed Enhancement
We should automatically configure `compare_batch_size` based on the number of pairwise comparisons, but only if the user hasn't specified it themselves. This would help optimize performance for datasets of all sizes.
Tasks
- Add a function that sets `compare_batch_size` based on the number of pairwise comparisons. Several factors need to be considered here.
- Update the `execute` method to use this function when `compare_batch_size` isn't user-specified.
Expected Outcome
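To make the tasks above concrete, here is one possible shape for the sizing function. It is only a sketch: the function name `choose_compare_batch_size`, the `target_batches` idea, and the bounds of 10 and 1000 are placeholder assumptions, not values taken from docetl. Real limits would depend on things like API rate limits and memory.

```python
import math


def choose_compare_batch_size(
    num_pairs,
    user_value=None,
    target_batches=50,   # assumed: aim for roughly this many batches
    min_size=10,         # assumed lower bound
    max_size=1000,       # assumed upper bound
):
    """Pick a batch size from the number of comparisons (hypothetical heuristic)."""
    # Always respect an explicit user setting.
    if user_value is not None:
        return user_value
    if num_pairs <= 0:
        return min_size
    # Scale the batch size with the workload, then clamp it to the bounds.
    size = math.ceil(num_pairs / target_batches)
    return max(min_size, min(max_size, size))


print(choose_compare_batch_size(45))                   # 10
print(choose_compare_batch_size(5000))                 # 100
print(choose_compare_batch_size(499500))               # 1000
print(choose_compare_batch_size(499500, user_value=250))  # 250
```

In `execute`, the call would sit right after the comparison pairs are counted, with `user_value` taken from the operation config when the user has set `compare_batch_size` there.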
Additional Considerations