LUCENE-10311: remove complex cost estimation and abstraction leakage around it #709
Conversation
Cost estimation drives the API complexity out of control; we don't need it. Hopefully I've cleared up all the API damage from this explosive leak. Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that!
Here's a first stab at what I proposed on #692. You can see how damaging the current cost() implementation is. As follow-up commits we can add the …
That change makes sense to me. FWIW, my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in …
If we want to add the … Adding the sugar method is easy; it is just work.
    sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
    final int l;
    if (multivalued) {
      l = dedup(concatenated.array, concatenated.length);
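For context, the dedup step referenced above could look roughly like this; a minimal sketch, not the exact Lucene source, assuming a sorted int[] of doc IDs:

    // Collapse runs of equal doc IDs in a sorted array and return the new length.
    static int dedup(int[] sortedDocs, int length) {
      if (length == 0) {
        return 0;
      }
      int unique = 1; // the first element is always kept
      for (int i = 1; i < length; i++) {
        if (sortedDocs[i] != sortedDocs[unique - 1]) {
          sortedDocs[unique++] = sortedDocs[i];
        }
      }
      return unique;
    }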
Do we really want to throw away this optimisation? We normally know whether our data is single- or multi-valued, so it seems wasteful not to exploit it.
This optimization doesn't make sense to me. Buffers should only be used for tiny sets (they are very expensive in memory).
Ok, I am convinced. Thanks!
      assert counter >= 0;
    - final long cost = Math.round(counter / numValuesPerDoc);
    - return new BitDocIdSet(bitSet, cost);
    + return new BitDocIdSet(bitSet);
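The intent, as the diff suggests, is that the single-argument constructor derives cost from the bitset itself instead of threading an externally tracked counter through the API. A sketch of that delegation (paraphrased, not necessarily the exact source):

    // Defer the cost to the set's own (possibly approximate) cardinality,
    // assuming a two-argument constructor BitDocIdSet(BitSet, long) exists.
    public BitDocIdSet(BitSet set) {
      this(set, set.approximateCardinality());
    }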
We still need to implement the method estimateCardinality, which is the hard bit.
I don't think it is difficult; it just requires a little work. I can get to it soon; seems like it should be fun. Ultimately I think it will give us better estimations than what we have today, without all the tangled APIs and abstraction leakage.
I like the idea of sampling, thanks.
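A sampling-based estimate over a FixedBitSet's backing long[] could look something like the following; a minimal sketch where the method name, signature, and stride are illustrative assumptions:

    // Popcount every `stride`-th word and scale the result to the whole array.
    static long estimateCardinality(long[] words, int stride) {
      long sampledBits = 0;
      int sampledWords = 0;
      for (int i = 0; i < words.length; i += stride) {
        sampledBits += Long.bitCount(words[i]);
        sampledWords++;
      }
      if (sampledWords == 0) {
        return 0;
      }
      // Scale the sampled population count back up to the full array.
      return sampledBits * words.length / sampledWords;
    }

A larger stride makes the estimate cheaper but noisier; stride = 1 degenerates to an exact cardinality count.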
I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worth adjusting how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing. The same goes for SimpleTextBKDReader#addAll.
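The "dance" being referred to is roughly the following pattern (paraphrased; the method and variable names here are illustrative): because IntersectVisitor#grow takes an int, addAll has to guard against overflow before calling it.

    // With grow(int), every caller must check that the point count fits in an int.
    void growForAddAll(PointValues.IntersectVisitor visitor,
                       long maxPointsInLeafNode, long numLeaves) {
      long maxPointCount = maxPointsInLeafNode * numLeaves;
      if (maxPointCount <= Integer.MAX_VALUE) { // can exceed an int with >2B points
        visitor.grow((int) maxPointCount);
      }
      // A grow(long) overload would let this collapse to a single call:
      // visitor.grow(maxPointCount);
    }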
Sorry, I'm not so familiar with the code in question. Does it mean we can remove the …?
I removed the parameter …
Oh, but that might still not be correct. The Buffers implementation does not grow with unique documents but with every call to BulkAdder#add, because it does not discard duplicates. What I did only works if I assume that, given Integer.MAX_VALUE, the builder can accept any number of calls to BulkAdder#add. Something is not totally right here, as Buffers requires knowing how many calls you are going to make to BulkAdder#add, not the number of unique documents you are adding.
There's no way we're allowing more than …
Seriously, look at when you call … : 64 bits are not needed.
What I want to make sure of is that this is covered in the javadocs and that we are not relying on an implementation detail. #grow needs to be called with the number of times you are going to call BulkAdder#addDoc, in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs, and maybe we should avoid the word "documents", which is causing all the confusion here. We can add that if grow is called with Integer.MAX_VALUE, there is no limit to the number of calls to BulkAdder#addDoc, or something along those lines. wdyt? Finally, we might need to modify the …
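Under that contract, usage would look something like the following; a sketch, with the surrounding variables (maxDoc, docs) assumed:

    // Reserve capacity for the number of upcoming add() calls, not the number
    // of unique documents; duplicates are discarded later, when build() runs.
    DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
    DocIdSetBuilder.BulkAdder adder = builder.grow(docs.length);
    for (int doc : docs) {
      adder.add(doc);
    }
    DocIdSet result = builder.build();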
If we add …
+1. That would hide the implementation details from users.
Confusion happens because grow(numAdds) reserves space for you to call add() up to numAdds times. When numAdds exceeds a "threshold" (maxDoc >> 8), we really don't care about big numbers at all: we'll switch to a FixedBitSet(maxDoc) with fixed-size storage, bounded only by maxDoc. But we can just add a one-liner grow(long) that doesn't require the caller to understand any of this, and hide it as an implementation detail.
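That one-liner could be as simple as the following sketch, assuming the existing grow(int) and the threshold behavior described above:

    // Anything at or above Integer.MAX_VALUE already exceeds every possible
    // threshold (maxDoc >> 8), so the builder upgrades to a FixedBitSet whose
    // storage is bounded by maxDoc no matter how large the request is.
    public BulkAdder grow(long numAdds) {
      return grow((int) Math.min(numAdds, Integer.MAX_VALUE));
    }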
@iverase @jpountz I "undrafted" the PR and added a commit with the … I also made some minor tweaks to the javadoc to try to simplify the explanation of what the grow parameter means. Again, it is kind of academic when you think about it; values larger than …
    - if ((long) totalAllocated + numDocs <= threshold) {
    -   ensureBufferCapacity(numDocs);
    + if ((long) totalAllocated + numAdds <= threshold) {
    +   ensureBufferCapacity(numAdds);
Are we not casting numAdds here back to a long again? I am fine with it.
For some reason I thought this method had become private; I find it weird to have two methods, #grow(long) and #grow(int).
I find it weird to have a grow(long) at all when no sizes above ~8.3 million (roughly Integer.MAX_VALUE >> 8, the largest possible threshold) matter at all. But I'm trying to compromise here.
> Are we not casting numAdds here back to a long again? I am fine with it.

Come on man, all I did was rename the local variable numDocs to numAdds to make things clear.
For the record, this DocIdSetBuilder.Buffer has been so damaging to our code; insanely, I'm still here trying to calm down the explosion of horribleness caused by it. I opened https://issues.apache.org/jira/browse/LUCENE-10443 as a second pass to this PR to try to really dig into the situation. Personally I am in favor of switching back to SparseFixedBitSet. Sticking with bitsets helps defend against so many bad decisions, such as bringing …
…d truncates" This reverts commit c64ee66.
I reverted adding the helper …
I reverted the changes to the BKD tree, as they are inconsistent with the current AssertingLeafReader implementation.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |