Conversation

@rmuir
Member

@rmuir rmuir commented Feb 24, 2022

Cost estimation drives the API complexity out of control; we don't need it. Hopefully I've cleared up all the API damage from this explosive leak.

Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that!

#11347

@rmuir rmuir requested review from iverase and jpountz February 24, 2022 14:59
@rmuir
Member Author

rmuir commented Feb 24, 2022

Here's a first stab at what I proposed on #692.

You can see how damaging the current cost() implementation is.

As follow-up commits we can add the grow(long) sugar that simply truncates, and we should optimize FixedBitSet.approximateCardinality(). After doing that, we should look around and see if there is any other similar damage to our APIs related to the fact that FixedBitSet had a slow approximateCardinality, and fix those, too.

@jpountz
Contributor

jpountz commented Feb 24, 2022

That change makes sense to me. FWIW my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in LSBRadixSorter#reorder, so it's OK to always deduplicate.
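For readers unfamiliar with the step being discussed: deduplicating an already-sorted doc-ID array is a single linear pass, which is why it is cheap compared to the radix sort. A hypothetical sketch (not Lucene's actual dedup code):

```java
// Hypothetical sketch of in-place deduplication of a sorted doc-ID array,
// in the spirit of DocIdSetBuilder's dedup step. Not the actual Lucene code.
class DedupSketch {
  // Returns the number of unique values; docs[0..returned) holds them in order.
  static int dedup(int[] docs, int length) {
    if (length == 0) {
      return 0;
    }
    int unique = 1;
    for (int i = 1; i < length; i++) {
      // Because the array is sorted, duplicates are always adjacent.
      if (docs[i] != docs[unique - 1]) {
        docs[unique++] = docs[i];
      }
    }
    return unique;
  }
}
```

Since the pass is branch-light and sequential, always running it costs little relative to the sort it follows.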

@rmuir
Member Author

rmuir commented Feb 24, 2022

If we want to add the grow(long) sugar method that simply truncates to Integer.MAX_VALUE and clean up all the points call sites, or write a cool FixedBitSet.approximateCardinality, feel free to push commits here. Otherwise I will get to these two things later and remove the draft status on the PR.

Adding the sugar method is easy; it is just work.
Implementing the approximateCardinality requires some thought and probably some benchmarking. I had in mind to just "sample" some "chunks" of the long[] and sum up Long.bitCount across the ranges. In an upcoming JDK this method will be vectorized; let's take advantage of that, so both cardinality() and approximateCardinality() would get faster: openjdk/jdk#6857
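The sampling idea described above could look something like the following. This is a hypothetical sketch over a bit set's backing long[] words, with made-up chunk sizes; the real FixedBitSet implementation may differ:

```java
// Hypothetical sketch of a sampling-based approximate cardinality:
// count bits in one chunk out of every few, then scale the sampled
// density up to the whole array. Not the actual FixedBitSet code.
class SamplingCardinality {
  static long approximateCardinality(long[] bits) {
    final int chunkWords = 16; // words per sampled chunk (illustrative)
    final int sampleEvery = 4; // sample one chunk out of every 4 (illustrative)
    long sampledBits = 0;
    long sampledWords = 0;
    for (int start = 0; start < bits.length; start += chunkWords * sampleEvery) {
      int end = Math.min(start + chunkWords, bits.length);
      for (int i = start; i < end; i++) {
        sampledBits += Long.bitCount(bits[i]); // vectorizable in newer JDKs
      }
      sampledWords += end - start;
    }
    if (sampledWords == 0) {
      return 0;
    }
    // Extrapolate the sampled bit density across the full word count.
    return (sampledBits * bits.length) / sampledWords;
  }
}
```

Sampling trades accuracy for speed: for a uniformly dense set the estimate is close, while the worst case (all set bits clustered in unsampled chunks) would need benchmarking, as noted above.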

```java
sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
final int l;
if (multivalued) {
  l = dedup(concatenated.array, concatenated.length);
```
Contributor

Do we really want to throw away this optimisation? We normally know if our data is single- or multi-valued, so it seems wasteful not to exploit it.

Member Author

This optimization doesn't make sense to me. Buffers should only be used for tiny sets (they are very memory-expensive).

Contributor

Ok, I am convinced. Thanks!

```diff
 assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);
```
Contributor

We still need to implement the method estimateCardinality, which is the hard bit.

Member Author

I don't think it is difficult, it just requires a little work. I can get to it soon, seems like it should be fun. Ultimately I think it will give us better estimations than what we have today, without all the tangled APIs and abstraction leakage.

Contributor

I like the idea of sampling, thanks

@iverase
Contributor

iverase commented Feb 24, 2022

I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing.

The same goes for SimpleTextBKDReader#addAll.

@rmuir
Member Author

rmuir commented Feb 24, 2022

> I don't think the grow(long) is necessary, we can always added to the IntersectVisitor instead. Maybe would be worthy to adjust how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing

Sorry, I'm not so familiar with the code in question. Does it mean we can remove the grown parameter here and the split logic around it for the addAll() method? If so, that sounds great!

@iverase
Contributor

iverase commented Feb 25, 2022

I removed the parameter grown from addAll in 4c6b436.

@iverase
Contributor

iverase commented Feb 25, 2022

Oh, but that might still not be correct. The Buffers implementation does not grow with unique documents but with every call to BulkAdder#add, because it does not discard duplicates. What I did only works if I assume that, given Integer.MAX_VALUE, the builder can accept any number of calls to BulkAdder#add. Something is not totally right here, as Buffers needs to know how many calls you are going to make to BulkAdder#add, not the number of unique documents you are adding.

@rmuir
Member Author

rmuir commented Feb 25, 2022

There's no way we're allowing more than Integer.MAX_VALUE calls going to this buffers thing.

@rmuir
Member Author

rmuir commented Feb 25, 2022

Seriously, look at threshold: it's maxDoc >>> 7, and maxDoc is an int.

When you call grow() with anything close to Integer.MAX_VALUE, buffers exits the stage permanently.

64 bits are not needed.

@iverase
Contributor

iverase commented Feb 25, 2022

What I want to make sure of is that this is covered in the javadocs and that we are not relying on an implementation detail.

#grow needs to be called with the number of times you are going to call BulkAdder#addDoc, in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs, and maybe we should avoid the word documents, which is causing all the confusion here.

We can add that if grow is called with Integer.MAX_VALUE, there is no limit to the calls to BulkAdder#addDoc, or something along those lines.

wdyt?

Finally, we might need to modify the AssertingLeafReader, as it asserts that you call #grow with the number of points you are going to visit, which in this case is not true all the time.

@rmuir
Member Author

rmuir commented Feb 26, 2022

If we add a grow(long) that simply truncates and forwards, then it encapsulates this within the class. The code stays simple and the caller doesn't need to know about it.

@iverase
Contributor

iverase commented Feb 27, 2022

+1 That would hide the implementation details from users.

rmuir added 2 commits March 3, 2022 06:47
Confusion happens because grow(numAdds) reserves space for you to call
add() up to numAdds times.

When numAdds exceeds a "threshold" (maxdoc >> 8), we really don't care
about big numbers at all: we'll switch to a FixedBitSet(maxDoc) with
fixed sized storage, bounded only by maxDoc.

But we can just add a one-liner grow(long) that doesn't require the
caller to understand any of this, and hide it as an implementation detail.
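The truncate-and-forward one-liner described in this commit message could be as small as the following. This is a hypothetical sketch, not the exact Lucene signature: anything at or above Integer.MAX_VALUE behaves the same, because past the threshold the builder switches to a FixedBitSet whose storage is bounded by maxDoc anyway.

```java
// Hypothetical sketch of the grow(long) sugar discussed above: truncate
// the long to the int range and forward to the existing grow(int).
// Not the actual Lucene method.
class GrowSketch {
  // Clamp a 64-bit reservation request into the int range that the
  // underlying builder actually cares about.
  static int truncate(long numAdds) {
    return (int) Math.min(numAdds, Integer.MAX_VALUE);
  }
}
```

The point of the one-liner is exactly what the commit message says: the caller never needs to know about the maxDoc-bounded threshold inside the builder.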
@rmuir rmuir marked this pull request as ready for review March 3, 2022 22:23
@rmuir
Member Author

rmuir commented Mar 3, 2022

@iverase @jpountz I "undrafted" the PR and added a commit with the grow(long) that just truncates-n-forwards. It seems like the best compromise based on discussion above.

I also made some minor tweaks to the javadoc to try to simplify the explanation of what the grow parameter means. Again, it is kind of academic when you think about it: values larger than maxDoc >> 8 are not really needed by any code, because we switch to the FixedBitSet. But the one-liner method doesn't bother me that much; I am just after keeping logic simple and abstractions minimal.

```diff
-if ((long) totalAllocated + numDocs <= threshold) {
-  ensureBufferCapacity(numDocs);
+if ((long) totalAllocated + numAdds <= threshold) {
+  ensureBufferCapacity(numAdds);
```
Contributor

Are we not casting numAdds here back to a long again? I am fine with it.

Contributor

For some reason I thought this method became private; I find it weird to have two methods, #grow(long) and #grow(int).

Member Author

I find it weird to have a grow(long) at all, when no sizes above ~8.3 million matter. But I'm trying to compromise here.

Member Author

> Are we not casting numAdds here back to a long again? I am fine with it.

Come on, man, all I did was rename the local variable numDocs to numAdds to make things clearer.
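As an aside on the cast being discussed: the (long) in "(long) totalAllocated + numAdds <= threshold" is what protects the comparison from int overflow. A hypothetical illustration with made-up helper names, not Lucene code:

```java
// Why the (long) cast matters: without it, the int addition can wrap to a
// negative value and wrongly pass the threshold check. Illustrative only.
class OverflowSketch {
  // Widen to long before adding, so the sum cannot overflow.
  static boolean fitsWithCast(int totalAllocated, int numAdds, long threshold) {
    return (long) totalAllocated + numAdds <= threshold;
  }

  // Int addition first: near Integer.MAX_VALUE the sum wraps negative.
  static boolean fitsWithoutCast(int totalAllocated, int numAdds, long threshold) {
    return totalAllocated + numAdds <= threshold;
  }
}
```

With totalAllocated near Integer.MAX_VALUE, the uncast version wraps negative and incorrectly reports that the request fits under a small threshold.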

@rmuir
Member Author

rmuir commented Mar 4, 2022

For the record, this DocIdSetBuilder.Buffer has been so damaging to our code that, insanely, I'm still here trying to calm down the explosion of horribleness it caused.

I opened https://issues.apache.org/jira/browse/LUCENE-10443 as a second pass to this PR to try to really dig into the situation. Personally I am in favor of switching back to SparseFixedBitSet.

Sticking with bitsets helps defend against so many bad decisions, such as bringing long into these APIs when it's not needed. I actually don't mind if performance drops a bit using SparseFixedBitSet instead of this horrible "buffer". We have to take care of complexity.

@rmuir
Member Author

rmuir commented Mar 4, 2022

I reverted adding the helper grow(long). I won't be the one bringing 64 bits into this API. It builds doc-ID sets. It is an implementation detail that, for small sets, it may use an inefficient list (for now).

@iverase
Contributor

iverase commented Mar 5, 2022

I reverted the changes to the BKD tree as it is inconsistent with the current AssertingLeafReader implementation.

@github-actions
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@github-actions github-actions bot removed the Stale label Oct 2, 2025
@github-actions github-actions bot added the Stale label Oct 17, 2025