Conversation

@rmuir
Member

@rmuir rmuir commented Feb 24, 2022

Cost estimation drives the API complexity out of control; we don't need it. Hopefully I've cleared up all the API damage from this explosive leak.

Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that!

#11347

@rmuir rmuir requested review from iverase and jpountz February 24, 2022 14:59
@rmuir
Member Author

rmuir commented Feb 24, 2022

Here's a first stab at what I proposed on #692.

You can see how damaging the current cost() implementation is.

As follow-up commits we can add the grow(long) sugar that simply truncates, and we should optimize FixedBitSet.approximateCardinality(). After doing that, we should look around and see if there is any other similar damage to our APIs related to the fact that FixedBitSet had a slow approximateCardinality, and fix those, too.

@jpountz
Contributor

jpountz commented Feb 24, 2022

That change makes sense to me. FWIW my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in LSBRadixSorter#reorder, so it's OK to always deduplicate.
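For readers unfamiliar with the step being discussed: deduplicating an already-sorted doc-ID array is a single linear pass, which is why it is cheap compared to the radix sort. A hypothetical sketch (not Lucene's actual dedup code):

```java
// Hypothetical sketch of in-place deduplication of a sorted doc-ID array,
// in the spirit of DocIdSetBuilder's dedup step. Not the actual Lucene code.
class DedupSketch {
  // Returns the number of unique values; docs[0..returned) holds them in order.
  static int dedup(int[] docs, int length) {
    if (length == 0) {
      return 0;
    }
    int unique = 1;
    for (int i = 1; i < length; i++) {
      // Because the array is sorted, duplicates are always adjacent.
      if (docs[i] != docs[unique - 1]) {
        docs[unique++] = docs[i];
      }
    }
    return unique;
  }
}
```

Since the pass is branch-light and sequential, always running it costs little relative to the sort it follows.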

@rmuir
Member Author

rmuir commented Feb 24, 2022

If we want to add the grow(long) sugar method that simply truncates to Integer.MAX_VALUE and clean up all the points call sites, or write a cool FixedBitSet.approximateCardinality, feel free to push commits here. Otherwise I will get to these two things later and remove the draft status on the PR.

Adding the sugar method is easy; it is just work.
Implementing the approximateCardinality requires some thought and probably some benchmarking. I had in mind to just "sample" some "chunks" of the long[] and sum up Long.bitCount across the ranges. In an upcoming JDK this method will be vectorized; let's take advantage of that, so both cardinality() and approximateCardinality() would get faster: openjdk/jdk#6857
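The sampling idea described above could look something like the following. This is a hypothetical sketch over a bit set's backing long[] words, with made-up chunk sizes; the real FixedBitSet implementation may differ:

```java
// Hypothetical sketch of a sampling-based approximate cardinality:
// count bits in one chunk out of every few, then scale the sampled
// density up to the whole array. Not the actual FixedBitSet code.
class SamplingCardinality {
  static long approximateCardinality(long[] bits) {
    final int chunkWords = 16; // words per sampled chunk (illustrative)
    final int sampleEvery = 4; // sample one chunk out of every 4 (illustrative)
    long sampledBits = 0;
    long sampledWords = 0;
    for (int start = 0; start < bits.length; start += chunkWords * sampleEvery) {
      int end = Math.min(start + chunkWords, bits.length);
      for (int i = start; i < end; i++) {
        sampledBits += Long.bitCount(bits[i]); // vectorizable in newer JDKs
      }
      sampledWords += end - start;
    }
    if (sampledWords == 0) {
      return 0;
    }
    // Extrapolate the sampled bit density across the full word count.
    return (sampledBits * bits.length) / sampledWords;
  }
}
```

Sampling trades accuracy for speed: for a uniformly dense set the estimate is close, while the worst case (all set bits clustered in unsampled chunks) would need benchmarking, as noted above.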

```java
sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
final int l;
if (multivalued) {
  l = dedup(concatenated.array, concatenated.length);
```
Contributor

Do we really want to throw away this optimisation? We normally know if our data is single- or multi-valued, so it seems wasteful not to exploit it.

Member Author

This optimization doesn't make sense to me. Buffers should only be used for tiny sets (they are very memory-expensive).

Contributor

Ok, I am convinced. Thanks!

```diff
 assert counter >= 0;
-final long cost = Math.round(counter / numValuesPerDoc);
-return new BitDocIdSet(bitSet, cost);
+return new BitDocIdSet(bitSet);
```
Contributor

We still need to implement the method estimateCardinality, which is the hard bit.

Member Author

I don't think it is difficult, it just requires a little work. I can get to it soon, seems like it should be fun. Ultimately I think it will give us better estimations than what we have today, without all the tangled APIs and abstraction leakage.

Contributor

I like the idea of sampling, thanks

@iverase
Contributor

iverase commented Feb 24, 2022

I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing.

The same goes for SimpleTextBKDReader#addAll.

@rmuir
Member Author

rmuir commented Feb 24, 2022

> I don't think the grow(long) is necessary, we can always added to the IntersectVisitor instead. Maybe would be worthy to adjust how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing

Sorry, I'm not so familiar with the code in question. Does it mean we can remove the grown parameter here and the split logic around it for the addAll() method? If so, that sounds great!

@iverase
Contributor

iverase commented Feb 25, 2022

I removed the parameter grown from addAll in 4c6b436.

@iverase
Contributor

iverase commented Feb 25, 2022

Oh, but that might still not be correct. The Buffers implementation does not grow with unique documents but with every call to BulkAdder#add, because it does not discard duplicates. What I did only works if I assume that, given Integer.MAX_VALUE, the builder can accept any number of calls to BulkAdder#add. Something is not totally right here, as Buffers needs to know how many calls you are going to make to BulkAdder#add, not the number of unique documents you are adding.

@rmuir
Member Author

rmuir commented Feb 25, 2022

There's no way we're allowing more than Integer.MAX_VALUE calls going to this buffers thing.

@rmuir
Member Author

rmuir commented Feb 25, 2022

Seriously, look at threshold: it's maxDoc >>> 7, and maxDoc is an int.

When you call grow() with anything close to Integer.MAX_VALUE, buffers exits the stage permanently.

64 bits are not needed.

@iverase
Contributor

iverase commented Feb 25, 2022

What I want to make sure of is that this is covered in the javadocs and that we are not relying on an implementation detail.

#grow needs to be called with the number of times you are going to call BulkAdder#addDoc, in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs, and maybe we should avoid the word documents, which is causing all the confusion here.

We can add that if grow is called with Integer.MAX_VALUE, there is no limit to the calls to BulkAdder#addDoc, or something along those lines.

wdyt?

Finally, we might need to modify the AssertingLeafReader, as it asserts that you call #grow with the number of points you are going to visit, which in this case is not true all the time.

@rmuir
Member Author

rmuir commented Feb 26, 2022

If we add a grow(long) that simply truncates and forwards, then it encapsulates this within the class. The code stays simple and the caller doesn't need to know about it.

@iverase
Contributor

iverase commented Feb 27, 2022

+1 That would hide the implementation details from users.

rmuir added 2 commits March 3, 2022 06:47
Confusion happens because grow(numAdds) reserves space for you to call
add() up to numAdds times.

When numAdds exceeds a "threshold" (maxdoc >> 8), we really don't care
about big numbers at all: we'll switch to a FixedBitSet(maxDoc) with
fixed sized storage, bounded only by maxDoc.

But we can just add a one-liner grow(long) that doesn't require the
caller to understand any of this, and hide it as an implementation detail.
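The truncate-and-forward one-liner described in this commit message could be as small as the following. This is a hypothetical sketch, not the exact Lucene signature: anything at or above Integer.MAX_VALUE behaves the same, because past the threshold the builder switches to a FixedBitSet whose storage is bounded by maxDoc anyway.

```java
// Hypothetical sketch of the grow(long) sugar discussed above: truncate
// the long to the int range and forward to the existing grow(int).
// Not the actual Lucene method.
class GrowSketch {
  // Clamp a 64-bit reservation request into the int range that the
  // underlying builder actually cares about.
  static int truncate(long numAdds) {
    return (int) Math.min(numAdds, Integer.MAX_VALUE);
  }
}
```

The point of the one-liner is exactly what the commit message says: the caller never needs to know about the maxDoc-bounded threshold inside the builder.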
@rmuir rmuir marked this pull request as ready for review March 3, 2022 22:23
@rmuir
Member Author

rmuir commented Mar 3, 2022

@iverase @jpountz I "undrafted" the PR and added a commit with the grow(long) that just truncates-n-forwards. It seems like the best compromise based on discussion above.

I also made some minor tweaks to the javadoc to try to simplify the explanation of what the grow parameter means. Again, it is kind of academic when you think about it: values larger than maxDoc >> 8 are not really needed by any code, because we switch to the FixedBitSet. But the one-liner method doesn't bother me that much; I am just after keeping logic simple and abstractions minimal.

```diff
-if ((long) totalAllocated + numDocs <= threshold) {
-  ensureBufferCapacity(numDocs);
+if ((long) totalAllocated + numAdds <= threshold) {
+  ensureBufferCapacity(numAdds);
```
Contributor

Are we not casting numAdds here back to a long again? I am fine with it.

Contributor

For some reason I thought this method became private; I find it weird to have two methods, #grow(long) and #grow(int).

Member Author

I find it weird to have a grow(long) at all, when no sizes above ~8.3 million matter. But I'm trying to compromise here.

Member Author

> Are we not casting numAdds here back to a long again? I am fine with it.

Come on, man, all I did was rename the local variable numDocs to numAdds to make things clearer.
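As an aside on the cast being discussed: the (long) in "(long) totalAllocated + numAdds <= threshold" is what protects the comparison from int overflow. A hypothetical illustration with made-up helper names, not Lucene code:

```java
// Why the (long) cast matters: without it, the int addition can wrap to a
// negative value and wrongly pass the threshold check. Illustrative only.
class OverflowSketch {
  // Widen to long before adding, so the sum cannot overflow.
  static boolean fitsWithCast(int totalAllocated, int numAdds, long threshold) {
    return (long) totalAllocated + numAdds <= threshold;
  }

  // Int addition first: near Integer.MAX_VALUE the sum wraps negative.
  static boolean fitsWithoutCast(int totalAllocated, int numAdds, long threshold) {
    return totalAllocated + numAdds <= threshold;
  }
}
```

With totalAllocated near Integer.MAX_VALUE, the uncast version wraps negative and incorrectly reports that the request fits under a small threshold.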

@rmuir
Member Author

rmuir commented Mar 4, 2022

For the record, this DocIdSetBuilder.Buffer has been so damaging to our code that, insanely, I'm still here trying to calm down the explosion of horribleness it caused.

I opened https://issues.apache.org/jira/browse/LUCENE-10443 as a second pass to this PR to try to really dig into the situation. Personally I am in favor of switching back to SparseFixedBitSet.

Sticking with bitsets helps defend against so many bad decisions, such as bringing long into these APIs when it's not needed. I actually don't mind if performance drops a bit using SparseFixedBitSet instead of this horrible "buffer". We have to take care of complexity.

@rmuir
Member Author

rmuir commented Mar 4, 2022

I reverted adding the helper grow(long). I won't be the one bringing 64 bits into this API. It builds doc-ID sets. It is an implementation detail that, for small sets, it may use an inefficient list (for now).

@iverase
Contributor

iverase commented Mar 5, 2022

I reverted the changes to the BKD tree as it is inconsistent with the current AssertingLeafReader implementation.

@github-actions
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@github-actions github-actions bot removed the Stale label Oct 2, 2025
@github-actions github-actions bot added the Stale label Oct 17, 2025