Conversation

@iverase iverase commented Feb 18, 2022

The idea in this PR is the following:

  1. Extract the sparse structure used to collect doc IDs in DocIdSetBuilder into a class called Buffers so it can be shared by the different implementations.
  2. Add a new DocIdSetBuilder implementation for points, called PointsDocIdSetBuilder, with the following API:
/**
 * Utility class to efficiently add many docs in one go.
 *
 * @see PointsDocIdSetBuilder#grow
 */
public abstract static class BulkAdder {
  public abstract void add(int doc);

  public void add(DocIdSetIterator iterator) throws IOException {
    int docID;
    while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      add(docID);
    }
  }
}

/**
 * Reserve space and return a {@link BulkAdder} object that can be used to visit up to {@code
 * numPoints} points.
 */
public BulkAdder grow(long numPoints);
  3. The DocIdSetBuilder API looks like this after the extraction of the points API:
/**
 * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
 * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
 * RoaringDocIdSet.Builder}.
 */
public void add(DocIdSetIterator iter) throws IOException;

/** Add a single document to this builder. */
public void add(int doc);

if (bitSet != null) {
  bitSet.or(iter);
  return;
}
@iverase (Contributor Author) commented on this snippet:
I haven't changed the implementation here, but this looks buggy to me: if the bitSet is not null, we no longer update the counter in this case?

@iverase iverase requested a review from rmuir February 18, 2022 12:46

rmuir commented Feb 19, 2022

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements. I've re-read the description of the JIRA several times this morning, but honestly I was confused about this before, too.

It seems the only purpose of the long is that we're doing a lot of elaborate work just to estimate the cardinality that we'll ultimately pass down to the bitset. I don't see any use of the long value other than this counter, and that's all we are doing with it. But surely a long isn't essential to this estimation; maybe we should just estimate it differently?

Sorry if my comment isn't very helpful, but I want to really understand the problem and why we need to bring 64 bits into this; currently it is very confusing. Perhaps we should remove this counter completely (temporarily) and use the other BitDocIdSet constructor. How much simpler does the code get then?

That turns the problem around into: how can we estimate the cost better? I don't think we should have to majorly reorganize the code just for this counter; it doesn't seem right at all. There are other ways we could estimate the cost than summing up the calls to grow().

For example, in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of the int array? I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).


LuXugang commented Feb 21, 2022

I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).

SparseFixedBitSet was indeed used to gather docs in the sparse case before LUCENE-6645.


iverase commented Feb 21, 2022

I opened #698, where I propose a different way to compute cost that reflects that the internal counter of DocIdSetBuilder refers to documents instead of values.


jpountz commented Feb 22, 2022

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make DocIdSetBuilder#grow take a long instead of an int.

Maybe it's a javadocs issue because DocIdSetBuilder#grow says that it returns "a BulkAdder object that can be used to add up to numDocs documents", which might suggest that numDocs is the number of unique documents contributed, when in fact this number is simply an upper bound of the number of times that you may call BulkAdder#add on the returned BulkAdder object.

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the long counter that you looked at.

The point of BulkAdder#add is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than Integer#MAX_VALUE (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. SortedSetDocValues#nextOrd returns a long.
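The arithmetic behind this can be sketched in isolation. This is a stand-alone illustration, not code from the PR; the segment size and values-per-doc figures are made up for the example:

```java
// Why the pair count needs a long: a large segment crossed with a
// multi-valued field can produce more (docID, value) pairs than an
// int can represent. The figures below are illustrative only.
public class PairCountDemo {
  public static void main(String[] args) {
    long docs = 1L << 30;        // docs in a hypothetical very large segment
    long valuesPerDoc = 4;       // assumed average matching values per doc
    long pairs = docs * valuesPerDoc;
    System.out.println(pairs);        // 4294967296, i.e. 2^32
    System.out.println((int) pairs);  // 0 -- the count is lost if cast to int
  }
}
```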

in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of int array?

This is already the case today, see the else block in DocIdSetBuilder#build. The cost estimation logic only happens in the dense case when a FixedBitSet is used to hold the set of matching docs.

FWIW we could change the estimation logic to perform a popCount over a subset of the FixedBitSet and scale it according to the size of the bitset or something along these lines, if we think that it would be better than tracking this counter and dividing it by the number of values per doc.
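That sampling idea could look roughly like the sketch below. A plain long[] stands in for FixedBitSet's backing words, and the sampling stride is an arbitrary assumption; neither is code from Lucene or this PR:

```java
// Hedged sketch of estimating cardinality by popcounting a subset of a
// bitset's 64-bit words and scaling the result up to the full array.
public class SampledCardinality {
  static long approximateCardinality(long[] words, int stride) {
    long sampled = 0;
    int visited = 0;
    for (int i = 0; i < words.length; i += stride) {
      sampled += Long.bitCount(words[i]); // exact popcount of sampled words
      visited++;
    }
    // scale the sampled popcount to the whole array
    return sampled * words.length / Math.max(1, visited);
  }

  public static void main(String[] args) {
    long[] words = new long[1024];
    java.util.Arrays.fill(words, 0x0F0F0F0F0F0F0F0FL); // 32 bits set per word
    // uniform data, so the estimate is exact here: 1024 * 32 = 32768
    System.out.println(approximateCardinality(words, 16));
  }
}
```

On skewed data the estimate degrades with larger strides; that accuracy/cost trade-off would be the crux of such a change.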

I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet

SparseFixedBitSet is the right choice for the sparse case when you need something that implements the BitSet API. Here we only need to produce a DocIdSet and buffering doc IDs into an array and sorting them using radix sort proved to be faster than accumulating doc IDs into a SparseFixedBitSet.
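The buffering approach described here amounts to sort-then-deduplicate. A minimal stand-alone sketch, using Arrays.sort where Lucene uses a radix sort:

```java
import java.util.Arrays;

// Sketch of the sparse-case buffer strategy: collect doc IDs unordered
// and with duplicates, then sort and compact them in place.
public class BufferedDocIds {
  public static void main(String[] args) {
    int[] buffer = {7, 3, 7, 1, 9, 3}; // doc IDs as they were added
    Arrays.sort(buffer);               // Lucene uses a radix sort here
    int unique = 0;
    for (int i = 0; i < buffer.length; i++) {
      if (i == 0 || buffer[i] != buffer[i - 1]) {
        buffer[unique++] = buffer[i];  // keep first occurrence only
      }
    }
    // ordered, duplicate-free doc IDs: [1, 3, 7, 9]
    System.out.println(Arrays.toString(Arrays.copyOf(buffer, unique)));
  }
}
```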


rmuir commented Feb 22, 2022

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make DocIdSetBuilder#grow take a long instead of an int.

I've really tried; I think I have to just give up. Having a grow(long) on something with DocIdSet in its name is beyond bad, it is terrible.

Please, please, please don't make this change to take a long.

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the long counter that you looked at.

The point of BulkAdder#add is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than Integer#MAX_VALUE (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. SortedSetDocValues#nextOrd returns a long.

Sure it does. I'm looking at the only code using the 64-bit value, and that's the counter.


jpountz commented Feb 22, 2022

Having a grow(long) on something with DocIdSet in its name is beyond bad, it is terrible.

Would it look better if we gave it a different name that doesn't suggest that it relates to the number of docs in the set, e.g. prepareAdd or something along these lines?

Please, please, please don't make this change to take a long.

I have a preference for making it a long but I'm ok with keeping it an integer. The downside is that it pushes the problem to callers, which need to make sure that they never add more than Integer.MAX_VALUE documents with the same BulkAdder.


rmuir commented Feb 22, 2022

[Screenshot, Feb 22 2022: the code in question, with the counter as the only consumer of the 64-bit value]

I've uploaded a screenshot here showing that the only thing using 64 bits is this stupid counter. Guys, we really have to agree on this simple fact to proceed. It is a fact!


iverase commented Feb 22, 2022

I don't understand all this discussion. Looking at the cost of a DocIdSetIterator:

  /**
   * Returns the estimated cost of this {@link DocIdSetIterator}.
   *
   * <p>This is generally an upper bound of the number of documents this iterator might match, but
   * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
   */
  public abstract long cost();

Why is a long ok here? I think the dance we are doing in the BKD reader when we are visiting more than Integer.MAX_VALUE points is wrong and should be fixed.


rmuir commented Feb 22, 2022

Yeah, there seems to be some disagreement about what the code is actually doing, probably because it is too confusing. I recommend (as I did before) temporarily removing the counter and cost estimation from here. Then you will see that 64 bits is not needed.


iverase commented Feb 22, 2022

If you go a bit higher up in that class:

[Screenshot: code higher up in the same class where the value is cast down to 32 bits]

We are throwing away 32 bits there now?


rmuir commented Feb 22, 2022

it's fine to do that since only 32 bits are needed.

nothing uses 64-bits here, hence changing the api signature to a long is wrong.


rmuir commented Feb 22, 2022

Seriously, let's remove this counter and cost estimation. @jpountz tells me I am wrong, but you can plainly see from the code that this issue is all about that. Everything else is only using 32 bits.

If we remove the silly counter and bad cost estimator, it will be clear that adding a long to this API is not needed: nothing needs the extra 32 bits, nothing uses the extra 32 bits!


jpountz commented Feb 24, 2022

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

DocIdSetBuilder takes doc IDs in random order with potential duplicates and creates a DocIdSet that can iterate over doc IDs in order without any duplicates. If you index a multi-valued field with points, a very large segment that has 2^30 docs might have 2^32 points matching a range query, which translates into 2^29 documents matching the query. So DocIdSetBuilder#add would be called 2^32 times and DocIdSetBuilder#build would result in a DocIdSet that has 2^29 documents. This long is measuring the number of calls to DocIdSetBuilder#add, hence the long.
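The calls-versus-documents distinction can be demonstrated at a small scale. The counts below are tiny stand-ins for the 2^32 and 2^29 figures, and java.util.BitSet stands in for the builder:

```java
import java.util.BitSet;

// Many add() calls -- one per (docID, value) pair -- collapse to fewer
// unique doc IDs in the built set. grow(long) must bound the calls,
// not the resulting set size.
public class AddCallsVsDocs {
  public static void main(String[] args) {
    BitSet set = new BitSet();
    long addCalls = 0;
    for (int doc = 0; doc < 1000; doc++) {
      for (int value = 0; value < 8; value++) { // multi-valued field
        set.set(doc);   // duplicate adds of the same doc are harmless
        addCalls++;
      }
    }
    System.out.println(addCalls);          // 8000 calls to add
    System.out.println(set.cardinality()); // 1000 unique docs
  }
}
```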

The naming may be wrong here, as the grow name probably suggests a number of docs rather than a number of calls to add, similarly to how ArrayUtil#grow is about the number of items in the array - not the number of times you set an index. Hopefully renaming it to prepareAdd(long numCallsToAdd) or something along these lines would help clarify.


rmuir commented Feb 24, 2022

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

I really think it will address the problem. I understand what is happening, but adding 32 more bits that merely get discarded also will not help anything. That's what is being discussed here.

It really is all about cost estimation, as that is the ONLY thing in this PR actually using the 32 extra bits. That's why I propose to simply use a different cost estimation instead. The current cost estimation explodes the complexity of this class: that's why we are tracking:

  • boolean multiValued
  • double numValuesPerDoc
  • long counter

There's no need (from an allocation perspective, which is all we should be concerned about here) to know about any numbers bigger than Integer.MAX_VALUE; if we get anywhere near numbers that big, we should be switching over to the FixedBitSet representation.


iverase commented Feb 24, 2022

32 bits will need to be discarded anyway; the issue is where.

You either do it at the PointValues level by calling grow like:

visitor.grow((int) Math.min(getDocCount(), pointTree.size()));

Or you discard them in the DocIdSetBuilder and allow grow to be called like:

visitor.grow(pointTree.size());


rmuir commented Feb 24, 2022

If this is literally all about a "style" issue, then let's be open and honest about that. I am fine with:

/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}

But I think it is wrong to have constructors taking Terms and PointValues already: it is just more useless complexity and "leaky abstraction" from the terrible cost estimation.

And I definitely think having two separate classes just for the cost estimation is way too much.


rmuir commented Feb 24, 2022

To try to be more helpful, here's what I'd propose. I can try to hack up a draft PR later if it's helpful.

DocIdSetBuilder, remove complex cost estimation:

  • remove DocIdSetBuilder(int, Terms) constructor
  • remove DocIdSetBuilder(int, PointValues) constructor
  • remove DocIdSetBuilder.counter member
  • remove DocIdSetBuilder.multiValued member
  • remove DocIdSetBuilder.numValuesPerDoc member

DocIdSetBuilder: add sugar grow(long) for style purposes:

/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}

FixedBitSet: implement approximateCardinality() and simply use it when estimating cost() here.


rmuir commented Feb 24, 2022

prototype: #709


iverase commented Mar 16, 2022

won't happen

@iverase iverase closed this Mar 16, 2022