Conversation

@iverase iverase commented Feb 18, 2022

The idea in this PR is the following:

  1. Extract the sparse structure used to collect doc IDs in DocIdSetBuilder into a class called Buffers so it can be shared by the different implementations.
  2. Add a new DocIdSetBuilder implementation for points, called PointsDocIdSetBuilder, with the following API:
/**
 * Utility class to efficiently add many docs in one go.
 *
 * @see PointsDocIdSetBuilder#grow
 */
public abstract static class BulkAdder {
  public abstract void add(int doc);

  public void add(DocIdSetIterator iterator) throws IOException {
    int docID;
    while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      add(docID);
    }
  }
}

/**
 * Reserve space and return a {@link BulkAdder} object that can be used to visit up to {@code
 * numPoints} points.
 */
public BulkAdder grow(long numPoints);
  3. The DocIdSetBuilder API looks like this after the extraction of the points API:
/**
 * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
 * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
 * RoaringDocIdSet.Builder}.
 */
public void add(DocIdSetIterator iter) throws IOException;

/** Add a single document to this builder. */
public void add(int doc);

if (bitSet != null) {
  bitSet.or(iter);
  return;
}
@iverase (Contributor Author) commented on this snippet:
I haven't changed the implementation here, but this looks buggy to me: if the bitSet is not null, we no longer update the counter in this case?

@iverase iverase requested a review from rmuir February 18, 2022 12:46

rmuir commented Feb 19, 2022

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements. I've re-read the description of the JIRA several times this morning, but honestly I was confused about this before, too.

It seems the only purpose of the long is that we're doing a lot of elaborate work just to estimate the cardinality that we'll ultimately pass down to the bitset. I don't see any use of the long value other than this counter, and that's all we are doing with it. But surely a long isn't essential to this estimation; maybe we should just estimate it differently?

Sorry if my comment isn't very helpful, but I want to really understand the problem and why we need to bring 64 bits into this; currently it is very confusing. Perhaps we should remove this counter completely (temporarily) and use the other BitDocIdSet constructor. How much simpler does the code get then?

That turns the problem around into: how can we estimate the cost better? I don't think we should have to majorly reorganize the code just for this counter; it doesn't seem right at all. There are other ways we could estimate the cost than summing up the calls to grow().

For example, in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of the int array? I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).


LuXugang commented Feb 21, 2022

I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).

SparseFixedBitSet was indeed used to gather docs in the sparse case before LUCENE-6645.


iverase commented Feb 21, 2022

I opened #698, where I propose a different way to compute cost that reflects that the internal counter of DocIdSetBuilder refers to documents instead of values.


jpountz commented Feb 22, 2022

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make DocIdSetBuilder#grow take a long instead of an int.

Maybe it's a javadocs issue because DocIdSetBuilder#grow says that it returns "a BulkAdder object that can be used to add up to numDocs documents", which might suggest that numDocs is the number of unique documents contributed, when in fact this number is simply an upper bound of the number of times that you may call BulkAdder#add on the returned BulkAdder object.

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the long counter that you looked at.

The point of BulkAdder#add is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than Integer#MAX_VALUE (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. SortedSetDocValues#nextOrd returns a long.
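The arithmetic behind this can be sketched in isolation. This is a stand-alone illustration, not code from the PR; the segment size and values-per-doc figures are made up for the example:

```java
// Why the pair count needs a long: a large segment crossed with a
// multi-valued field can produce more (docID, value) pairs than an
// int can represent. The figures below are illustrative only.
public class PairCountDemo {
  public static void main(String[] args) {
    long docs = 1L << 30;        // docs in a hypothetical very large segment
    long valuesPerDoc = 4;       // assumed average matching values per doc
    long pairs = docs * valuesPerDoc;
    System.out.println(pairs);        // 4294967296, i.e. 2^32
    System.out.println((int) pairs);  // 0 -- the count is lost if cast to int
  }
}
```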

in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of int array?

This is already the case today, see the else block in DocIdSetBuilder#build. The cost estimation logic only happens in the dense case when a FixedBitSet is used to hold the set of matching docs.

FWIW we could change the estimation logic to perform a popCount over a subset of the FixedBitSet and scale it according to the size of the bitset or something along these lines, if we think that it would be better than tracking this counter and dividing it by the number of values per doc.
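That sampling idea could look roughly like the sketch below. A plain long[] stands in for FixedBitSet's backing words, and the sampling stride is an arbitrary assumption; neither is code from Lucene or this PR:

```java
// Hedged sketch of estimating cardinality by popcounting a subset of a
// bitset's 64-bit words and scaling the result up to the full array.
public class SampledCardinality {
  static long approximateCardinality(long[] words, int stride) {
    long sampled = 0;
    int visited = 0;
    for (int i = 0; i < words.length; i += stride) {
      sampled += Long.bitCount(words[i]); // exact popcount of sampled words
      visited++;
    }
    // scale the sampled popcount to the whole array
    return sampled * words.length / Math.max(1, visited);
  }

  public static void main(String[] args) {
    long[] words = new long[1024];
    java.util.Arrays.fill(words, 0x0F0F0F0F0F0F0F0FL); // 32 bits set per word
    // uniform data, so the estimate is exact here: 1024 * 32 = 32768
    System.out.println(approximateCardinality(words, 16));
  }
}
```

On skewed data the estimate degrades with larger strides; that accuracy/cost trade-off would be the crux of such a change.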

I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet

SparseFixedBitSet is the right choice for the sparse case when you need something that implements the BitSet API. Here we only need to produce a DocIdSet and buffering doc IDs into an array and sorting them using radix sort proved to be faster than accumulating doc IDs into a SparseFixedBitSet.
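The buffering approach described here amounts to sort-then-deduplicate. A minimal stand-alone sketch, using Arrays.sort where Lucene uses a radix sort:

```java
import java.util.Arrays;

// Sketch of the sparse-case buffer strategy: collect doc IDs unordered
// and with duplicates, then sort and compact them in place.
public class BufferedDocIds {
  public static void main(String[] args) {
    int[] buffer = {7, 3, 7, 1, 9, 3}; // doc IDs as they were added
    Arrays.sort(buffer);               // Lucene uses a radix sort here
    int unique = 0;
    for (int i = 0; i < buffer.length; i++) {
      if (i == 0 || buffer[i] != buffer[i - 1]) {
        buffer[unique++] = buffer[i];  // keep first occurrence only
      }
    }
    // ordered, duplicate-free doc IDs: [1, 3, 7, 9]
    System.out.println(Arrays.toString(Arrays.copyOf(buffer, unique)));
  }
}
```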


rmuir commented Feb 22, 2022

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make DocIdSetBuilder#grow take a long instead of an int.

I've really tried; I think I have to just give up. Having a grow(long) on something with DocIdSet in its name is beyond bad, it is terrible.

Please, please, please don't make this change to take a long.

I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the long counter that you looked at.

The point of BulkAdder#add is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than Integer#MAX_VALUE (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. SortedSetDocValues#nextOrd returns a long.

Sure it does. I'm looking at the only code using the 64-bit value, and that's the counter.


jpountz commented Feb 22, 2022

Having a grow(long) on something with DocIdSet in its name is beyond bad, it is terrible.

Would it look better if we gave it a different name that doesn't suggest that it relates to the number of docs in the set, e.g. prepareAdd or something along these lines?

Please, please, please don't make this change to take a long.

I have a preference for making it a long but I'm ok with keeping it an integer. The downside is that it pushes the problem to callers, which need to make sure that they never add more than Integer.MAX_VALUE documents with the same BulkAdder.


rmuir commented Feb 22, 2022

[Screenshot, Feb 22 2022: the code in question, with the counter as the only consumer of the 64-bit value]

I've uploaded a screenshot here showing that the only thing using 64 bits is this stupid counter. Guys, we really have to agree on this simple fact to proceed. It is a fact!


iverase commented Feb 22, 2022

I don't understand all this discussion. Looking at the cost of a DocIdSetIterator:

  /**
   * Returns the estimated cost of this {@link DocIdSetIterator}.
   *
   * <p>This is generally an upper bound of the number of documents this iterator might match, but
   * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
   */
  public abstract long cost();

Why is a long ok here? I think the dance we are doing in the BKD reader when we are visiting more than Integer.MAX_VALUE points is wrong and should be fixed.


rmuir commented Feb 22, 2022

Yeah, there seems to be some disagreement about what the code is actually doing, probably because it is too confusing. I recommend (as I did before) temporarily removing the counter and cost estimation from here. Then you will see that 64 bits is not needed.


iverase commented Feb 22, 2022

If you go a bit higher up in that class:

[Screenshot: code higher up in the same class where the value is cast down to 32 bits]

We are throwing away 32 bits there now?


rmuir commented Feb 22, 2022

it's fine to do that since only 32 bits are needed.

nothing uses 64-bits here, hence changing the api signature to a long is wrong.


rmuir commented Feb 22, 2022

Seriously, let's remove this counter and cost estimation. @jpountz tells me I am wrong, but you can plainly see from the code that this issue is all about that. Everything else is only using 32 bits.

If we remove the silly counter and bad cost estimator, it will be clear that adding a long to this API is not needed: nothing needs the extra 32 bits, nothing uses the extra 32 bits!


jpountz commented Feb 24, 2022

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

DocIdSetBuilder takes doc IDs in random order with potential duplicates and creates a DocIdSet that can iterate over doc IDs in order without any duplicates. If you index a multi-valued field with points, a very large segment that has 2^30 docs might have 2^32 points matching a range query, which translates into 2^29 documents matching the query. So DocIdSetBuilder#add would be called 2^32 times and DocIdSetBuilder#build would result in a DocIdSet that has 2^29 documents. This long is measuring the number of calls to DocIdSetBuilder#add, hence the long.
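The calls-versus-documents distinction can be demonstrated at a small scale. The counts below are tiny stand-ins for the 2^32 and 2^29 figures, and java.util.BitSet stands in for the builder:

```java
import java.util.BitSet;

// Many add() calls -- one per (docID, value) pair -- collapse to fewer
// unique doc IDs in the built set. grow(long) must bound the calls,
// not the resulting set size.
public class AddCallsVsDocs {
  public static void main(String[] args) {
    BitSet set = new BitSet();
    long addCalls = 0;
    for (int doc = 0; doc < 1000; doc++) {
      for (int value = 0; value < 8; value++) { // multi-valued field
        set.set(doc);   // duplicate adds of the same doc are harmless
        addCalls++;
      }
    }
    System.out.println(addCalls);          // 8000 calls to add
    System.out.println(set.cardinality()); // 1000 unique docs
  }
}
```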

The naming may be wrong here, as the grow name probably suggests a number of docs rather than a number of calls to add, similarly to how ArrayUtil#grow is about the number of items in the array - not the number of times you set an index. Hopefully renaming it to prepareAdd(long numCallsToAdd) or something along these lines would help clarify.


rmuir commented Feb 24, 2022

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

I really think it will address the problem. I understand what is happening, but adding 32 more bits that merely get discarded also will not help anything. That's what is being discussed here.

It really is all about cost estimation, as that is the ONLY thing in this PR actually using the 32 extra bits. That's why I propose to simply use a different cost estimation instead. The current cost estimation explodes the complexity of this class: that's why we are tracking:

  • boolean multiValued
  • double numValuesPerDoc
  • long counter

There's no need (from an allocation perspective, which is all we should be concerned about here) to know about any numbers bigger than Integer.MAX_VALUE; if we get anywhere near numbers that big, we should be switching over to the FixedBitSet representation.


iverase commented Feb 24, 2022

32 bits will need to be discarded anyway; the issue is where.

You either do it at the PointValues level by calling grow like:

visitor.grow((int) Math.min(getDocCount(), pointTree.size()));

Or you discard them in the DocIdSetBuilder and allow grow to be called like:

visitor.grow(pointTree.size());


rmuir commented Feb 24, 2022

If this is literally all about a "style" issue, then let's be open and honest about that. I am fine with:

/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}

But I think it is wrong to have constructors taking Terms and PointValues already: it is just more useless complexity and "leaky abstraction" from the terrible cost estimation.

And I definitely think having two separate classes just for the cost estimation is way too much.


rmuir commented Feb 24, 2022

To try to be more helpful, here's what I'd propose. I can try to hack up a draft PR later if it's helpful.

DocIdSetBuilder, remove complex cost estimation:

  • remove DocIdSetBuilder(int, Terms) constructor
  • remove DocIdSetBuilder(int, PointValues) constructor
  • remove DocIdSetBuilder.counter member
  • remove DocIdSetBuilder.multiValued member
  • remove DocIdSetBuilder.numValuesPerDoc member

DocIdSetBuilder: add sugar grow(long) for style purposes:

/** sugar: to just make code look pretty, nothing else */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(Integer.MAX_VALUE, numDocs));
}

FixedBitSet: implement approximateCardinality() and simply use it when estimating cost() here.


rmuir commented Feb 24, 2022

prototype: #709


iverase commented Mar 16, 2022

won't happen

@iverase iverase closed this Mar 16, 2022