Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip docs with Docvalues in NumericLeafComparator #12405

Merged
merged 25 commits into from
Nov 9, 2023

Conversation

LuXugang
Copy link
Member

Description

Could we implement TermOrdValLeafComparator's same logic that using NumericDocValues to skip docs in NumericLeafComparator if we could not get a iterator by bkd?

@LuXugang LuXugang requested a review from jpountz June 29, 2023 14:51
@LuXugang
Copy link
Member Author

  public void test() throws IOException {
    final Directory dir = newDirectory();
    IndexWriterConfig config =
            new IndexWriterConfig()
                    // Make sure to use the default codec, otherwise some random points formats that have
                    // large values for maxPointsPerLeaf might not enable skipping with only 10k docs
                    .setCodec(TestUtil.getDefaultCodec());
    final IndexWriter writer = new IndexWriter(dir, config);
    final int numDocs = atLeast(10000);
    final int missValuesNumDocs = numDocs / 2;
    for (int i = 0; i < numDocs; ++i) {
      final Document doc = new Document();
      if (i <= missValuesNumDocs) { // missing value document
      } else {
        doc.add(new NumericDocValuesField("my_field", i));
        doc.add(new LongPoint("my_field", i));
      }
      writer.addDocument(doc);
    }
    final IndexReader reader = DirectoryReader.open(writer);
    writer.close();
    // single threaded so totalHits is deterministic
    IndexSearcher searcher =
            newSearcher(reader, random().nextBoolean(), random().nextBoolean(), false);
    final int numHits = 3;
    final int totalHitsThreshold = 3;

    { // test that optimization is run with NumericDocValues when missing value is NOT competitive
      final SortField sortField = new SortField("my_field", SortField.Type.LONG, true);
      sortField.setMissingValue(0L); // missing value is not competitive
      final Sort sort = new Sort(sortField);
      CollectorManager<TopFieldCollector, TopFieldDocs> manager =
              TopFieldCollector.createSharedManager(sort, numHits, null, totalHitsThreshold);
      TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), manager);
      assertEquals(topDocs.scoreDocs.length, numHits);
      assertEquals(
              topDocs.totalHits.value,
              numDocs); // assert that all documents were collected => optimization was not run
    }

    reader.close();
    dir.close();
  }

This test shows we could not skip document by bkd, but could use NumericDocValues to skip docs inthis RP.

@LuXugang LuXugang linked an issue Jun 29, 2023 that may be closed by this pull request
@jpountz
Copy link
Contributor

jpountz commented Jul 10, 2023

I'm not clear if this change is still correct when there is another sort field after the one that gets optimized. It seems like it could skip hits that are still needed.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is that the way to fix it would be to allow the comparator to know whether it's the only one, and if so it could make isMissingValueCompetitive return false when the bottom value is equal to the missing value.

@LuXugang
Copy link
Member Author

I'm not clear if this change is still correct when there is another sort field after the one that gets optimized

Thanks @jpountz. Oh, you are right, Sorry for missing this.

My intuition is that the way to fix it would be to allow the comparator to know whether it's the only one

The most simply way is to add a default boolean method like isOnlyComparator in LeafFieldComparator and call this method in LeafCollector. Is it a good way or worth for the comparator to know whether it's the only one?

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2023

We probably need to change SortField#getComparator. We introduced enableSkipping in the past, so that the first comparator in the chain could know it can dynamically prune, maybe it needs to become an enum, something like Pruning.(NONE|GREATER_THAN|GREATER_THAN_OR_EQUAL_TO), and we'd only use GREATER_THAN_OR_EQUAL_TO when there is a single comparator?

@LuXugang LuXugang requested a review from jpountz July 12, 2023 05:10
@jpountz
Copy link
Contributor

jpountz commented Jul 12, 2023

Thanks for adding the enum. In my view, we now need the two following changes:

  • isMissingValueCompetitive() should return false if the missing value is equal to the bottom value and the pruning is SKIP_MORE
  • The competitive iterator can better tune the min/max values. For instance currently, if the bottom value is 5 and the sort is ascending, we'll use a range on [MIN_VALUE, 5] to filter competitive hits. But with SKIP_MORE, we could now make it [MIN_VALUE, 4]. FYI we have NumericUtils#(add|subtract) to add/subtract on binary representations of numbers.

@LuXugang
Copy link
Member Author

Thanks for providing excellent advice.

isMissingValueCompetitive() should return false if the missing value is equal to the bottom value and the pruning is SKIP_MORE

addressed in 4a72bf6 and 94d560c

The competitive iterator can better tune the min/max values. For instance currently, if the bottom value is 5 and the sort is ascending, we'll use a range on [MIN_VALUE, 5] to filter competitive hits. But with SKIP_MORE, we could now make it [MIN_VALUE, 4]. FYI we have NumericUtils#(add|subtract) to add/subtract on binary representations of numbers.

addressed in 4d50317 and 416c333

@LuXugang LuXugang requested a review from jpountz July 18, 2023 08:56
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a suggestion, but this looks good to me in general. Can you add a CHANGES entry?

@LuXugang LuXugang requested a review from jpountz July 26, 2023 07:05
@jpountz
Copy link
Contributor

jpountz commented Jul 26, 2023

@LuXugang I pushed a commit with the changes I had in mind, what do you think?

@LuXugang
Copy link
Member Author

@LuXugang I pushed a commit with the changes I had in mind, what do you think?

It is nice simplication. Thanks.

@LuXugang
Copy link
Member Author

There is still a bug in this RP, test failed with
./gradlew test --tests TestSortOptimization.testRandomLong -Dtests.seed=6B2B316B7080952B -Dtests.locale=yav-Latn-CM -Dtests.timezone=Indian/Mahe -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I would fix it later.

@LuXugang
Copy link
Member Author

LuXugang commented Aug 1, 2023

There is still a bug in this RP, test failed with ./gradlew test --tests TestSortOptimization.testRandomLong -Dtests.seed=6B2B316B7080952B -Dtests.locale=yav-Latn-CM -Dtests.timezone=Indian/Mahe -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I would fix it later.

Sorry for the force-push and listt so much commits, this issue addressed in ed75fe7 and 9af4ba5

@LuXugang LuXugang requested a review from jpountz August 1, 2023 07:39
@mikemccand
Copy link
Member

@LuXugang @jpountz it looks like this PR went through some great discussions / iterations and was close towards the end, but it has accumulated some conflicts now?

@jpountz
Copy link
Contributor

jpountz commented Nov 3, 2023

Sorry, since I had approved the PR, I had not understood it was still waiting on me. It's a great change, let's see how to get it in.

@jpountz
Copy link
Contributor

jpountz commented Nov 3, 2023

I did my best at fixing conflicts, @LuXugang are you able to check the changes?

@jpountz
Copy link
Contributor

jpountz commented Nov 3, 2023

Tests fail because the optimization kicks in in more cases than the test expects, it's not clear to me yet if it's a bug or not.

@LuXugang
Copy link
Member Author

LuXugang commented Nov 6, 2023

Sure thing @jpountz , I would work on this in the next few days.

@LuXugang
Copy link
Member Author

LuXugang commented Nov 7, 2023

Tests fail because the optimization kicks in in more cases than the test expects, it's not clear to me yet if it's a bug or not.

You are correct @jpountz , the optimization kicks in and it is not a bug.

In that failed test, after the queue is full with the only one comparator(means Pruning.GREATER_THAN_OR_EQUAL_TO) , the bottom value is Long.MAX_VALUE, the missing value which is Long.MAX_VALUE would be non-competitive. then optimization kicks in.

@LuXugang LuXugang requested review from jpountz and removed request for jpountz November 8, 2023 02:48
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@LuXugang LuXugang merged commit a71d64a into apache:main Nov 9, 2023
4 checks passed
LuXugang added a commit that referenced this pull request Nov 9, 2023
* Skip document by docValues

*When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and bottom value is 5, we will use a range on [MIN_VALUE, 4].
---------

Co-authored-by: Adrien Grand <jpountz@gmail.com>
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Nov 9, 2023
Introduced in apache/lucene#12405

We should account for the changes in our overrides and API. Now, to indicate that no skipping can occur, we utilize `Pruning.NONE`.
mch2 added a commit to mch2/OpenSearch that referenced this pull request Nov 30, 2023
Updates to support interface change to FieldComparatorSource to use Pruning enum instead of enableSkipping boolean.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
mch2 added a commit to mch2/OpenSearch that referenced this pull request Dec 6, 2023
Updates to support interface change to FieldComparatorSource to use Pruning enum instead of enableSkipping boolean.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip docs with Docvalues in NumericLeafComparator
3 participants