Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Randomize KnnVector codec params in RandomCodec; addresses gh-14047 #14049

Merged
merged 4 commits into from
Dec 9, 2024

Conversation

msokolov
Copy link
Contributor

@msokolov msokolov commented Dec 6, 2024

This adds the randomization to RandomCodec and addresses some test issues.

  • In a few places I had to disable the randomization; basically wherever we are carefully testing scores. In these cases we won't have the same testing for quantized scores, but I think those would require more targeted testing since it's unclear what kinds of guarantees we can make about score values under quantization.
  • In a few other palces I added some slop to the score comparisons; basically if we had a score comparison in a test that was primarily focused on other things (like testing Explain)
  • I fixed a divide-by-zero edge case in the ScalarQuantizer (when there is only a single vector in a segment)
  • I standardized bounds checking so we have consistent exception messages for dimension mismatches

@msokolov
Copy link
Contributor Author

msokolov commented Dec 6, 2024

oops a test failed; my bad for only beasting the ones in core ... I'll look at this one:

gradlew :lucene:join:test --tests "org.apache.lucene.search.join.TestBlockJoin.testSimpleKnn" -Ptests.jvms=2 -Ptests.jvmargs= -Ptests.seed=E773842CB2A422CA

@benwtrent benwtrent self-requested a review December 6, 2024 19:25
@msokolov
Copy link
Contributor Author

msokolov commented Dec 6, 2024

Another test fails now (this is going to be fun, it's hard to ferret all these things out proactively):

TestBasicBackwardsCompatibility > testIndexOldIndex {Lucene-Version:10.0.0; Pattern: index.%1$s-cfs.zip} FAILED
    org.junit.ComparisonFailure: wrong first document expected:<4[4]> but was:<4[1]>
        at __randomizedtesting.SeedInfo.seed([22D75A8559728A8D:93B6C125DE4C6E18]:0)

Looking at this test, I can't tell where these assertions came from: I think they might be too ambitious? @s1monw did you invent this? The entire class was created out of whole cloth, but I assume at least some of these tests were pre-existing?

@s1monw
Copy link
Member

s1monw commented Dec 7, 2024

@msokolov I didn't invent this test. Yet, we never tested the current version obeying with it's own BWC expecations. Please make sure nothing randomizes anything in this test. BWC tests are not made for this. you need to supress anything that potentially randomizes stuff here.

Another test fails now (this is going to be fun, it's hard to ferret all these things out proactively):

maybe revert then? That's what it takes I guess?

@msokolov
Copy link
Contributor Author

msokolov commented Dec 7, 2024

Thank Simon, that's good feedback about avoiding randomization for these tests. There's nothing to revert - this hasn't been committed, it's just github running tests that is failing. I do wonder though if we run the risk of a user with a customized Codec (just applying settings like graph bushiness and enabling quantization requires a custom Codec, so it would be typical) failing to be able to load an older index version for some reason, and we'd have no advance warning from the back-compat tests. Maybe that shouldn't be a concern, I'm not sure. In any case I will disable the randomization for these tests.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think this should have a "CHANGES" entry for the quantization bug fix. That bug will happen with uniform vector values which is common in testing infrastructure.

@msokolov
Copy link
Contributor Author

msokolov commented Dec 9, 2024

makes sense, I'll add a CHANGES entry and commit

@msokolov msokolov merged commit 6b0112c into apache:main Dec 9, 2024
3 checks passed
@msokolov msokolov deleted the gh-14047 branch December 9, 2024 16:22
@msokolov msokolov added this to the 11.0.0 milestone Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants