Upgrade to Lucene 9.9.1 #2302
This looks fine to me. Just a heads up: we're in the process of releasing Lucene 9.9.1 to fix a couple of critical issues. If it suits, we can delay this PR a little (to, say, early next week) so as to pick up 9.9.1. Or merge it now and bump to 9.9.1 when it is available.
Codecov Report
Coverage diff, master vs. #2302:
              master    #2302    +/-
Coverage      64.29%    64.36%   +0.07%
Complexity    1333      1334     +1
Files         203       203
Lines         11300     11328    +28
Branches      1426      1429     +3
Hits          7265      7291     +26
Misses        3558      3558
Partials      477       479      +2
Sounds good! I need to run a bunch of tests anyway!
Hi @ChrisHegarty can you check if I'm using
Here are my preliminary experiments with cosDPR-distil on MS MARCO passage dev:
Default (fp32):
Quantized (int8):
QPS increased, but the index got bigger?! Not sure if it matters, but I'm giving
Wow, nice speedup. The bigger index is expected: we're storing the int8 quantized vectors side-by-side with the float32 vectors, so there is a ~25% increase in storage requirements. 26 * 1.25 = 32.5, which seems aligned with what you are observing. Only the quantized vectors are used at search time, so the raw vectors don't need to be loaded into memory.
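For reference, the ~25% figure follows from the per-dimension byte widths: 4 bytes for a float32 value plus 1 byte for its int8 copy, against 4 bytes for float32 alone. Below is a minimal Java sketch of that arithmetic. The class and method names are hypothetical, the input size of 100 is purely illustrative (not a measurement from this thread), and the estimate assumes vector data dominates the index, ignoring the HNSW graph and any other index files.

```java
public final class QuantizedIndexSizeEstimate {
    // Bytes per vector dimension for each representation.
    public static final int FLOAT32_BYTES = Float.BYTES; // 4
    public static final int INT8_BYTES = Byte.BYTES;     // 1

    /** Ratio of (float32 + int8) vector storage to float32-only vector storage. */
    public static double growthFactor() {
        return (double) (FLOAT32_BYTES + INT8_BYTES) / FLOAT32_BYTES;
    }

    /**
     * Rough size after adding int8 copies, in the same unit as the input.
     * Assumption: the input size is made up almost entirely of raw vectors.
     */
    public static double estimateQuantizedSize(double fp32Size) {
        return fp32Size * growthFactor();
    }

    public static void main(String[] args) {
        System.out.println("growth factor: " + growthFactor());
        // 100 is an illustrative size, not a number from the experiments above.
        System.out.println("estimated size: " + estimateQuantizedSize(100.0));
    }
}
```

Since real indexes also hold graph and metadata files that do not grow with quantization, the observed growth would be expected to come in at or slightly below this factor.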
@jpountz Ah, understood - thanks for the explanation. I'll run more experiments and then report back results.
@lintool Do your benchmarks use a recent JDK (>= JDK 20)? If so, can you confirm that the JDK Panama Vector API module is present at runtime (
Nope, this is still JDK 11. Will queue this up as something to look at!
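One way to answer the question above from inside a benchmark JVM is to ask the boot module layer whether the incubating Vector API module was resolved. This is a hedged sketch, not something taken from this PR: the class name is hypothetical, and it relies only on standard JDK calls (`ModuleLayer.boot()` and `Runtime.version()`), so it needs Java 10 or later to compile and run.

```java
public final class VectorModuleCheck {
    // Name of the JDK's incubating Vector API module.
    public static final String VECTOR_MODULE = "jdk.incubator.vector";

    /** Returns true if the named module was resolved into the boot layer of this JVM. */
    public static boolean isModulePresent(String name) {
        return ModuleLayer.boot().findModule(name).isPresent();
    }

    public static void main(String[] args) {
        System.out.println("Java feature version: " + Runtime.version().feature());
        System.out.println(VECTOR_MODULE + " present: " + isModulePresent(VECTOR_MODULE));
    }
}
```

Incubator modules are not resolved by default, so the check would normally report false unless the JVM was started with `--add-modules jdk.incubator.vector`.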
Hey @jpountz @ChrisHegarty, for HNSW indexing, how do I set the index writer config to generate larger segments? I'm setting
You can't: Lucene has a per-segment limit of ~2GB in memory (which usually translates into a smaller size on disk). https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)
I see. Thanks for the quick response!
FYI @jpountz @ChrisHegarty more thorough experiments documented in #2292
LGTM
@@ -190,7 +177,7 @@ public IndexCollection(Args args) throws Exception {
     }

     final Directory dir = FSDirectory.open(Paths.get(args.index));
-    final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer());
+    final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer()).setCodec(new Lucene99Codec());
lucene99 is the default in Lucene 9.9, so I guess it is fine not to specify it manually.
@@ -104,7 +100,7 @@ public IndexInvertedDenseVectors(Args args) {

     try {
       final Directory dir = FSDirectory.open(Paths.get(args.index));
-      final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene95Codec());
+      final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene99Codec());
Same as above; we can probably avoid setting the codec manually.
WIP, not ready for review.
All unit tests pass.
@ChrisHegarty @tteofili @jpountz