Upgrade to Lucene 9.9.1 #2302

lintool · 2023-12-13T09:53:51Z

WIP, not ready for review.
All unit tests pass.

ChrisHegarty · 2023-12-13T09:57:53Z

This looks fine to me. Just a heads-up: we're in the process of releasing Lucene 9.9.1 to fix a couple of critical issues. If it suits, we can delay this PR a little (to, say, early next week) so as to pick up 9.9.1, or merge it now and bump to 9.9.1 when it is available.

codecov · 2023-12-13T09:59:51Z

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (2c14a49) 64.29% compared to head (e6253e7) 64.36%.
Report is 1 commit behind head on master.

❗ Current head e6253e7 differs from pull request most recent head 2b6e14a. Consider uploading reports for the commit 2b6e14a to get more accurate results

Files	Patch %	Lines
.../java/io/anserini/index/IndexHnswDenseVectors.java	81.08%	4 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #2302      +/-   ##
============================================
+ Coverage     64.29%   64.36%   +0.07%     
- Complexity     1333     1334       +1     
============================================
  Files           203      203              
  Lines         11300    11328      +28     
  Branches       1426     1429       +3     
============================================
+ Hits           7265     7291      +26     
  Misses         3558     3558              
- Partials        477      479       +2


lintool · 2023-12-13T10:58:48Z

> This looks fine to me. Just a heads-up: we're in the process of releasing Lucene 9.9.1 to fix a couple of critical issues. If it suits, we can delay this PR a little (to, say, early next week) so as to pick up 9.9.1, or merge it now and bump to 9.9.1 when it is available.

Sounds good! I need to run a bunch of tests anyway!

lintool · 2023-12-13T12:41:40Z

Hi @ChrisHegarty can you check if I'm using Lucene99HnswScalarQuantizedVectorsFormat correctly?

Here are my preliminary experiments on cosDPR-distil on MS MARCO passage dev:

Default (fp32):

# Total 8,841,823 documents indexed in 00:54:47
# 6980 queries processed in 00:02:57 = ~39.42 q/s
# 26G	indexes/lucene-hnsw.msmarco-passage-cos-dpr-distil

Quantized (int8):

# Total 8,841,823 documents indexed in 00:45:26
# 6980 queries processed in 00:01:41 = ~68.59 q/s
# 33G	lucene-hnsw.msmarco-passage-cos-dpr-distil

QPS increased, but index got bigger?!

Not sure if it matters, but I'm giving iwc.setRAMBufferSizeMB a generous 64GB buffer. This is on my Mac Studio with M1 Ultra processor.

jpountz · 2023-12-13T12:57:32Z

Wow, nice speedup. The bigger index is expected, as we're storing the int8 quantized vectors side-by-side with the float32 vectors, so there is a ~25% increase in storage requirements. 26 * 1.25 = 32.5, which seems aligned with what you are observing. But only the quantized vectors are used at search time; the raw vectors don't need to be loaded in memory.
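
The ~25% figure follows from the component sizes, and a minimal sketch can check it. It assumes only what the comment states: a float32 component takes 4 bytes, an int8 component takes 1 byte, and both copies are kept on disk. The class and method names are hypothetical, and the sketch treats the whole index as vector data, ignoring the HNSW graph and other index files.

```java
public class QuantizedStorageEstimate {
    // Bytes per vector component for each encoding.
    static final int FLOAT32_BYTES = 4;
    static final int INT8_BYTES = 1;

    /** Ratio of (raw + quantized) vector storage to raw-only storage. */
    static double overheadFactor() {
        return (double) (FLOAT32_BYTES + INT8_BYTES) / FLOAT32_BYTES;
    }

    /** Scales an observed fp32 index size (in GB) by the overhead factor. */
    static double estimateQuantizedSizeGb(double fp32SizeGb) {
        return fp32SizeGb * overheadFactor();
    }

    public static void main(String[] args) {
        double fp32IndexGb = 26.0; // size reported in this thread for the fp32 index
        System.out.printf("overhead factor: %.2f%n", overheadFactor());
        System.out.printf("estimated int8 index: %.1f GB%n",
                estimateQuantizedSizeGb(fp32IndexGb));
    }
}
```

For the 26G fp32 index reported earlier, the estimate is 32.5 GB, close to the 33G observed for the int8 index.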

lintool · 2023-12-13T14:07:30Z

@jpountz Ah, understood - thanks for the explanation. I'll run more experiments and then report back results.

ChrisHegarty · 2023-12-13T14:14:04Z

@lintool Do your benchmarks use a recent JDK (>= JDK 20)? If so, can you confirm that the JDK Panama Vector API module is present at runtime (--add-modules=jdk.incubator.vector)? Lucene will use it, if present, to improve the speed of vector distance computations. On an M1, it'll use 128-bit NEON instructions for floating-point arithmetic, and AVX2/AVX-512 on x64.
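
One way to confirm the module is present is to ask the running JVM directly. The sketch below uses the standard `ModuleLayer` API (Java 9+); the class and method names are hypothetical. It only reports whether `jdk.incubator.vector` was resolved at startup, not whether Lucene actually selected its vectorized code path.

```java
import java.util.Optional;

public class VectorModuleCheck {
    /** Returns true if the named module is resolved in the boot layer of this JVM. */
    static boolean isModulePresent(String moduleName) {
        Optional<Module> module = ModuleLayer.boot().findModule(moduleName);
        return module.isPresent();
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is the major version, e.g. 11 or 21.
        System.out.println("Java feature version: " + Runtime.version().feature());
        System.out.println("jdk.incubator.vector present: "
                + isModulePresent("jdk.incubator.vector"));
    }
}
```

Running it once with and once without `--add-modules=jdk.incubator.vector` on the `java` command line should show the difference on JDKs that ship the incubator module.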

lintool · 2023-12-13T14:17:17Z

> @lintool Do your benchmarks use a recent JDK (>= JDK 20)? If so, can you confirm that the JDK Panama Vector API module is present at runtime (--add-modules=jdk.incubator.vector)? Lucene will use it, if present, to improve the speed of vector distance computations. On an M1, it'll use 128-bit NEON instructions for floating-point arithmetic, and AVX2/AVX-512 on x64.

Nope, this is still JDK 11. Will queue this up as something to look at!

lintool · 2023-12-13T17:11:39Z

hey @jpountz @ChrisHegarty for HNSW indexing, how do I set the index writer config to generate larger segments? I'm setting config.setRAMBufferSizeMB to 64GB, but looking at the index, I only see segments of 1.5GB.

jpountz · 2023-12-13T17:15:40Z

You can't: Lucene has a per-segment limit of ~2GB in memory (which then usually translates into a smaller size on disk). https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)
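
To make the limit concrete: an indexer that exposes this setting as a command-line option could clamp the requested value before passing it to `setRAMPerThreadHardLimitMB`. The sketch below is a hypothetical helper, not code from this PR. The bounds (greater than 0 and less than 2048 MB) are an assumption based on the linked Javadoc and should be checked against it.

```java
public class PerThreadLimitArg {
    // Assumed valid range for setRAMPerThreadHardLimitMB: 0 < value < 2048 (MB).
    static final int MIN_MB = 1;
    static final int MAX_MB = 2047;

    /** Clamps a user-requested per-thread limit (in MB) into the assumed valid range. */
    static int clampPerThreadLimitMb(long requestedMb) {
        return (int) Math.max(MIN_MB, Math.min(MAX_MB, requestedMb));
    }

    public static void main(String[] args) {
        long requestedMb = 64L * 1024; // a 64GB request, expressed in MB
        // The request exceeds the assumed ceiling, so it is reduced to MAX_MB.
        System.out.println("per-thread limit (MB): " + clampPerThreadLimitMb(requestedMb));
    }
}
```

Under that assumption, a 64GB request is reduced to 2047 MB, which is why a large `setRAMBufferSizeMB` value alone does not produce segments larger than about 2GB.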

lintool · 2023-12-13T17:18:41Z

> You can't: Lucene has a per-segment limit of ~2GB in memory (which then usually translates into a smaller size on disk). https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)

I see. Thanks for the quick response!

lintool · 2023-12-14T14:08:56Z

FYI @jpountz @ChrisHegarty more thorough experiments documented in #2292

tteofili

LGTM

tteofili · 2023-12-18T16:13:09Z

src/main/java/io/anserini/index/IndexCollection.java

@@ -190,7 +177,7 @@ public IndexCollection(Args args) throws Exception {
    }

    final Directory dir = FSDirectory.open(Paths.get(args.index));
-    final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer());
+    final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer()).setCodec(new Lucene99Codec());


Lucene99Codec is the default codec in Lucene 9.9, so I guess it is fine not to specify it manually.

tteofili · 2023-12-18T16:14:00Z

src/main/java/io/anserini/index/IndexInvertedDenseVectors.java

@@ -104,7 +100,7 @@ public IndexInvertedDenseVectors(Args args) {

    try {
      final Directory dir = FSDirectory.open(Paths.get(args.index));
-      final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene95Codec());
+      final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene99Codec());


Same as above: we can probably avoid setting the codec manually.

Upgrade to Lucene 9.9

beccd4e

lintool marked this pull request as draft December 13, 2023 09:54

lintool mentioned this pull request Dec 13, 2023

Upgrade to Lucene 9.9.0 #2290

Closed

lintool changed the title ~~Upgrade to Lucene 9.9~~ Upgrade to Lucene 9.9.0 Dec 13, 2023

Add int8 option for hnsw.

4d6ae03

Add int8 regression.

4473ba6

lintool added 3 commits December 13, 2023 12:30

Tweak to HNSW iwc.

0ab46f8

Add config.

04043ca

Added -noMerge option.

0b9f79e

lintool mentioned this pull request Dec 14, 2023

Lucene 9.9: Benchmark HNSW improvements #2292

Closed

lintool added 3 commits December 14, 2023 11:04

Updated regressions yaml files.

9301474

Added docs.

cef6d36

Upgrade to 9.9.1; made setRAMPerThreadHardLimitMB settable.

a1da98d

lintool changed the title ~~Upgrade to Lucene 9.9.0~~ Upgrade to Lucene 9.9.1 Dec 17, 2023

lintool added 2 commits December 17, 2023 10:19

Merge branch 'master' into lucene9.9

f53acbe

Less strict score matching on HNSW tests.

ae8c68d

added script/doc bindings for regressions; no openai-int8

e6253e7

lintool requested a review from tteofili December 18, 2023 11:59

tteofili approved these changes Dec 18, 2023

Addressed CR. added note of errors to openai int8 indexes.

2b6e14a

lintool marked this pull request as ready for review December 19, 2023 18:53

lintool merged commit 883539b into master Dec 19, 2023
1 check passed

lintool deleted the lucene9.9 branch December 19, 2023 18:56

lintool mentioned this pull request Dec 20, 2023

Upgrade to Lucene 9.9 #2288

Closed
