Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Lucene 9.9.1 #2302

Merged
merged 13 commits into from
Dec 19, 2023
Merged

Upgrade to Lucene 9.9.1 #2302

merged 13 commits into from
Dec 19, 2023

Conversation

lintool
Copy link
Member

@lintool lintool commented Dec 13, 2023

WIP, not ready for review.
All Unit tests pass.

@ChrisHegarty @tteofili @jpountz

@lintool lintool marked this pull request as draft December 13, 2023 09:54
@lintool lintool mentioned this pull request Dec 13, 2023
@lintool lintool changed the title Upgrade to Lucene 9.9 Upgrade to Lucene 9.9.0 Dec 13, 2023
@ChrisHegarty
Copy link
Contributor

This looks fine to me. Just a heads up, we're in the process of releasing a Lucene 9.9.1 to fix a couple of critical issues. If it suits we can delay this PR a little (to say early next week) so as to pickup 9.9.1. Or merge it and bump to 9.9.1 when it is available.

Copy link

codecov bot commented Dec 13, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (2c14a49) 64.29% compared to head (e6253e7) 64.36%.
Report is 1 commits behind head on master.

❗ Current head e6253e7 differs from pull request most recent head 2b6e14a. Consider uploading reports for the commit 2b6e14a to get more accurate results

Files Patch % Lines
.../java/io/anserini/index/IndexHnswDenseVectors.java 81.08% 4 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2302      +/-   ##
============================================
+ Coverage     64.29%   64.36%   +0.07%     
- Complexity     1333     1334       +1     
============================================
  Files           203      203              
  Lines         11300    11328      +28     
  Branches       1426     1429       +3     
============================================
+ Hits           7265     7291      +26     
  Misses         3558     3558              
- Partials        477      479       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

This looks fine to me. Just a heads up, we're in the process of releasing a Lucene 9.9.1 to fix a couple of critical issues. If it suits we can delay this PR a little (to say early next week) so as to pickup 9.9.1. Or merge it and bump to 9.9.1 when it is available.

Sounds good! I need to run a bunch of tests anyway!

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

Hi @ChrisHegarty can you check if I'm using Lucene99HnswScalarQuantizedVectorsFormat correctly?

Here are my preliminary experiments on cosDPR-distil on MS MARCO passage dev -

Default (fp32):

# Total 8,841,823 documents indexed in 00:54:47
# 6980 queries processed in 00:02:57 = ~39.42 q/s
# 26G	indexes/lucene-hnsw.msmarco-passage-cos-dpr-distil

Quantized (int8):

# Total 8,841,823 documents indexed in 00:45:26
# 6980 queries processed in 00:01:41 = ~68.59 q/s
# 33G	lucene-hnsw.msmarco-passage-cos-dpr-distil

QPS increased, but index got bigger?!

Not sure if it matters, but I'm giving iwc.setRAMBufferSizeMB a generous 64GB buffer. This is on my Mac Studio with M1 Ultra processor.

@jpountz
Copy link
Contributor

jpountz commented Dec 13, 2023

Wow, nice speedup. The bigger index is expected, as we're storing the int8 quantized vectors side-by-side with the float32 vectors, so there is a ~25% increase of storage requirements. 25 * 1.25 = 32.5, which seems aligned with what you are observing. But only the quantized vectors are used at search time, the raw vectors don't need to be loaded in memory.

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

@jpountz Ah, understood - thanks for the explanation. I'll run more experiments and then report back results.

@ChrisHegarty
Copy link
Contributor

ChrisHegarty commented Dec 13, 2023

@lintool Do your benchmarks use a recent JDK (>= JDK 20)? If so, can you confirm that the JDK Panama Vector API module is present at runtime (--add-modules=jdk.incubator.vector). Lucene will use it if present, to improve the speed of vector distance computations. On an M1, it'll use 128 bit Neon instructions for floating point arithmetic, while AVX 2/512 on x64.

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

@lintool Do your benchmarks use a recent JDK (>= JDK 20)? If so, can you confirm that the JDK Panama Vector API module is present at runtime (--add-modules=jdk.incubator.vector). Lucene will use it if present, to improve the speed of vector distance computations. On an M1, it'll use 128 bit Neon instructions for floating point arithmetic, while AVX 2/512 on x64.

Nope, this is still JDK 11. Will queue this up as something to look at!

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

hey @jpountz @ChrisHegarty for HNSW indexing, how do I set the index writer config to generate larger segments? I'm setting config.setRAMBufferSizeMB to 64GB, but looking at the index, I only see segments of 1.5GB.

@jpountz
Copy link
Contributor

jpountz commented Dec 13, 2023

You can't, Lucene has a per-segment limit of ~2GB in memory (which then usually translates into a smaller size on disk). https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)

@lintool
Copy link
Member Author

lintool commented Dec 13, 2023

You can't, Lucene has a per-segment limit of ~2GB in memory (which then usually translates into a smaller size on disk). https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)

I see. Thanks for the quick response!

@lintool
Copy link
Member Author

lintool commented Dec 14, 2023

FYI @jpountz @ChrisHegarty more thorough experiments documented in #2292

@lintool lintool changed the title Upgrade to Lucene 9.9.0 Upgrade to Lucene 9.9.1 Dec 17, 2023
@lintool lintool requested a review from tteofili December 18, 2023 11:59
Copy link
Collaborator

@tteofili tteofili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -190,7 +177,7 @@ public IndexCollection(Args args) throws Exception {
}

final Directory dir = FSDirectory.open(Paths.get(args.index));
final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer());
final IndexWriterConfig config = new IndexWriterConfig(getAnalyzer()).setCodec(new Lucene99Codec());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lucene99 is the default in Lucene 9.9, so I guess it is fine not to specify it manually.

@@ -104,7 +100,7 @@ public IndexInvertedDenseVectors(Args args) {

try {
final Directory dir = FSDirectory.open(Paths.get(args.index));
final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene95Codec());
final IndexWriterConfig config = new IndexWriterConfig(analyzer).setCodec(new Lucene99Codec());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, we can probably avoid setting the codec manually.

@lintool lintool marked this pull request as ready for review December 19, 2023 18:53
@lintool lintool merged commit 883539b into master Dec 19, 2023
1 check passed
@lintool lintool deleted the lucene9.9 branch December 19, 2023 18:56
@lintool lintool mentioned this pull request Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants