Add an S3-based directory. #13868

jpountz · 2024-10-07T15:47:13Z

Many projects (the latest one I've heard of is Nixiesearch, presented at Haystack) are trying to read Lucene indexes from S3. Should we provide an S3-based directory (and equivalents for other object stores) in Lucene directly?

shubhamvishu · 2024-10-12T19:26:04Z

Nice idea @jpountz! I'll try spending some time working on this.

atris · 2024-10-15T05:12:32Z

@jpountz Interestingly, I have been working on this and was about to open an issue. If it's OK, I will self-assign this?

jpountz · 2024-10-15T07:26:32Z

@albogdano I'm curious if you have any interest in contributing your https://github.com/albogdano/lucene-s3directory?

@shubhamvishu @atris Thanks for volunteering to help! I'm keen on checking first whether @albogdano has interest in contributing, but even if we go that route, I'm sure we'll need help to properly support newer Directory APIs like IndexInput#prefetch, and to support the GCP and Azure counterparts of S3.
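For readers unfamiliar with why a prefetch-style API matters on an object store, here is a minimal sketch of the idea, not Lucene's actual implementation. The names `InMemoryRangeStore`, `PrefetchingInput`, and `demo` are hypothetical, and the in-memory store only stands in for ranged GET requests. It assumes that a prefetch call starts fetching a byte range in the background so that a later read of bytes inside that range does not pay for a second round trip.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryRangeStore:
    """Hypothetical stand-in for an object store that serves byte ranges."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self._lock = threading.Lock()
        self.requests = 0  # number of simulated ranged GETs

    def get_range(self, offset: int, length: int) -> bytes:
        with self._lock:
            self.requests += 1
        return self._blob[offset:offset + length]


class PrefetchingInput:
    """Toy reader: prefetch() starts a ranged fetch in the background and
    read() reuses it when the requested bytes fall inside a prefetched range."""

    def __init__(self, store, pool):
        self._store = store
        self._pool = pool
        self._pending = {}  # (offset, length) -> Future[bytes]

    def prefetch(self, offset: int, length: int) -> None:
        key = (offset, length)
        if key not in self._pending:
            self._pending[key] = self._pool.submit(
                self._store.get_range, offset, length
            )

    def read(self, offset: int, length: int) -> bytes:
        for (start, size), future in self._pending.items():
            if start <= offset and offset + length <= start + size:
                data = future.result()  # waits only if the fetch is still running
                return data[offset - start:offset - start + length]
        # No prefetched range covers the request: fetch synchronously.
        return self._store.get_range(offset, length)


def demo():
    store = InMemoryRangeStore(bytes(range(256)))
    with ThreadPoolExecutor(max_workers=4) as pool:
        reader = PrefetchingInput(store, pool)
        # Announce both ranges up front so the two fetches can overlap.
        reader.prefetch(16, 32)
        reader.prefetch(128, 32)
        first = reader.read(20, 4)
        second = reader.read(128, 2)
    return first, second, store.requests
```

In this toy model the two reads are served by the two background fetches, so the store sees two requests that can run concurrently rather than one blocking request per read.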

albogdano · 2024-10-15T08:35:26Z

@jpountz Yes! How can I help you guys? My knowledge of Lucene internals is quite limited and the goal of the lucene-s3directory was mainly to be a proof of concept.

jpountz · 2024-10-15T08:46:35Z

I'm thinking of a PR that would create a new lucene/directory/s3 module where we'd check in the code.

> proof of concept

What is your gut feeling: should we rather start with your code and iterate on it to make it production-ready, or would it be easier/better to start from scratch, just taking inspiration from your existing code?

albogdano · 2024-10-15T08:57:08Z

You will save some time if you iterate on my code as it already implements the boring parts for integrating with the AWS SDK (version 2.x is used btw). All I did was to clone Shay's JDBCDirectory from Compass and make it work with S3. There are no extra features.

jpountz · 2024-10-15T10:11:09Z

Sounds good. Would you like to work on the PR?

albogdano · 2024-10-15T11:17:38Z

Yes, of course! Are there any requirements for the PR? It would be a fairly large chunk of code for a single PR and I'm not sure if that's allowed. Should I just add the code to a new branch and push for review?

jpountz · 2024-10-15T11:59:42Z

No special requirements. You may just need to adjust formatting (by running ./gradlew tidy) and make sure the code conforms with the other requirements that are checked by the build, like forbidden APIs.

msfroh · 2024-10-15T16:36:16Z

I've been thinking about this for a bit. In addition to an S3-based directory, I believe there could be some benefit from defining an S3 (or other object store) codec inspired by Parquet.

That is, the existing Lucene formats are "pure" column-stride (i.e. fields are contiguous). If we split things into "row groups", I believe we could reduce the need for random reads and prefetches. I've been thinking of giving that a try. It's complementary to the object store directory itself, so it can be worked on independently.
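To make the row-group argument concrete, here is a small sketch under simplifying assumptions that are not from the thread: the file is modeled as a sequence of equally sized chunks, one per (field, row group) pair, and each run of adjacent chunks costs one ranged read. The function names and the layout labels are hypothetical.

```python
def chunk_order(num_fields, num_groups, layout):
    """Return the on-disk order of (field, group) chunks for a layout."""
    if layout == "column-stride":
        # Each field is contiguous across all row groups.
        return [(f, g) for f in range(num_fields) for g in range(num_groups)]
    if layout == "row-group":
        # Each row group holds the chunks of all fields side by side.
        return [(f, g) for g in range(num_groups) for f in range(num_fields)]
    raise ValueError(f"unknown layout: {layout}")


def ranged_reads(order, wanted):
    """Count maximal runs of adjacent wanted chunks, one ranged read each."""
    positions = sorted(i for i, chunk in enumerate(order) if chunk in wanted)
    runs = 0
    previous = None
    for position in positions:
        if previous is None or position != previous + 1:
            runs += 1
        previous = position
    return runs


# All three fields for the documents in row group 1 (of four groups).
all_fields_one_group = {(f, 1) for f in range(3)}
column = chunk_order(3, 4, "column-stride")
grouped = chunk_order(3, 4, "row-group")
print(ranged_reads(column, all_fields_one_group))   # 3
print(ranged_reads(grouped, all_fields_one_group))  # 1
```

The trade-off runs the other way for a scan of a single field across every row group: in this model the column-stride layout needs one read and the row-group layout needs one per group.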

albogdano mentioned this issue Oct 23, 2024

Add new Directory implementation for AWS S3 #13949

Open
