Add an S3-based directory. #13868

Open
jpountz opened this issue Oct 7, 2024 · 10 comments

Comments

@jpountz
Contributor

jpountz commented Oct 7, 2024

So many projects (the latest one I've heard of is Nixiesearch, presented at Haystack) are trying to read Lucene indexes from S3. Should we provide an S3-based directory (and equivalents for other object stores) in Lucene directly?

@shubhamvishu
Contributor

Nice idea @jpountz! I'll try spending some time working on this.

@atris
Contributor

atris commented Oct 15, 2024

@jpountz Interestingly, I have been working on this and was about to open an issue. If it's OK, may I self-assign this?

@jpountz
Contributor Author

jpountz commented Oct 15, 2024

@albogdano I'm curious if you have any interest in contributing your https://github.com/albogdano/lucene-s3directory?

@shubhamvishu @atris Thanks for volunteering to help! I'd like to check first whether @albogdano is interested in contributing, but even if we go that route, I'm sure we'll need help to properly support newer Directory APIs like IndexInput#prefetch, and to support the GCP and Azure counterparts of S3.
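
For readers unfamiliar with why a prefetch hint matters on an object store, here is a minimal, self-contained sketch. It does not use Lucene or the AWS SDK: `PrefetchSketch`, `BLOCK_SIZE`, `rangedGets`, and the in-memory byte array standing in for a remote object are all hypothetical names chosen for illustration. It only assumes that a prefetch call takes an offset and a length, and shows how such a hint could let one ranged request replace several block-by-block fetches.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not Lucene's IndexInput and not an S3 client.
class PrefetchSketch {
    static final int BLOCK_SIZE = 4;            // tiny block size, for demonstration
    private final byte[] object;                // stands in for a remote object
    private final Map<Long, byte[]> cache = new HashMap<>();
    int rangedGets = 0;                         // counts simulated ranged requests

    PrefetchSketch(byte[] object) {
        this.object = object;
    }

    // Simulates ONE ranged request covering blocks firstBlock..lastBlock (inclusive).
    private void fetchBlocks(long firstBlock, long lastBlock) {
        rangedGets++;
        for (long b = firstBlock; b <= lastBlock; b++) {
            int start = (int) (b * BLOCK_SIZE);
            int end = Math.min(object.length, start + BLOCK_SIZE);
            cache.put(b, Arrays.copyOfRange(object, start, end));
        }
    }

    // Hint that bytes [offset, offset + length) will be read soon.
    void prefetch(long offset, long length) {
        if (length <= 0) {
            return;
        }
        if (offset < 0 || offset + length > object.length) {
            throw new IllegalArgumentException("range out of bounds");
        }
        fetchBlocks(offset / BLOCK_SIZE, (offset + length - 1) / BLOCK_SIZE);
    }

    // Reads one byte, fetching its block on demand if it is not cached.
    byte readByte(long pos) {
        long block = pos / BLOCK_SIZE;
        if (!cache.containsKey(block)) {
            fetchBlocks(block, block);
        }
        return cache.get(block)[(int) (pos % BLOCK_SIZE)];
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        PrefetchSketch withHint = new PrefetchSketch(data);
        withHint.prefetch(2, 9);                // covers blocks 0..2 in one request
        withHint.readByte(2);
        withHint.readByte(6);
        withHint.readByte(10);

        PrefetchSketch withoutHint = new PrefetchSketch(data);
        withoutHint.readByte(2);                // each read fetches its own block
        withoutHint.readByte(6);
        withoutHint.readByte(10);

        System.out.println("requests with prefetch: " + withHint.rangedGets);
        System.out.println("requests without prefetch: " + withoutHint.rangedGets);
    }
}
```

A real directory would issue the ranged request asynchronously and bound the cache size; both are left out here to keep the sketch short.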

@albogdano

@jpountz Yes! How can I help you guys? My knowledge of Lucene internals is quite limited and the goal of the lucene-s3directory was mainly to be a proof of concept.

@jpountz
Contributor Author

jpountz commented Oct 15, 2024

I'm thinking of a PR that would create a new lucene/directory/s3 module where we'd check in the code.

> proof of concept

What is your gut feeling: should we rather start with your code and iterate on it to make it production-ready, or would it be easier/better to start from scratch, just taking inspiration from your existing code?

@albogdano

You will save some time if you iterate on my code as it already implements the boring parts for integrating with the AWS SDK (version 2.x is used btw). All I did was to clone Shay's JDBCDirectory from Compass and make it work with S3. There are no extra features.

@jpountz
Contributor Author

jpountz commented Oct 15, 2024

Sounds good. Would you like to work on the PR?

@albogdano

Yes, of course! Are there any requirements for the PR? It would be a fairly large chunk of code for a single PR and I'm not sure if that's allowed. Should I just add the code to a new branch and push for review?

@jpountz
Contributor Author

jpountz commented Oct 15, 2024

No special requirements. You may just need to adjust formatting (by running ./gradlew tidy) and make sure the code conforms to the other requirements checked by the build, such as forbidden APIs.

@msfroh
Contributor

msfroh commented Oct 15, 2024

I've been thinking about this for a bit. In addition to an S3-based directory, I believe there could be some benefit from defining an S3 (or other object store) codec inspired by Parquet.

That is, the existing Lucene formats are "pure" column-stride (i.e. fields are contiguous). If we split things into "row groups", I believe we could reduce the need for random reads and prefetches. I've been thinking of giving that a try. It's complementary to the object store directory itself, so it can be worked on independently.
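
To make the row-group intuition concrete, here is a toy model, not a description of any existing Lucene or Parquet code. It assumes (my simplification) that in a column-stride layout each field needs its own ranged read, while in a row-group layout all fields for a group of documents are adjacent and can be fetched with one ranged read per group touched. `LayoutSketch`, `columnStrideReads`, `rowGroupReads`, and `docsPerGroup` are hypothetical names.

```java
// Toy model only: counts ranged reads under two simplified on-disk layouts.
class LayoutSketch {

    // Column-stride: each field is one contiguous region for the whole segment,
    // so reading numFields fields needs one ranged read per field.
    static int columnStrideReads(int numFields) {
        return numFields;
    }

    // Row groups: docs are split into fixed-size groups, and every field's data
    // for a group is stored together, so one ranged read per group touched.
    static int rowGroupReads(int docsPerGroup, int firstDoc, int lastDoc) {
        if (docsPerGroup <= 0 || firstDoc < 0 || lastDoc < firstDoc) {
            throw new IllegalArgumentException("invalid doc range or group size");
        }
        int firstGroup = firstDoc / docsPerGroup;
        int lastGroup = lastDoc / docsPerGroup;
        return lastGroup - firstGroup + 1;
    }

    public static void main(String[] args) {
        int numFields = 8;
        int docsPerGroup = 4096;
        // Read 8 fields for docs 1000..1099, which all fall in the first group.
        System.out.println("column-stride reads: " + columnStrideReads(numFields));
        System.out.println("row-group reads: " + rowGroupReads(docsPerGroup, 1000, 1099));
    }
}
```

The model ignores read sizes: a row-group read may pull in bytes for fields or documents that are not needed, which is the trade-off against fewer round trips.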
