Add an S3-based directory. #13868
Comments
Nice idea @jpountz! I'll try spending some time working on this.
@jpountz Interestingly, I have been working on this and was about to open an issue. If it's OK, I will self-assign this?
@albogdano I'm curious if you have any interest in contributing your https://github.com/albogdano/lucene-s3directory? @shubhamvishu @atris Thanks for volunteering to help! I'm keen on checking if @albogdano has interest in contributing first, but even if we went that route I'm sure we'll need help to properly support new Directory APIs like |
@jpountz Yes! How can I help you guys? My knowledge of Lucene internals is quite limited and the goal of the |
I'm thinking of a PR that would create a new
What is your gut feeling: should we rather start with your code and iterate on it to make it production-ready, or would it be easier/better to start from scratch, just taking inspiration from your existing code?
You will save some time if you iterate on my code as it already implements the boring parts for integrating with the AWS SDK (version 2.x is used btw). All I did was to clone Shay's |
Sounds good. Would you like to work on the PR?
Yes, of course! Are there any requirements for the PR? It would be a fairly large chunk of code for a single PR and I'm not sure if that's allowed. Should I just add the code to a new branch and push for review?
No special requirements, you may just need to adjust formatting (running |
I've been thinking about this for a bit. In addition to an S3-based directory, I believe there could be some benefit from defining an S3 (or other object store) codec inspired by Parquet. That is, the existing Lucene formats are "pure" column-stride (i.e. fields are contiguous). If we split things into "row groups", I believe we could reduce the need for random reads and prefetches. I've been thinking of giving that a try. It's complementary to the object store directory itself, so it can be worked on independently.
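To make the row-group idea above concrete, here is a toy model, not Lucene's or Parquet's actual file layout. It assumes fixed-width values and a flat file, and every name in it (`column_stride_reads`, `row_group_reads`, `merge_ranges`) is hypothetical. It only counts how many ranged reads an object store would have to serve to fetch a few fields for a small run of documents under each layout.

```python
def merge_ranges(ranges):
    """Coalesce overlapping or adjacent (start, end) byte ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def column_stride_reads(num_docs, width, fields, doc_start, doc_end):
    """Assumed layout: each field is one contiguous run covering all docs."""
    ranges = []
    for f in fields:
        base = f * num_docs * width
        ranges.append((base + doc_start * width, base + doc_end * width))
    return merge_ranges(ranges)


def row_group_reads(num_fields, width, group_size, doc_start, doc_end):
    """Assumed layout: docs are split into groups that hold every field.

    Each touched group is fetched whole, so unneeded fields are over-read.
    """
    group_bytes = num_fields * group_size * width
    first = doc_start // group_size
    last = (doc_end - 1) // group_size
    return merge_ranges(
        [(g * group_bytes, (g + 1) * group_bytes) for g in range(first, last + 1)]
    )


# Read fields 0, 3 and 7 (out of 8) for docs [1000, 1100); values are 8 bytes.
cols = column_stride_reads(1_000_000, 8, [0, 3, 7], 1000, 1100)
rows = row_group_reads(8, 8, 4096, 1000, 1100)
print(len(cols), sum(end - start for start, end in cols))
print(len(rows), sum(end - start for start, end in rows))
```

With these made-up numbers the column-stride layout needs three separate ranged reads of 800 bytes each, while the row-group layout needs a single read of 262,144 bytes. The model shows the trade-off rather than a guaranteed win: fewer round trips to the object store, paid for with bytes that are fetched but not used.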
So many projects (the latest one I've heard of is Nixiesearch, presented at Haystack) are trying to read Lucene indexes from S3. Should we provide an S3-based directory (and directories for other object stores) in Lucene directly?