[Feature Request] Make StreamContext and WriteContext more extensible #17245
Comments
Tagging some folks who worked on #7000 for feedback: @raghuvanshraj @vikasvb90 @Bukhtawar @ashking94 @reta @gbbafna
@jed326 This statement is incorrect. There's a
Thanks @vikasvb90! I somehow missed that.
The rest I am already handling: as you said, instead of backing the stream with a file I am backing it with a memory buffer to limit the memory consumption (for example). To that point, though, today the part size in question is calculated by the specific vendor implementation:

Lines 331 to 344 in c328c18

Are there any other ways to control this part size? Since our
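To illustrate what I mean by backing the stream with a memory buffer rather than a file, here is a rough sketch. The class and method names are placeholders, not the actual StreamContext API, and the part size fed in would still come from the vendor calculation referenced above:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/**
 * Illustrative only: splits an in-memory buffer into fixed-size parts and
 * supplies an InputStream per part, mirroring what the file-backed stream
 * supplier does with file offsets today.
 */
class InMemoryPartSupplier {
    private final byte[] buffer;
    private final long partSize;

    InMemoryPartSupplier(byte[] buffer, long partSize) {
        this.buffer = buffer;
        this.partSize = partSize;
    }

    int numberOfParts() {
        return (int) ((buffer.length + partSize - 1) / partSize);
    }

    /** Returns a stream over the given 0-based part number. */
    InputStream streamForPart(int partNumber) {
        int offset = (int) (partNumber * partSize);
        int length = (int) Math.min(partSize, buffer.length - offset);
        return new ByteArrayInputStream(buffer, offset, length);
    }
}
```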
Talking specifically about the S3 plugin,
To add, there is a priority order defined to distinguish between uploads of remote cluster state (URGENT), remote translog (HIGH), remote segments (NORMAL), and large files (>15 GB) plus background jobs (LOW). A semaphore is acquired for NORMAL and LOW priority transfers. I believe your use case falls into NORMAL priority, so you should be good to create any number of parts.
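To make the tiers concrete, roughly the following. This is illustrative only: the constant names mirror the priority order described above, but this is not the actual transfer code.

```java
/** Illustrative mapping of upload types to priorities as described above. */
enum UploadPriority {
    URGENT,  // remote cluster state
    HIGH,    // remote translog
    NORMAL,  // remote segments (the k-NN flat vector upload would land here)
    LOW;     // large files (>15 GB) and background jobs

    /** The permit-limiting semaphore is only acquired for the lower two tiers. */
    boolean acquiresSemaphore() {
        return this == NORMAL || this == LOW;
    }
}
```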
As you said, this would only happen in the force merge case, and we wouldn't expect it to be a common occurrence. Rough math: a 1k-dimension fp32 vector takes 4k bytes to store, so with 10M documents the vectors would take up 40 GB.
Thanks for the pointers on the semaphore and the priority transfers as well; will keep those in mind in our perf tuning/analysis.
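Spelling that rough math out (assuming 1,000-dimension fp32 vectors and 10 million documents):

```java
public class VectorSizeEstimate {
    public static void main(String[] args) {
        long dimensions = 1_000;       // vector dimension
        long bytesPerValue = 4;        // fp32
        long documents = 10_000_000;   // doc count

        long bytesPerVector = dimensions * bytesPerValue;  // 4,000 bytes per vector
        long totalBytes = bytesPerVector * documents;      // 40,000,000,000 bytes
        System.out.printf("~%.1f GB of raw vector data%n", totalBytes / 1e9);  // ~40.0 GB
    }
}
```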
I think my questions are answered here, so closing this issue. Thanks again @vikasvb90!
Is your feature request related to a problem? Please describe
I am working on a feature in the k-NN plugin (see: opensearch-project/k-NN#2465) for which I want to upload flat vectors to a remote repository using the `asyncBlobUpload` path, which requires both a `WriteContext` and a corresponding `StreamContext`.

Today `asyncBlobUpload` is used by the remote store to copy segment files to the remote store after a refresh. In my use case, however, I would like to write the flat vector files to remote object storage during either the flush or merge operation, so there is no existing file on disk yet, which is one of the assumptions made by `WriteContext`.
Describe the solution you'd like
I would like to convert both `WriteContext` and `StreamContext` into interfaces and refactor the current concrete implementations into `RemoteStoreWriteContext` and `RemoteStoreStreamContext`, so that I can provide my own custom implementations of the interfaces in the k-NN plugin without the assumptions made by the existing classes.
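Roughly the shape I have in mind, as a sketch. The method names below are illustrative placeholders, not the current `WriteContext`/`StreamContext` API:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of the proposal: the async upload path depends on interfaces, and the
// existing file-backed behavior becomes one implementation of them.
interface WriteContext {
    String getBlobName();                      // placeholder accessors
    long getContentLength();
    StreamContext getStreamContext(long partSize);
}

interface StreamContext {
    int getNumberOfParts();
    InputStream provideStream(int partNumber) throws IOException;
}

// The current concrete classes would become RemoteStoreWriteContext /
// RemoteStoreStreamContext, while the k-NN plugin could supply implementations
// backed by an in-memory flat vector buffer instead of a file on disk.
```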
Related component
Storage:Remote
Describe alternatives you've considered
Both `WriteContext` and `StreamContext` are already public classes today, so I can still extend them from the k-NN plugin. However, this isn't ideal: for example, I would still need to provide a `fileName` to the `super()` call of `WriteContext`. Additionally, these classes are annotated as `@opensearch.internal` today, which implies we probably should not be extending them like this.
Additional context
Other related design docs for the k-NN feature: