[FEATURE] BulkIndexer - ensure that subsequent requests on the same documentID are consistently sent to the same worker #464

Open
mikesmitharoo opened this issue Jan 24, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@mikesmitharoo

Is your feature request related to a problem?

The BulkIndexer allows multiple workers to be configured via NumWorkers, which fans out bulk requests to N goroutines that each manage their own bulk buffer and API requests.

There is no allocation logic for the fan-out, so there is no guarantee that multiple actions on the same document will end up in the same request. More importantly, ordering can break: a subsequent action may be flushed before an earlier one, depending on the state of each worker's buffer.

This can cause issues for systems that process many messages for the same items, such as streams from Kafka, where many updates can occur for the same item. It makes it possible to trigger document-not-found exceptions when one message creates a document and the next message updates it, because the worker holding the update flushes while the create is still sitting in another worker's buffer.

If all actions for a given document ID are guaranteed to go to the same worker, they will either be in the same bulk request or at least be sent in the correct order across bulk requests.

What solution would you like?

  • Replace the existing internal queue channel with an array of channels, one per worker up to NumWorkers.
  • Create a numerical hash of the DocumentID.
  • Use that numerical hash to determine which queue the request is added to in the Add function.
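
For illustration, the three steps above could look roughly like the sketch below. It is not the library's code: `item`, `newQueues`, `workerIndex`, and `add` are hypothetical names, and FNV-1a is just one convenient hash from the standard library. It assumes every item carries a non-empty DocumentID; items without one would all hash to the same queue and would need separate handling (for example round-robin).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// item is a hypothetical stand-in for a bulk request item.
type item struct {
	DocumentID string
	Action     string
}

// newQueues creates one buffered channel per worker, replacing a single shared queue.
func newQueues(numWorkers, size int) []chan item {
	queues := make([]chan item, numWorkers)
	for i := range queues {
		queues[i] = make(chan item, size)
	}
	return queues
}

// workerIndex hashes the document ID (FNV-1a, an assumption) and maps it
// to a queue index in [0, numWorkers).
func workerIndex(documentID string, numWorkers int) int {
	h := fnv.New32a()
	h.Write([]byte(documentID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numWorkers))
}

// add routes the item to the queue selected by its document ID and
// returns the chosen index so callers can inspect the routing.
func add(queues []chan item, it item) int {
	idx := workerIndex(it.DocumentID, len(queues))
	queues[idx] <- it
	return idx
}

func main() {
	queues := newQueues(4, 8)
	create := add(queues, item{DocumentID: "doc-1", Action: "create"})
	update := add(queues, item{DocumentID: "doc-1", Action: "update"})
	fmt.Println("same queue:", create == update)
}
```

Because each channel is FIFO and is drained by a single worker, two actions on the same DocumentID reach that worker in the order they were added.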

What alternatives have you considered?

  • Consuming code could handle this by setting NumWorkers to 1, but this removes the horizontal scaling of this library and can cause throughput issues.
  • Alternatively, consuming code could create N BulkIndexer instances, each with NumWorkers set to 1, and do the hashing itself to route messages to a bulk indexer based on the hash. But it would be better if this were wrapped up in this library, as it's generic.

Do you have any additional context?

N/A

@mikesmitharoo added the enhancement (New feature or request) and untriaged labels Jan 24, 2024
@dblock
Member

dblock commented Jun 17, 2024

This is a good idea. Want to try to implement this @mikesmitharoo?

Catch All Triage - 1 2 3 4 5

@dblock removed the untriaged label Jun 17, 2024