[FEATURE] BulkIndexer - ensure that subsequent requests on the same documentID are consistently sent to the same worker #464

mikesmitharoo · 2024-01-24T14:33:53Z

Is your feature request related to a problem?

The BulkIndexer supports multiple workers via NumWorkers, which fans bulk requests out to N goroutines, each managing its own bulk buffer and API requests.

There is no allocation logic for the fan-out, so there is no guarantee that multiple actions on the same document end up in the same request. More importantly, ordering can break: a later action may be flushed before an earlier one, depending on the state of each worker's buffer.

This can cause issues for systems that process many messages for the same items, such as streams from Kafka where many updates can occur for the same item. It makes it possible to trigger document-not-found exceptions when one message creates a document and the next message updates it, because the update's worker flushes while the create is still sitting in another worker's buffer.

If all actions for a given document ID are guaranteed to go to the same worker, they will either be in the same bulk request or at least in correctly ordered bulk requests.

What solution would you like?

Replace the existing internal queue channel with an array of channels, one per worker up to NumWorkers.
Create a numerical hash of the DocumentID.
Use that numerical hash to determine which queue the request should be added to in the Add function.
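
To make the three steps above concrete, here is a minimal sketch of the routing idea. It is not the library's code: the `item` type, the `queueIndex` helper, and the choice of FNV-1a as the hash are all assumptions for illustration, and real worker goroutines are replaced by a simple drain loop.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// item is a hypothetical stand-in for a bulk indexer item; only the
// fields needed for routing are included.
type item struct {
	Action     string
	DocumentID string
}

// queueIndex maps a document ID to one of numWorkers queues.
// FNV-1a is an assumption here; any stable hash would do.
func queueIndex(documentID string, numWorkers int) int {
	h := fnv.New32a()
	h.Write([]byte(documentID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numWorkers))
}

func main() {
	const numWorkers = 4

	// One buffered channel per worker instead of a single shared queue.
	queues := make([]chan item, numWorkers)
	for i := range queues {
		queues[i] = make(chan item, 8)
	}

	items := []item{
		{Action: "create", DocumentID: "doc-1"},
		{Action: "index", DocumentID: "doc-2"},
		{Action: "update", DocumentID: "doc-1"},
	}

	// What Add would do: pick the queue from the hash of the document ID.
	for _, it := range items {
		queues[queueIndex(it.DocumentID, numWorkers)] <- it
	}

	// Drain each queue in order; both doc-1 actions come out of the
	// same queue, create before update.
	for i, q := range queues {
		close(q)
		for it := range q {
			fmt.Printf("queue %d: %s %s\n", i, it.Action, it.DocumentID)
		}
	}
}
```

One open question this sketch does not answer is how to route items that have no DocumentID (for example, index actions with auto-generated IDs); since ordering does not matter for those, they could presumably go to any queue.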

What alternatives have you considered?

Consuming code could handle this by setting NumWorkers to 1, but this removes the horizontal scaling of this library and can cause throughput issues.
Alternatively, consuming code could create N BulkIndexer instances, each with NumWorkers set to 1, and do the hashing itself to route messages to different bulk indexers. But it would be better if this were wrapped up in this library, as it's generic.
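
For comparison, the second workaround might look roughly like the sketch below on the consumer side. Everything here is hypothetical: `singleWorkerIndexer` merely stands in for a BulkIndexer configured with a single worker and only records what it receives, and `shardedIndexer` is an illustrative wrapper, not an existing type.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// singleWorkerIndexer is a hypothetical stand-in for a BulkIndexer
// created with NumWorkers set to 1. It only records arrival order.
type singleWorkerIndexer struct {
	received []string
}

func (s *singleWorkerIndexer) Add(action, documentID string) {
	s.received = append(s.received, action+" "+documentID)
}

// shardedIndexer routes every action for a document ID to the same
// underlying single-worker indexer.
type shardedIndexer struct {
	shards []*singleWorkerIndexer
}

func newShardedIndexer(n int) *shardedIndexer {
	s := &shardedIndexer{shards: make([]*singleWorkerIndexer, n)}
	for i := range s.shards {
		s.shards[i] = &singleWorkerIndexer{}
	}
	return s
}

func (s *shardedIndexer) Add(action, documentID string) {
	h := fnv.New32a() // hash choice is an assumption
	h.Write([]byte(documentID))
	s.shards[h.Sum32()%uint32(len(s.shards))].Add(action, documentID)
}

func main() {
	idx := newShardedIndexer(3)
	idx.Add("create", "doc-1")
	idx.Add("index", "doc-2")
	idx.Add("update", "doc-1")

	for i, shard := range idx.shards {
		fmt.Printf("shard %d: %v\n", i, shard.received)
	}
}
```

This works, but every consumer has to rebuild the same wrapper and manage N indexer lifecycles, which is the argument for doing the routing inside the library.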

Do you have any additional context?

N/A

dblock · 2024-06-17T16:27:38Z

This is a good idea. Want to try to implement this @mikesmitharoo?

Catch All Triage - 1 2 3 4 5

mikesmitharoo added the enhancement and untriaged labels Jan 24, 2024

dblock removed the untriaged label Jun 17, 2024
