Is your feature request related to a problem?
The BulkIndexer allows multiple workers to be configured via NumWorkers, which fans out bulk requests to N goroutines, each managing its own bulk buffer and API requests.
There is no allocation logic for the fan-out, so there is no guarantee that multiple actions on the same document will end up in the same request. More importantly, it can break ordering: a later action may be flushed before an earlier one, depending on the state of the buffer of the worker that picked each one up.
This can cause issues for systems that process many messages for the same items, such as streams from Kafka, where many updates can occur for the same item. It makes it possible to trigger document-not-found exceptions when one message creates a document and the next message updates it, because the update's worker flushes while the create is still sitting in another worker's buffer.
If all actions for a given document ID are guaranteed to go to the same worker, they will either be in the same bulk request or at least in correctly ordered bulk requests.
What solution would you like?
Replace the existing internal queue channel with an array of channels, one per worker (NumWorkers in total).
Create a numerical hash of the DocumentID.
Use that numerical hash to determine which queue the request is added to in the Add function.
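The three steps above could look roughly like the sketch below. It is not the library's code: `item`, `shardedQueue`, `queueFor` and `closeAndDrain` are hypothetical names made up for illustration, and FNV-1a is just one possible choice of hash. The real workers would buffer and flush bulk requests instead of collecting items into a slice.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// item is a hypothetical stand-in for a bulk action; only DocumentID matters for routing.
type item struct {
	DocumentID string
	Action     string
}

// queueFor maps a document ID to a queue index in [0, numWorkers) using FNV-1a.
func queueFor(documentID string, numWorkers int) int {
	h := fnv.New32a()
	h.Write([]byte(documentID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numWorkers))
}

// shardedQueue holds one channel per worker instead of a single shared channel.
type shardedQueue struct {
	queues []chan item
}

func newShardedQueue(numWorkers, buffer int) *shardedQueue {
	q := &shardedQueue{queues: make([]chan item, numWorkers)}
	for i := range q.queues {
		q.queues[i] = make(chan item, buffer)
	}
	return q
}

// Add routes the item to the queue selected by the hash of its document ID,
// so all actions for one document reach the same worker in submission order.
func (q *shardedQueue) Add(it item) {
	q.queues[queueFor(it.DocumentID, len(q.queues))] <- it
}

// closeAndDrain closes every queue, runs one goroutine per queue, and returns
// the items each worker received, in the order it received them.
func (q *shardedQueue) closeAndDrain() [][]item {
	seen := make([][]item, len(q.queues))
	var wg sync.WaitGroup
	for i, ch := range q.queues {
		close(ch)
		wg.Add(1)
		go func(i int, ch <-chan item) {
			defer wg.Done()
			for it := range ch {
				seen[i] = append(seen[i], it) // each goroutine writes only its own slot
			}
		}(i, ch)
	}
	wg.Wait()
	return seen
}

func main() {
	q := newShardedQueue(4, 16)
	q.Add(item{DocumentID: "doc-1", Action: "create"})
	q.Add(item{DocumentID: "doc-2", Action: "create"})
	q.Add(item{DocumentID: "doc-1", Action: "update"})
	for worker, items := range q.closeAndDrain() {
		fmt.Println(worker, items)
	}
}
```

Because a channel is FIFO and each channel has exactly one consumer, the create for doc-1 is always seen before its update. Ordering across different document IDs is still not guaranteed, which matches the request.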
What alternatives have you considered?
Consuming code could handle this by setting NumWorkers to 1, but this removes the horizontal scaling of this library and can cause throughput issues.
Additionally, consuming code could create N BulkIndexer instances, each with NumWorkers set to 1, and do the hashing itself to route messages to the different bulk indexers. But it would be better if this were wrapped up in this library, as it's generic.
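For reference, the second workaround might be sketched as below. This is an illustration only: `indexer` is a hypothetical, simplified interface standing in for one BulkIndexer created with NumWorkers set to 1, its `Add` signature is an assumption and not the library's real one, and `recordingIndexer` is a fake used to show the routing.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// indexer is a hypothetical stand-in for one BulkIndexer with NumWorkers set to 1.
// The Add signature is simplified for this sketch.
type indexer interface {
	Add(documentID, action string) error
}

// recordingIndexer is a fake that records what it receives, in order.
type recordingIndexer struct {
	received []string
}

func (r *recordingIndexer) Add(documentID, action string) error {
	r.received = append(r.received, action+":"+documentID)
	return nil
}

// router fans out to N single-worker indexers by hashing the document ID.
type router struct {
	shards []indexer
}

func newRouter(shards ...indexer) *router {
	return &router{shards: shards}
}

// shardFor returns the index of the indexer responsible for a document ID.
func (r *router) shardFor(documentID string) int {
	h := fnv.New32a()
	h.Write([]byte(documentID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(len(r.shards)))
}

// Add forwards the action to the indexer chosen by the document ID hash.
func (r *router) Add(documentID, action string) error {
	return r.shards[r.shardFor(documentID)].Add(documentID, action)
}

func main() {
	fakes := []*recordingIndexer{{}, {}, {}}
	r := newRouter(fakes[0], fakes[1], fakes[2])
	for _, action := range []string{"create", "update"} {
		if err := r.Add("item-42", action); err != nil {
			fmt.Println("add failed:", err)
		}
	}
	for i, f := range fakes {
		fmt.Println(i, f.received)
	}
}
```

Every consumer would have to repeat this wrapper, and also manage closing and error handling for N indexers, which is why doing the routing inside the library seems preferable.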
Do you have any additional context?
N/A