bf: BB-419 notification: don't wait for delivery report #2421
Conversation
Adjust `batch.num.messages` and `queue.buffering.max.ms` values to improve throughput a bit. Note: those values have been applied in the field with success; they may not be the best for every customer, but they gave good results, so we apply them on that basis.
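The tuning described above amounts to a librdkafka producer configuration object; the numbers below are illustrative placeholders, since the values actually applied in the field are not stated here:

```javascript
// Sketch of tuned librdkafka producer settings; the values below are
// illustrative placeholders, not the ones actually deployed.
const producerTuning = {
    // Maximum number of messages batched into one MessageSet before
    // being sent to the broker; larger batches improve throughput.
    'batch.num.messages': 100000,
    // How long (in ms) to buffer messages waiting to fill a batch;
    // a higher value trades a little latency for better batching.
    'queue.buffering.max.ms': 100,
};
```

A producer created with these settings (e.g. via node-rdkafka) would send fewer, larger requests to the notification target, at the cost of slightly delayed delivery.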
Notification QueueProcessor doesn't wait for Kafka delivery reports anymore, in order to speed up delivery of messages. Also fix and improve logging related to processing errors and error reporting from delivery reports.
Hello jonathan-gramain,

My role is to assist you with the merge of this pull request. Status report is not available.

ping

Conflict

A conflict has been raised during the creation of the integration branch. I have not created the integration branch. Here are the steps to resolve this conflict:

$ git fetch
$ git checkout -B w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport origin/development/7.70
$ git merge origin/bugfix/BB-419-notificationShouldntWaitForDeliveryReport
$ # <intense conflict resolution>
$ git commit
$ git push -u origin w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport
Regarding the "not waiting for delivery report", I doubt this will really solve the problem:

- If the target Kafka is really slow and we simply don't wait for the report, we will still enqueue messages, and will thus hit the same issue at another layer (because everything will be queued in `librdkafka`):
  - Either there is a temporary delay: the system should be sized to accommodate the occasional delay; we should not queue too much, and it will continue without losing notifications (for that case we should keep the delivery report, and keep using `BackbeatConsumer` to avoid too many in-flight requests).
  - Or the target just cannot keep up with the flow of notifications: it may be better to just drop the extraneous messages before even trying to send them (e.g. filter "old" notifications before sending them), with a configurable max delay.
  - Or even automatically disable bucket notification to this target when we detect it is down.
- In 8.x/Artesca we compute metrics for bucket notification, which rely on the completion callback: this will create merge conflicts, but more importantly it cannot be disabled on these branches.
  - At the very least, we need to add a flag to allow this feature to be activated.
- Technically, we are still using `BackbeatConsumer`... but the rate-limiting logic gets utterly broken: if we were to go down that road, refactoring should be done to make the code simpler in that case IMHO.
On the other hand, if the issue is not that "the target cluster is too slow for the workload", but only that the `librdkafka` node binding does not make efficient use of node.js' workers (e.g. blocks one worker for the duration of the call), which kind of breaks the process: then indeed it may help, but at the cost of control (no handling of retries, no metrics...). And it may be better addressed by switching to another Kafka client library (kafkajs?)...

All in all, I think either the problem is not properly/fully documented (e.g. root cause), or we are just trying to solve a symptom...
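The "drop old notifications before sending" option mentioned above could look roughly like this; the names and the max-delay value are hypothetical placeholders, not taken from the actual code:

```javascript
// Hypothetical stale-notification filter: drop entries older than a
// configurable max delay instead of producing them to a lagging target.
const MAX_NOTIFICATION_DELAY_MS = 60 * 1000; // assumed placeholder value

function filterStaleNotifications(entries, nowMs) {
    return entries.filter(
        entry => nowMs - entry.timestampMs <= MAX_NOTIFICATION_DELAY_MS);
}

// Example: with a 60s max delay, a 2-minute-old notification is dropped.
const now = 1000000000;
const kept = filterStaleNotifications([
    { id: 'recent', timestampMs: now - 1000 },
    { id: 'stale', timestampMs: now - 120000 },
], now);
// kept contains only the 'recent' entry
```

The max delay would need to be configurable per deployment, since an acceptable notification lag depends on the customer's workload.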
After an offline discussion with @francoisferrand , it looks better to change the way we solve this problem. The initial approach of not waiting for delivery reports, while clearly enhancing throughput, has drawbacks compared to the original one.

For CRR, we solved this issue by decoupling the processing (copying objects and producing replication status) from the delivery report management of the replication status topic, which in turn allows committing offsets on the consumer side later, after the delivery reports from the replication status topic have been received. The processing is limited by a "concurrency" parameter, to limit how many CRR operations we do in parallel, but it does not limit how many pending delivery reports we can be waiting for.

For bucket notifications, the processing is essentially internal node.js processing (synchronous), then producing messages to an external Kafka via a synchronous call, and waiting for delivery reports via a callback. Here we do not have the same need to decouple processing from waiting for delivery reports. Hence, it looks acceptable to instead raise the concurrency of the BackbeatConsumer processing, which will define both how many notifications we process in parallel (essentially synchronous CPU usage) and how many delivery reports we wait for in parallel before consuming more messages from the internal Kafka queue.

We could also tune the producer "poll" interval, hard-coded to 2 seconds now, which defines how quickly we become aware of pending delivery reports. But raising the concurrency high enough should increase the throughput of notifications while keeping the tie between consumer and producer, so that the internal Kafka queue really acts as the buffer, rather than the internal producer memory (which can run out and trigger "Queue full" errors).

TLDR; what we should do:
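The proposed direction (keep delivery reports, raise consumer concurrency) can be sketched as a small concurrency-limited processing loop; the names here are hypothetical and do not reflect Backbeat's actual BackbeatConsumer API:

```javascript
// Hypothetical sketch: process queue entries with bounded concurrency,
// where "processing" an entry completes only once its Kafka delivery
// report callback fires, so pending reports stay bounded too.
function processWithConcurrency(entries, concurrency, processOne, onDone) {
    if (entries.length === 0) {
        return onDone();
    }
    let inFlight = 0;
    let nextIndex = 0;
    let finished = 0;
    function fill() {
        // Start entries until the concurrency limit or the end of the
        // batch is reached; refill as completions free up slots.
        while (inFlight < concurrency && nextIndex < entries.length) {
            inFlight += 1;
            const entry = entries[nextIndex++];
            processOne(entry, () => {
                inFlight -= 1;
                finished += 1;
                if (finished === entries.length) {
                    onDone();
                } else {
                    fill();
                }
            });
        }
    }
    return fill();
}
```

With a high enough `concurrency`, throughput increases while the consumer still applies back-pressure: no more than `concurrency` delivery reports are ever outstanding, so the internal Kafka queue, not the producer's local memory, absorbs bursts.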
Notification QueueProcessor doesn't wait for Kafka delivery reports anymore, in order to speed up delivery of messages.
Also fix and improve logging related to processing errors and error reporting from delivery reports.
adjust notification producer params
Adjust `batch.num.messages` and `queue.buffering.max.ms` values to improve throughput a bit.
Note: those values have been applied in the field with success; they may not be the best for every customer, but they gave good results, so we apply them on that basis.