
bf: BB-419 notification: don't wait for delivery report #2421

Conversation

jonathan-gramain
Contributor

  • Notification QueueProcessor no longer waits for Kafka delivery reports, in order to speed up delivery of messages.

    Also fix and improve logging related to processing errors and to error reporting from delivery reports.

  • adjust notification producer params

    Adjust `batch.num.messages` and `queue.buffering.max.ms` values to improve throughput a bit.

    Note: those values have been applied successfully in the field; they may not be optimal for every customer, but they gave good results, so we are applying them on that basis.
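The two librdkafka producer properties named in the commit can be sketched as a plain config object. The values below are illustrative only; the PR does not quote its exact numbers in this thread:

```javascript
// Illustrative sketch only: the actual values applied by the PR are not
// quoted here. Both keys are standard librdkafka producer properties,
// passed through as-is by node-rdkafka.
const producerParams = {
    // max number of messages batched into one MessageSet before sending
    'batch.num.messages': 10000,
    // max time (ms) a message waits in the local queue before batching;
    // higher values trade latency for throughput
    'queue.buffering.max.ms': 50,
};
```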

@bert-e
Contributor

bert-e commented Jun 14, 2023

Hello jonathan-gramain,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented Jun 14, 2023

Incorrect fix version

The Fix Version/s in issue BB-419 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 7.10.11

  • 7.70.2

  • 8.4.26

  • 8.5.5

  • 8.6.23

Please check the Fix Version/s of BB-419, or the target
branch of this pull request.

@jonathan-gramain
Contributor Author

ping

@bert-e
Contributor

bert-e commented Jun 14, 2023

Conflict

A conflict has been raised during the creation of
integration branch w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport with contents from bugfix/BB-419-notificationShouldntWaitForDeliveryReport
and development/7.70.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport origin/development/7.70
 $ git merge origin/bugfix/BB-419-notificationShouldntWaitForDeliveryReport
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport

Contributor

@francoisferrand francoisferrand left a comment


Regarding the "not waiting for delivery report" change, I doubt this will really solve the problem:

  • If the target Kafka is really slow and we simply don't wait for the report, we will still enqueue messages, and will thus hit the same issue at another layer (because everything will be queued in librdkafka).
    → Either the delay is temporary, and the system should be sized to accommodate the occasional delay; we should not queue too much, and it will continue without losing notifications (in that case we should keep the delivery report, and keep using BackbeatConsumer to avoid too many in-flight requests).
    → Or the target just cannot keep up with the flow of notifications: it may be better to drop the extraneous messages before even trying to send them (e.g. filter "old" notifications before sending them), with a configurable max delay.
    → Or even automatically disable bucket notification to this target when we detect it is down.
  • In 8.x/Artesca we compute metrics for bucket notifications, which rely on the completion callback: this change will create merge conflicts, but more importantly it cannot be disabled on those branches.
    → At the very least, we need to add a flag to allow this feature to be activated.
  • Technically, we are still using BackbeatConsumer... but the rate-limiting logic gets utterly broken: if we were to go down that road, the code should be refactored to make it simpler in that case, IMHO.

On the other hand, if the issue is not that "the target cluster is too slow for the workload", but only that the librdkafka binding does not make efficient use of node.js' workers (e.g. it blocks one worker for the duration of the call), which kind of breaks the process: then indeed it may help, but at the cost of control (no handling of retries, no metrics...). And it may be better addressed by switching to another Kafka client library (kafkajs?)...

All in all, I think either the problem is not properly/fully documented (e.g. the root cause), or we are just trying to solve a symptom...
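One of the alternatives suggested above, dropping "old" notifications before sending with a configurable max delay, could look roughly like this. The names and the threshold value are hypothetical, not part of the PR or of Backbeat:

```javascript
// Hypothetical sketch of the "configurable max delay" idea from the
// review; none of these names exist in Backbeat.
const maxNotificationDelayMs = 60000; // illustrative threshold

function dropStaleNotifications(entries, nowMs = Date.now()) {
    // keep only notifications younger than the configured max delay;
    // anything older is dropped instead of being sent to a slow target
    return entries.filter(e => nowMs - e.timestampMs <= maxNotificationDelayMs);
}
```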

@jonathan-gramain
Copy link
Contributor Author

After an offline discussion with @francoisferrand, it looks better to change how we solve this problem.

The initial approach of not waiting for delivery reports, while clearly improving throughput, has drawbacks compared to the original one:

  • it no longer allows committing consumer offsets on the consumer side once the producer has acknowledged the delivery
  • as a consequence, we have no way to hold off consuming from the internal Kafka queue while the external one has issues or is just too slow

For CRR, we solved this issue by decoupling the processing (copying objects and producing the replication status) from the delivery report management of the replication status topic, which in turn allows committing offsets on the consumer side later, after the delivery reports from the replication status topic have been received. The processing is limited by a "concurrency" parameter, which caps how many CRR operations we run in parallel, but does not cap how many pending delivery reports we may be waiting for.

For bucket notifications, the processing is essentially internal node.js processing (synchronous), then producing messages to an external Kafka via a synchronous call and waiting for delivery reports via a callback. Here, we do not have the same need to decouple processing from waiting for delivery reports. Hence, it looks acceptable to instead raise the concurrency of the BackbeatConsumer processing, which defines both how many notifications we process in parallel (essentially synchronous CPU usage) and how many delivery reports we wait for in parallel before consuming more messages from the internal Kafka queue. We could also tune the producer "poll" interval, currently hard-coded to 2 seconds, which defines how long it takes to become aware of pending delivery reports. But raising the concurrency high enough should increase the throughput of notifications while keeping the tie between consumer and producer, so that the internal Kafka queue really acts as the buffer, rather than the internal producer memory (which can run out and trigger "Queue full" errors).
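The role the concurrency parameter plays here can be sketched with a minimal limiter (this is illustrative code, not Backbeat's actual implementation): each slot is held from the time an entry is picked up until its delivery report arrives, so at most `concurrency` delivery reports are ever pending, and consumption from the internal Kafka queue naturally stalls when all slots are busy.

```javascript
// Minimal sketch (not Backbeat code): a concurrency limiter that bounds
// how many entries are in flight, where "in flight" lasts until the
// (simulated) delivery-report callback fires.
function makeLimiter(concurrency) {
    let active = 0;
    const waiting = [];
    return async function run(task) {
        // wait for a free slot before starting the task
        while (active >= concurrency) {
            await new Promise(resolve => waiting.push(resolve));
        }
        active += 1;
        try {
            return await task();
        } finally {
            // slot is released only once the task (including its
            // delivery report) has completed; wake one waiter
            active -= 1;
            const next = waiting.shift();
            if (next) next();
        }
    };
}

const limit = makeLimiter(100);
function processEntry(entry) {
    return limit(() => new Promise(resolve => {
        // produce(entry) ... then resolve when the delivery-report
        // callback fires (simulated here with setImmediate)
        setImmediate(() => resolve(entry));
    }));
}
```

With this shape, raising the concurrency from 10 to 100 multiplies how many delivery reports can be outstanding at once, without letting the producer's local queue grow unboundedly.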

TL;DR: what we should do:

  • raise the default concurrency of notification consumers, say from 10 to 100 or more
  • make the concurrency configurable in Federation, as done for CRR
  • ensure we can change the number of bucket notification containers and partitions to scale out consumption of notifications (this should already be the case)
  • optionally, as a nice-to-have, we could make the producer poll interval configurable as well (to reduce the time it takes to fully process an entry), though changing the concurrency alone should be enough
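Put together, the action items above amount to exposing a couple of knobs. The key names below are made up for illustration and are not the actual Backbeat/Federation configuration keys:

```javascript
// Hypothetical config fragment: key names are illustrative only, not
// the real Backbeat/Federation schema.
const notificationQueueProcessorConfig = {
    // how many entries are processed in parallel, and therefore how many
    // delivery reports may be pending at once (raised from 10)
    concurrency: 100,
    // producer poll interval in ms (hard-coded to 2000 in the code this
    // PR discusses); nice-to-have to make it configurable
    pollIntervalMs: 2000,
};
```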
