
bf: BB-419 notification: don't wait for delivery report #2421

Conversation

jonathan-gramain
Contributor

  • Notification QueueProcessor no longer waits for Kafka delivery reports, in order to speed up delivery of messages.

    Also fix and improve logging related to processing errors and to error reporting from delivery reports.

  • adjust notification producer params

    Adjust `batch.num.messages` and `queue.buffering.max.ms` values to improve throughput a bit.

    Note: those values have been applied successfully in the field; they may not be optimal for every customer, but they gave good results, so we are applying them on that basis.
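The two librdkafka producer properties named in the commit can be sketched as a plain config object. The values below are illustrative only; the PR does not quote its exact numbers in this thread:

```javascript
// Illustrative sketch only: the actual values applied by the PR are not
// quoted here. Both keys are standard librdkafka producer properties,
// passed through as-is by node-rdkafka.
const producerParams = {
    // max number of messages batched into one MessageSet before sending
    'batch.num.messages': 10000,
    // max time (ms) a message waits in the local queue before batching;
    // higher values trade latency for throughput
    'queue.buffering.max.ms': 50,
};
```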

@bert-e
Contributor

bert-e commented Jun 14, 2023

Hello jonathan-gramain,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented Jun 14, 2023

Incorrect fix version

The Fix Version/s in issue BB-419 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 7.10.11

  • 7.70.2

  • 8.4.26

  • 8.5.5

  • 8.6.23

Please check the Fix Version/s of BB-419, or the target
branch of this pull request.

@jonathan-gramain
Contributor Author

ping

@bert-e
Contributor

bert-e commented Jun 14, 2023

Conflict

A conflict has been raised during the creation of
integration branch w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport with contents from bugfix/BB-419-notificationShouldntWaitForDeliveryReport
and development/7.70.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport origin/development/7.70
 $ git merge origin/bugfix/BB-419-notificationShouldntWaitForDeliveryReport
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/7.70/bugfix/BB-419-notificationShouldntWaitForDeliveryReport

Contributor

@francoisferrand francoisferrand left a comment


Regarding the "not waiting for delivery report" change, I doubt this will really solve the problem:

  • If the target Kafka is really slow and we simply don't wait for the report, we will still enqueue messages, and will thus hit the same issue at another layer (because everything will be queued in librdkafka).
    → Either the delay is temporary, and the system should be sized to accommodate the occasional delay; we should not queue too much, and it will continue without losing notifications (in that case we should keep the delivery report, and keep using BackbeatConsumer to avoid too many in-flight requests).
    → Or the target just cannot keep up with the flow of notifications: it may be better to drop the extraneous messages before even trying to send them (e.g. filter "old" notifications before sending them), with a configurable max delay.
    → Or even automatically disable bucket notification to this target when we detect it is down.
  • In 8.x/Artesca we compute metrics for bucket notifications, which rely on the completion callback: this change will create merge conflicts, but more importantly it cannot be disabled on those branches.
    → At the very least, we need to add a flag to allow this feature to be activated.
  • Technically, we are still using BackbeatConsumer... but the rate-limiting logic gets utterly broken: if we were to go down that road, the code should be refactored to make it simpler in that case, IMHO.

On the other hand, if the issue is not that "the target cluster is too slow for the workload", but only that the librdkafka binding does not make efficient use of node.js' workers (e.g. it blocks one worker for the duration of the call), which kind of breaks the process: then indeed it may help, but at the cost of control (no handling of retries, no metrics...). And it may be better addressed by switching to another Kafka client library (kafkajs?)...

All in all, I think either the problem is not properly/fully documented (e.g. the root cause), or we are just trying to solve a symptom...
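One of the alternatives suggested above, dropping "old" notifications before sending with a configurable max delay, could look roughly like this. The names and the threshold value are hypothetical, not part of the PR or of Backbeat:

```javascript
// Hypothetical sketch of the "configurable max delay" idea from the
// review; none of these names exist in Backbeat.
const maxNotificationDelayMs = 60000; // illustrative threshold

function dropStaleNotifications(entries, nowMs = Date.now()) {
    // keep only notifications younger than the configured max delay;
    // anything older is dropped instead of being sent to a slow target
    return entries.filter(e => nowMs - e.timestampMs <= maxNotificationDelayMs);
}
```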

@jonathan-gramain
Copy link
Contributor Author

After an offline discussion with @francoisferrand, it looks better to change how we solve this problem.

The initial approach of not waiting for delivery reports, while clearly improving throughput, has drawbacks compared to the original one:

  • it no longer allows committing consumer offsets on the consumer side once the producer has acknowledged the delivery
  • as a consequence, we have no way to hold off consuming from the internal Kafka queue while the external one has issues or is just too slow

For CRR, we solved this issue by decoupling the processing (copying objects and producing the replication status) from the delivery report management of the replication status topic, which in turn allows committing offsets on the consumer side later, after the delivery reports from the replication status topic have been received. The processing is limited by a "concurrency" parameter, which caps how many CRR operations we run in parallel, but does not cap how many pending delivery reports we may be waiting for.

For bucket notifications, the processing is essentially internal node.js processing (synchronous), then producing messages to an external Kafka via a synchronous call and waiting for delivery reports via a callback. Here, we do not have the same need to decouple processing from waiting for delivery reports. Hence, it looks acceptable to instead raise the concurrency of the BackbeatConsumer processing, which defines both how many notifications we process in parallel (essentially synchronous CPU usage) and how many delivery reports we wait for in parallel before consuming more messages from the internal Kafka queue. We could also tune the producer "poll" interval, currently hard-coded to 2 seconds, which defines how long it takes to become aware of pending delivery reports. But raising the concurrency high enough should increase the throughput of notifications while keeping the tie between consumer and producer, so that the internal Kafka queue really acts as the buffer, rather than the internal producer memory (which can run out and trigger "Queue full" errors).
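The role the concurrency parameter plays here can be sketched with a minimal limiter (this is illustrative code, not Backbeat's actual implementation): each slot is held from the time an entry is picked up until its delivery report arrives, so at most `concurrency` delivery reports are ever pending, and consumption from the internal Kafka queue naturally stalls when all slots are busy.

```javascript
// Minimal sketch (not Backbeat code): a concurrency limiter that bounds
// how many entries are in flight, where "in flight" lasts until the
// (simulated) delivery-report callback fires.
function makeLimiter(concurrency) {
    let active = 0;
    const waiting = [];
    return async function run(task) {
        // wait for a free slot before starting the task
        while (active >= concurrency) {
            await new Promise(resolve => waiting.push(resolve));
        }
        active += 1;
        try {
            return await task();
        } finally {
            // slot is released only once the task (including its
            // delivery report) has completed; wake one waiter
            active -= 1;
            const next = waiting.shift();
            if (next) next();
        }
    };
}

const limit = makeLimiter(100);
function processEntry(entry) {
    return limit(() => new Promise(resolve => {
        // produce(entry) ... then resolve when the delivery-report
        // callback fires (simulated here with setImmediate)
        setImmediate(() => resolve(entry));
    }));
}
```

With this shape, raising the concurrency from 10 to 100 multiplies how many delivery reports can be outstanding at once, without letting the producer's local queue grow unboundedly.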

TL;DR: what we should do:

  • raise the default concurrency of notification consumers, say from 10 to 100 or more
  • make the concurrency configurable in Federation, as done for CRR
  • ensure we can change the number of bucket notification containers and partitions to scale out consumption of notifications (this should already be the case)
  • optionally, as a nice-to-have, we could make the producer poll interval configurable as well (to reduce the time it takes to fully process an entry), though changing the concurrency alone should be enough
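Put together, the action items above amount to exposing a couple of knobs. The key names below are made up for illustration and are not the actual Backbeat/Federation configuration keys:

```javascript
// Hypothetical config fragment: key names are illustrative only, not
// the real Backbeat/Federation schema.
const notificationQueueProcessorConfig = {
    // how many entries are processed in parallel, and therefore how many
    // delivery reports may be pending at once (raised from 10)
    concurrency: 100,
    // producer poll interval in ms (hard-coded to 2000 in the code this
    // PR discusses); nice-to-have to make it configurable
    pollIntervalMs: 2000,
};
```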
