Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(eventbus): warn applications whenever events aren't read #2026

Closed

Conversation

Wondertan
Copy link
Contributor

Adds logging for the EventBus, Mainly a periodical warning for applications whenever they are slow in reading or do not read EventBus notifications. It came up after debugging a very tricky case where a bugged Server side application wasn't reading EvtConnectednessChanged events, which manifested in the server's inability to accept new streams as connection handling logic deadlocked on sync Notifee's Connected callbacks which emit the events.

P.S. This change should save some time for future application developers.

Adds logging for the EventBus, Mainly a periodical warning for applications whenever they
are slow in reading or do not read EventBus notifications. It came up after debugging a
very tricky case where a bugged Server side application wasn't reading `EvtConnectednessChanged`
events, which manifested in the server's inability to accept new streams as connection handling
logic deadlocked on _sync_ Notifee's  `Connected` callbacks which emit the events.

P.S. This change should save some time for future application developers.
Copy link
Contributor

@marten-seemann marten-seemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't introduce a timer here. Emitting multiple log entries because you might miss a single entry doesn't sound very convincing to me. This will just spam the log.

@Wondertan
Copy link
Contributor Author

  • Timer/ticker prevents false warnings. There could be an ordinary case of a connection burst or another event that fills up the buffer to the point where the default case is hit. emitWithWarn gives the reader some time buffer to process events instead of warning instantly. Hitting default does not necessarily mean the reader is slow.
  • Spamming is not an issue if its goal is to bring attention to something critical. E.g., for CANONICAL logs, we don't produce only one log, but multiple, with sampling rate being a configuration to prevent over-spamming. Similarly, in this case, we can configure how often the warning should be produced(and after what time we see blocked event worth to be logged out)

@marten-seemann
Copy link
Contributor

marten-seemann commented Jan 29, 2023

Starting a timer to emit a log message seems very heavy. If that’s needed due to burstiness, I’m beginning to wonder if this is the right approach at all. Maybe building proper monitoring would be the better way?

I’m strongly opposed to spamming the logs. Just because a log entry is important is no justification for repeating it 10 times (per second!).

E.g., for CANONICAL logs, we don't produce only one log, but multiple, with sampling rate being a configuration to prevent over-spamming.

That case is quite the opposite. These are all separate events, and we sample them to dial down the log frequency. If at all, this is an argument why we should not spam the log.

@aviscode
Copy link

Please don't introduce a timer here. Emitting multiple log entries because you might miss a single entry doesn't sound very convincing to me. This will just spam the log.

i think we can have a retry of 3 times without a timer @marten-seemann what do u think ?

@marten-seemann
Copy link
Contributor

Please don't introduce a timer here. Emitting multiple log entries because you might miss a single entry doesn't sound very convincing to me. This will just spam the log.

i think we can have a retry of 3 times without a timer @marten-seemann what do u think ?

There's no point in retrying if you don't wait in between.

@aviscode
Copy link

Please don't introduce a timer here. Emitting multiple log entries because you might miss a single entry doesn't sound very convincing to me. This will just spam the log.

i think we can have a retry of 3 times without a timer @marten-seemann what do u think ?

There's no point in retrying if you don't wait in between.

can we wait with a sleep not with a timer ?

@marten-seemann
Copy link
Contributor

@Wondertan Can you have a look at the dashboard we're adding in #2038. There are two dashboards that should have helped you detect the bug:

  1. Subscriber Queue Length: You would have seen the queue for one of your subscribers to monotonically grow. image
  2. Subscriber Queue Filled Up: You would've seen that subscriber turn red in the state timeline: image

@marten-seemann
Copy link
Contributor

Closing since this has been addressed by adding metrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants