Feat[MQB]: Enhance queue consumption monitor alarm log with additional details #420

Draft
wants to merge 29 commits into
base: main

Collaborator

@alexander-e1off alexander-e1off commented Sep 13, 2024

When a queue starts to fill up, it is valuable to see which AppIds are impacted and what messages are in the queue.
This matters especially for subscriptions (which we are now enabling for everyone): messages that match no subscription expression build up in the put-aside list.
To make this situation clearer to operators and users (which apps are impacted, why messages are building up, how old the head of the queue is for each app, etc.), we can log more information when the watermark alarm is triggered:

  • Sizes of the put-aside list and the redelivery list for each app of that queue;
  • Timestamp and message properties of the oldest message in the put-aside list;
  • Number of unconfirmed messages;
  • Total size of messages for each app (though this already seems to be covered by storage()->capacityMeter()->printShortSummary()).

This is to help debug why a message doesn't match a subscription.

The alarm log looks like this:

ERROR mqbblp_rootqueueengine.cpp:1760 ALARM [QUEUE_STUCK] Queue 'bmq://bmq.test.mem.fanout/my1?id=foo' Messages [current: 2 / 1,000], Bytes [current: 66  B / 1.00 MB], max idle time 20.00 s appears to be stuck. It currently has 2 consumers.
  1. bmqtool.tsk:369163@127.0.0.1~localhost:35002
    Handle Parameters .....: [ uri = "bmq://bmq.test.mem.fanout/my1" qId = 0 subIdInfo = NULL flags = 6 readCount = 1 writeCount = 1 adminCount = 0 ]
    Unconfirmed messages count: 0
    UnconfirmedMonitors ....:
  0x7f166003f8a0
  0x7f166001bfa0
  2. bmqtool.tsk:369195@127.0.0.1~localhost:56108
    Handle Parameters .....: [ uri = "bmq://bmq.test.mem.fanout/my1" qId = 0 subIdInfo = NULL flags = 2 readCount = 1 writeCount = 0 adminCount = 0 ]
    Unconfirmed messages count: 0
    UnconfirmedMonitors ....:
  0x7f1660047280
  0x7f168c0011a0

Put aside list size: 2
Redelivery list size: 0

Consumer subscription expressions: 
y == 4
x == 2
y == 3
x == 1

Oldest message in a 'Put aside' list:
GUID                              Size        Timestamp (UTC)
40000000000E0907FBA4D5FCCF107C0E       55  B  13SEP2024_12:41:25.691464+0000
Message Properties: [ sample_str (STRING) = "foo bar" x (INT32) = 10 ]

10 oldest messages in the queue:
Printing 2 message(s) [0-1 / 2] (total: 66  B)
       GUID                              Size        Timestamp (UTC)
    0: 40000000000E0907FBA4D5FCCF107C0E       55  B  13SEP2024_12:41:25.691464+0000
    1: 40000100000F4FC18E1CD5FCCF107C0E       11  B  13SEP2024_12:41:31.172941+0000

Current head of the queue:
GUID                              Size        Timestamp (UTC)
40000000000E0907FBA4D5FCCF107C0E       55  B  13SEP2024_12:41:25.691464+0000

Implementation details:

  • The alarm logging logic is moved from QueueConsumptionMonitor into the RootQueueEngine class, where more data is available;
  • A callback is passed to QueueConsumptionMonitor and is called when an alarm fires, to log the alarm data;
  • TODO: unit test for RootQueueEngine::logAlarmCb.
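
The callback arrangement described above can be illustrated with a minimal sketch. The class names `ConsumptionMonitorSketch` and `QueueEngineSketch`, the callback signature, and the log text are all hypothetical stand-ins, not the actual BlazingMQ types: the point is only that the monitor decides when to alarm, while the engine, which owns the per-app data, decides what to log.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the monitor: it knows *when* an app becomes
// idle, but not how to describe the queue state.
class ConsumptionMonitorSketch {
  public:
    // Invoked with the id of the app that transitioned to idle.
    typedef std::function<void(const std::string& appId)> LoggingCb;

    explicit ConsumptionMonitorSketch(LoggingCb cb)
    : d_loggingCb(std::move(cb))
    {
        assert(d_loggingCb && "a logging callback is required");
    }

    // Called when the monitor decides that the app is stuck.
    void onTransitionToIdle(const std::string& appId) { d_loggingCb(appId); }

  private:
    LoggingCb d_loggingCb;
};

// Hypothetical stand-in for the engine: it owns the data needed to build
// the alarm text and hands its logging routine to the monitor.
class QueueEngineSketch {
  public:
    QueueEngineSketch()
    : d_monitor([this](const std::string& appId) { logAlarm(appId); })
    {
    }

    // The callback captures 'this', so copying is disabled.
    QueueEngineSketch(const QueueEngineSketch&)            = delete;
    QueueEngineSketch& operator=(const QueueEngineSketch&) = delete;

    ConsumptionMonitorSketch&       monitor() { return d_monitor; }
    const std::vector<std::string>& alarms() const { return d_alarms; }

  private:
    void logAlarm(const std::string& appId)
    {
        // Collected in memory here; a real engine would write to its log.
        d_alarms.push_back("ALARM [QUEUE_STUCK] app '" + appId + "'");
    }

    std::vector<std::string> d_alarms;
    ConsumptionMonitorSketch d_monitor;
};
```

Calling `engine.monitor().onTransitionToIdle("foo")` on a `QueueEngineSketch` appends one alarm line built by the engine, without the monitor knowing anything about its contents.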

alexander-e1off and others added 27 commits March 28, 2024 18:28
Signed-off-by: Aleksandr Ivanov <aivanov71@bloomberg.net>
Collaborator

@dorjesinpo dorjesinpo left a comment


Looks good. A few questions.

  • Maybe, we can print Storage::numMessages and Storage::numBytes as well?

  • Can we get rid of QueueEngineUtil_AppState::head() and QueueConsumptionMonitor::SubStreamInfo::d_headCb now?

  • Is QueueConsumptionMonitor::onTransitionToIdle "level triggered" (vs "edge triggered")? It can be noisy; how often do we want that log?

@chrisbeard please take a look at the output

@alexander-e1off
Collaborator Author

Regarding "Maybe, we can print Storage::numMessages and Storage::numBytes as well?": CapacityMeter::printShortSummary() already prints numMessages and numBytes, e.g.

Messages [current: 2 / 1,000], Bytes [current: 66  B / 1.00 MB]

Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they printed the same values. Are there any scenarios in which they would differ?

@alexander-e1off
Collaborator Author

alexander-e1off commented Sep 16, 2024

Regarding "Can we get rid of QueueEngineUtil_AppState::head() and QueueConsumptionMonitor::SubStreamInfo::d_headCb now?": we now print the oldest message from the put-aside list, but what happens if the put-aside list is empty and the redelivery list is not? In the example below there are 4 messages, and I limited the output to the 3 oldest. Without printing the head, we don't know what the last message in the queue is:

ERROR mqbblp_rootqueueengine.cpp:1766 ALARM [QUEUE_STUCK] Queue 'bmq://bmq.test.mem.fanout/my1?id=foo' Messages [current: 4 / 1,000], Bytes [current: 44  B / 1.00 MB], max idle time 20.00 s appears to be stuck. It currently has 0 consumers.

Put aside list size: 0
Redelivery list size: 3

3 oldest messages in the queue:
Printing 3 message(s) [0-2 / 4] (total: 33  B)
       GUID                              Size        Timestamp (UTC)
    0: 400000000011896EDA51C2376057798E       11  B  17SEP2024_12:23:02.458416+0000
    1: 400001000012F52671B0C2376057798E       11  B  17SEP2024_12:23:08.559553+0000
    2: 4000020000132AB6F3D6C2376057798E       11  B  17SEP2024_12:23:09.458243+0000

Current head of the queue:
GUID                              Size        Timestamp (UTC)
40000300001A6876353BC2376057798E       11  B  17SEP2024_12:23:40.559572+0000

Is the head info valuable in this scenario?

Regarding "Is QueueConsumptionMonitor::onTransitionToIdle "level triggered" (vs "edge triggered")?": QueueConsumptionMonitor::onTransitionToIdle is edge triggered, so onTransitionToIdle is called only once when state is transitioned from Active to Idle.
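
To make the edge-triggered versus level-triggered distinction concrete, here is a minimal sketch. `EdgeTriggeredMonitor`, its `check()` method, and the `hasProgress` flag are hypothetical names chosen for illustration and do not mirror the real QueueConsumptionMonitor interface.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class State { Active, Idle };

// Hypothetical monitor that fires its callback only on the
// Active -> Idle transition (edge triggered).
class EdgeTriggeredMonitor {
  public:
    explicit EdgeTriggeredMonitor(std::function<void()> onTransitionToIdle)
    : d_state(State::Active)
    , d_onTransitionToIdle(std::move(onTransitionToIdle))
    {
        assert(d_onTransitionToIdle && "a callback is required");
    }

    // Called on every periodic check.  'hasProgress' says whether any
    // message was consumed since the previous check.
    void check(bool hasProgress)
    {
        if (hasProgress) {
            d_state = State::Active;  // re-arm for the next transition
            return;
        }
        if (d_state == State::Active) {
            d_state = State::Idle;
            d_onTransitionToIdle();  // fires once per transition
        }
        // Already Idle: a level-triggered monitor would fire again here.
    }

    State state() const { return d_state; }

  private:
    State                 d_state;
    std::function<void()> d_onTransitionToIdle;
};
```

With this shape, three consecutive checks without progress produce a single callback invocation. A queue that alternates between progress and no progress still produces one invocation per transition.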

Signed-off-by: Aleksandr Ivanov <aivanov71@bloomberg.net>
@dorjesinpo
Collaborator

dorjesinpo commented Sep 18, 2024

Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they

Looking at the FileBackedStorage::numMessages they are the same if asking for the queue stats and not appId stats.

It may make sense to print the stats for the App (vs the entire queue).

@alexander-e1off
Collaborator Author

Aren't they the same as Storage::numMessages() and Storage::numBytes? In my test they

Looking at the FileBackedStorage::numMessages they are the same if asking for the queue stats and not appId stats.

Thanks, that makes sense. I will add Storage::numMessages and Storage::numBytes per appId (appKey).

Signed-off-by: Aleksandr Ivanov <aivanov71@bloomberg.net>
@dorjesinpo
Collaborator

Regarding Can we get rid of QueueEngineUtil_AppState::head() and QueueConsumptionMonitor::SubStreamInfo::d_headCb now?

d_headCb is QueueEngineUtil_AppState::head(), which is owned by the QueueEngine, so there seems to be no need to pass it to RootQueueEngine::logAlarmCb.
QueueEngineUtil_AppState::head() should have the answers for the empty-list scenarios, though we may want to revisit the logic. Currently, we print the put-aside list if it is not empty, and the storage otherwise.
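
The fallback rule stated above (put-aside if not empty, storage otherwise) can be sketched as follows. This is only an illustration under simplifying assumptions: `MessageGuid`, `selectReportedMessage`, and the use of plain deques of strings are hypothetical, and the real engine works with storage iterators rather than containers of GUID strings.

```cpp
#include <deque>
#include <string>

// Hypothetical message identifier, represented as a plain string here.
typedef std::string MessageGuid;

// Return the message to report for an app: the oldest entry of the
// put-aside list if that list is not empty, otherwise the oldest entry
// in the storage.  Return a null pointer if both are empty.
const MessageGuid* selectReportedMessage(
                                   const std::deque<MessageGuid>& putAside,
                                   const std::deque<MessageGuid>& storage)
{
    if (!putAside.empty()) {
        return &putAside.front();
    }
    if (!storage.empty()) {
        return &storage.front();
    }
    return nullptr;
}
```

Note that this rule does not consult the redelivery list, which is the case raised earlier in the thread (empty put-aside list, non-empty redelivery list).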

@dorjesinpo
Collaborator

QueueConsumptionMonitor::onTransitionToIdle is edge triggered, so onTransitionToIdle is called only once when state is transitioned from Active to Idle.

OK. This can flap; we often see many consecutive logs. Since we are increasing the size of what is logged, we may want to throttle the logging.
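
One possible shape for such throttling is a minimum interval between detailed alarm logs. The sketch below is an assumption for illustration, not the approach chosen in this PR: `AlarmLogThrottle` and `shouldLog` are hypothetical names, and the current time is passed in explicitly so that the behavior is easy to test.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical throttle: allows at most one detailed alarm log per
// 'minInterval', even if the monitor flaps between Active and Idle.
class AlarmLogThrottle {
  public:
    typedef std::chrono::steady_clock Clock;

    explicit AlarmLogThrottle(Clock::duration minInterval)
    : d_minInterval(minInterval)
    , d_hasLogged(false)
    , d_lastLogged()
    {
        assert(minInterval >= Clock::duration::zero());
    }

    // Return true if an alarm may be logged at time 'now', and record
    // 'now' as the time of the last log in that case.
    bool shouldLog(Clock::time_point now)
    {
        if (d_hasLogged && (now - d_lastLogged) < d_minInterval) {
            return false;  // too soon after the previous log
        }
        d_hasLogged  = true;
        d_lastLogged = now;
        return true;
    }

  private:
    Clock::duration   d_minInterval;
    bool              d_hasLogged;
    Clock::time_point d_lastLogged;
};
```

The alarm callback would consult `shouldLog(AlarmLogThrottle::Clock::now())` before printing the detailed output. A suppressed alarm could still emit a one-line summary so that the transition itself is not lost.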
