MessageTimeToLiveChecker clogs the log stream with commands #11762
Maybe we can find a solution that we can also apply to the other cases, like job timeouts, timer triggers, multi-instance activation, etc. I'm not sure what your idea for solving this was, but I was wondering whether we could introduce a new command that contains the count or the keys as a list. For example, instead of writing 1k message delete commands, we write one command with a list of all the keys we want to delete.
Thanks @Zelldon those are interesting points.
Absolutely, we should keep this in mind. I'm not sure how much of this issue will result in changes that are reusable for those concepts, it may be more relevant for #11761 or other issues coming out of #11591
We will tackle this idea in a separate issue. It's already tracked in #11591, but no issue has been created for it yet.
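As a thought experiment, the batching idea above could be modeled roughly as follows. This is a hypothetical sketch, not the actual Zeebe record protocol: the record names `ExpireMessageCommand` and `ExpireMessagesBatchCommand` are invented for illustration. The point is simply that one batch command carrying N keys replaces N individual commands on the log.

```java
import java.util.List;

// Hypothetical sketch: instead of appending one EXPIRE command per message,
// a single batch command carries all keys to expire. Both record names are
// invented for illustration; they are not part of the Zeebe protocol.
record ExpireMessageCommand(long messageKey) {}

record ExpireMessagesBatchCommand(List<Long> messageKeys) {
  // One batch command stands in for messageKeys.size() individual commands.
  int count() {
    return messageKeys.size();
  }
}
```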
@romansmirnov it appears that this is not possible without altering the interface between the Stream Processor and the Engine significantly. The reason is that the checker is just a scheduled Task. When executed, a task receives a TaskResultBuilder, which can be used to construct the Record Batch by appending commands to the builder. The Stream Processor then takes that record batch and appends it to the log. There is currently no way for the checker to provide a smaller record batch and then continue appending to a new record batch. I suggest we simply limit the number of commands appended to the record batch, i.e., the task result. Note that this means that after the task has completed, the checker is rescheduled in 60 seconds. We can make this interval configurable in a separate issue. @romansmirnov Is that okay with you?
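The limiting suggested above could look something like this. A minimal sketch, assuming invented names: `LimitedTaskResultBuilder` and `tryAppendExpireCommand` are stand-ins for the stream processor's actual `TaskResultBuilder` interface, which this comment thread does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of limiting how many commands a scheduled task appends
// to a single task result. All names here are illustrative stand-ins, not the
// real Zeebe Stream Processor API.
final class LimitedTaskResultBuilder {
  private final int limit;
  private final List<Long> commands = new ArrayList<>();

  LimitedTaskResultBuilder(final int limit) {
    this.limit = limit;
  }

  /** Appends a command unless the limit is reached; returns whether it was appended. */
  boolean tryAppendExpireCommand(final long messageKey) {
    if (commands.size() >= limit) {
      return false;
    }
    commands.add(messageKey);
    return true;
  }

  /** The commands that make up the task result, i.e. one record batch. */
  List<Long> build() {
    return List.copyOf(commands);
  }
}
```

The checker would stop iterating as soon as `tryAppendExpireCommand` returns `false`, leaving the remaining expired messages for the next run.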
One thing to consider here is that this should not result in more expired messages accumulating in the state. The use case we are considering had around 3k messages produced per minute. So 3k messages should also be deleted per minute; otherwise, the state will grow unbounded.
@deepthidevaki probably a good test/benchmark to verify afterwards
I agree with @deepthidevaki, we should make sure that it does not result in more expired messages accumulating in the state. Only limiting the number of commands to append and making the interval configurable does not prevent this from happening. I believe there could be different approaches to achieving the desired behavior, i.e., submitting batches with a limited number of commands while still continuing to collect the next one. Some potential approaches:
When the checker gets invoked after self-scheduling, it still references the last received expired message key (and the boundary); that way, it could continue iterating from that key again and collect the next 10 messages. When there are no expired messages left, it will schedule the checker by applying the 60s delay (i.e., the configurable interval). The second approach would require changes on both sides of the abstraction: in the checker, in zb-db (i.e., the iterator getting a start position), and maybe in the scheduling service to make it work. @deepthidevaki & @korthout, what do you think?
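The "remember the last key and resume from there" approach sketched above can be illustrated with a small model. This is a sketch under stated assumptions: the `TreeMap` stands in for the RocksDB-backed column family, and `tailMap` plays the role of an iterator seeked to a start position; `ResumableExpiryChecker` and `nextBatch` are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of resumable iteration: the checker remembers the key of the last
// expired message it visited and resumes iteration after it on the next run.
// TreeMap stands in for the RocksDB column family; names are illustrative.
final class ResumableExpiryChecker {
  private final TreeMap<Long, String> messages; // key -> message, in key order
  private long resumeKey = Long.MIN_VALUE;

  ResumableExpiryChecker(final TreeMap<Long, String> messages) {
    this.messages = messages;
  }

  /** Collects at most batchSize keys per run, continuing where the last run stopped. */
  List<Long> nextBatch(final int batchSize) {
    final List<Long> batch = new ArrayList<>();
    // tailMap(resumeKey, false) mimics seeking the iterator past the last visited key.
    for (final long key : messages.tailMap(resumeKey, false).keySet()) {
      if (batch.size() == batchSize) {
        break;
      }
      batch.add(key);
      resumeKey = key;
    }
    return batch;
  }
}
```

An empty batch would signal that no expired messages remain, at which point the checker reschedules itself with the regular interval.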
I much prefer the second option:
If we were to add an option to start the Iterator at a specific start position (e.g. seek to a prefix), then I see two options:
Since ZDP owns the EDIT: Closing this happened by misclick 😅
Hey @korthout I guess it makes sense to add something like an overload of the method; it shouldn't be that hard, I guess. We already use the seek internally in these methods anyway, due to our column family usage (we have only one column family, but all keys are prefixed with the enum ordinal). As you described, a separate seek doesn't really make sense, since we would need to keep the iterator or the prefix in memory, and we don't know when to throw it away again. @megglos this is something we would need to provide. @korthout can you create a separate issue for this?
The second option is cleaner in my opinion also.
@korthout, thanks for your feedback! I agree with what you and @deepthidevaki and @Zelldon wrote about preferring the second option. Besides that, I don't have much to add anymore. Thanks for all your input!
11785: feat: start db iteration at a specified key r=oleschoenburg a=oleschoenburg Adds an additional method to `ColumnFamily`, `whileTrue` with a `startAt` key. relates to #11762 Co-authored-by: Christopher Zell <zelldon91@googlemail.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
11782: [Backport stable/8.1] feat: start db iteration at a specified key r=oleschoenburg a=oleschoenburg Adds an additional method to `ColumnFamily`, `whileTrue` with a `startAt` key. relates to #11762 Co-authored-by: Christopher Zell <zelldon91@googlemail.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
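The `whileTrue` overload with a `startAt` key that these PRs add can be modeled as follows. This is a simplified in-memory sketch, not the real zb-db implementation (which seeks a RocksDB iterator); the class name `InMemoryColumnFamily` is invented, and only the shape of the `whileTrue(startAt, visitor)` call mirrors the PR description.

```java
import java.util.TreeMap;
import java.util.function.BiPredicate;

// Simplified model of a whileTrue(startAt, visitor) iteration: visit entries
// in key order beginning at startAt (inclusive) while the visitor returns
// true. The real ColumnFamily seeks a RocksDB iterator instead of a TreeMap.
final class InMemoryColumnFamily<K, V> {
  private final TreeMap<K, V> entries = new TreeMap<>();

  void put(final K key, final V value) {
    entries.put(key, value);
  }

  /** Iterates from startAt (inclusive) in key order while the visitor returns true. */
  void whileTrue(final K startAt, final BiPredicate<K, V> visitor) {
    for (final var entry : entries.tailMap(startAt, true).entrySet()) {
      if (!visitor.test(entry.getKey(), entry.getValue())) {
        return;
      }
    }
  }
}
```

With this shape, the checker can pass the last visited key (or its successor) as `startAt` to resume a previously interrupted iteration.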
11856: [stable/8.1] Limit expire message commands in result r=korthout a=korthout ## Description This adds a virtual limit to the number of EXPIRE Message commands the `MessageTimeToLiveChecker` adds to a result when executed, to prevent it from clogging the log stream with too many of these commands. I suggest reviewing this pull request per commit. ## Related issues closes #11762 Co-authored-by: Nico Korthout <nico.korthout@camunda.com> Co-authored-by: Nico Korthout <korthout@users.noreply.github.com>
Describe the bug

The `MessageTimeToLiveChecker` appends `Message:Expire` commands to the log for each message that has surpassed its TTL. If there are many messages in the state with an expired TTL, it will write many commands to the log in a single batch. This means the engine doesn't have time to process anything else until it has processed all the commands in that batch. This can lead to latency spikes, see #11591.

To Reproduce

Expected behavior

The `MessageTimeToLiveChecker` should limit the number of message expiration commands it appends in a single batch. Since there can still be more messages with an expired TTL in the state (and the state is already being iterated), the checker should simply continue after appending the batch by starting to append to another batch. Once the checker is out of messages to expire, it can be rescheduled to run after the fixed interval that currently already exists.