Partitions get unhealthy if many timers are scheduled #8991
Seems like we did not triage this, even though it was marked as critical. Let's talk about it tomorrow - I guess the board switch caused me to miss this one.
This also occurred in a simpler process. For example, in Roman's benchmark he wants a certain load on the system: he starts a number of process instances, e.g. 10/s, and each process waits at two points for 20 minutes. Because of the continuous creation of process instances, at any point in time there are a couple of timers ready to be triggered. Depending on the load, performance is heavily degraded.
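A back-of-the-envelope estimate (essentially Little's law) shows the steady-state timer load this benchmark implies. The 10/s and 20-minute figures are the ones quoted above; everything else is derived:

```java
// Steady-state timer load for the benchmark described above:
// 10 instances/s, each waiting 20 minutes at two points in the process.
public class TimerLoadEstimate {
    public static void main(String[] args) {
        final double startRatePerSecond = 10.0; // instances started per second
        final int waitPoints = 2;               // each instance waits twice
        final long waitSeconds = 20 * 60;       // each wait is 20 minutes

        // Little's law: pending timers = arrival rate * wait time * wait points
        final long pendingTimers =
            (long) (startRatePerSecond * waitSeconds * waitPoints);
        // Timers become due at the same rate they are created
        final double dueTimersPerSecond = startRatePerSecond * waitPoints;

        // prints: 24000 pending, 20.0 becoming due per second
        System.out.println(pendingTimers + " pending, "
            + dueTimersPerSecond + " becoming due per second");
    }
}
```

So even this "simple" steady load keeps roughly 24,000 timers pending per partition, with about 20 becoming due every second - which matches the observation that there are always a couple of timers ready to be triggered.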
9230: Renaming and javadocs r=remcowesterhoud a=pihme

## Description

- rename `dueDateKey` -> `dueDate` - because it is a due date, not a key
- rename `findTimersWithDueDateBefore(...)` -> `processTimersWithDueDateBefore(...)` - it is a more precise description of the method
- add JavaDoc comments to the `TimerInstanceState` interface

## Related issues

related to #8991

Co-authored-by: pihme <pihme@users.noreply.github.com>
Roman's workflow looked like this: [process model]. I guess it can be simplified to: [simplified model]. But maybe the other elements are necessary to add latency between the time a command is written to the log and the time a record is processed. This is just to say that it can also happen without a specially crafted process from game days.

Looking at the code, I can see the following:
So my working hypothesis is that there is an escalating feedback loop.
Could also be that this happens more prominently after a broker resumes, because much time has passed since the last run. This could also explain Roman's observation that on some days he can get stable performance and on others not. /cc @Zelldon @saig0 wdyt?
@pihme yes, that sounds correct. But there is one more aspect. A solution could be to limit the number of timers processed in one check (e.g. only 100 timers at once).
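A minimal sketch of that limiting idea, under stated assumptions: the real Zeebe checker works against a RocksDB column family rather than an in-memory map, and all names here (`dueDateColumn`, `trigger`, `rescheduleImmediately`) are invented for illustration:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.function.Consumer;

// Hypothetical sketch: cap the number of timers triggered per check so a
// single run cannot monopolize the actor thread.
final class LimitedDueTimerCheck {
    static final int MAX_TIMERS_PER_CHECK = 100;

    static void checkTimers(final NavigableMap<Long, String> dueDateColumn,
                            final long now,
                            final Consumer<String> trigger,
                            final Runnable rescheduleImmediately) {
        int triggered = 0;
        // View of all timers whose due date is <= now
        final Iterator<Map.Entry<Long, String>> dueTimers =
            dueDateColumn.headMap(now, true).entrySet().iterator();
        while (dueTimers.hasNext() && triggered < MAX_TIMERS_PER_CHECK) {
            trigger.accept(dueTimers.next().getValue());
            dueTimers.remove();
            triggered++;
        }
        // If the cap was hit and timers remain due, yield back to the actor
        // and run again, instead of draining everything in one blocking pass.
        if (dueTimers.hasNext()) {
            rescheduleImmediately.run();
        }
    }
}
```

The key design point is the immediate re-schedule at the end: other actor jobs get a chance to run between batches, at the cost of the remaining timers being triggered slightly later.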
Not sure this would be a valid solution. My fear is that it will just always fall further and further behind. My hope is that if we always trigger all due timers, this will cause delay, which will influence back pressure. The way the system is set up now, I see no other way than to let back pressure dial sustained throughput down to the speed of the slowest link in the processing chain.
Sorry for the late response. Interesting observation you made, @pihme, and it sounds right. In my case it was more like @saig0 described: the actor didn't stop writing due timers. I had many, many written records per second but no processing at all. I agree with the argument from @pihme. Not sure whether it would really help to just limit it; sounds like a band-aid 🩹 I guess optimal would be to have a separate actor which writes to the log. This would at least help with not blocking processing. Still, it can happen that we write many timers at once. We could maybe solve that via writing over the command API or something, such that back pressure is taken into account, like @pihme said. 🤷
@Zelldon Could you comment on the prioritization of actor tasks with respect to time-based tasks vs. ordinary tasks? More precisely: supposing both a due timer task and ordinary tasks are ready to run, which is executed first?
And the second question: what happens when you schedule a task with a negative delay? Because I think this can happen.
Hey @pihme, as always: it depends. In general, timers and "ordinary" tasks are in the end scheduled the same way. We have the fastlane and the queue for "lesser" priority, so we have a kind of implicit priority. What I understand from the code is that if there are jobs in the fastlane already, they are executed first, and due timers are added to the fastlane as well.

So in the above-described example, the scenario I ran into in the gameday was that the due timer checker had thousands or millions of timers which were due at the same time. It blocked the actor thread completely because it iterated over the whole column family to write a timer trigger command for each. I guess, due to the subscription logic and the "priority" of timers being scheduled in the fastlane, processing was not happening for a long time. This is my current assumption.

Regarding the negative duration: this happened to me accidentally in the gameday. It means the timer is immediately due. Does this help?
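The fastlane/queue behavior described here can be modeled in a few lines. This is a deliberately rough model of the implicit priority, not the actual actor scheduler, which is considerably more involved:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rough model of the implicit priority described above: fastlane jobs always
// run before queue jobs, and due timers land in the fastlane, so a burst of
// due timers can starve ordinary submissions for a long time.
final class FastLaneModel {
    final Deque<Runnable> fastLane = new ArrayDeque<>();
    final Deque<Runnable> queue = new ArrayDeque<>();

    void onTimerDue(Runnable job) { fastLane.addLast(job); }
    void submit(Runnable job)     { queue.addLast(job); }

    Runnable poll() {
        // The fastlane is drained first; the queue only runs when it is empty.
        return !fastLane.isEmpty() ? fastLane.pollFirst() : queue.pollFirst();
    }
}
```

In this model, an ordinary job submitted before a timer becomes due still runs after that timer, which is the starvation effect discussed in the comment.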
Summary of Preliminary Findings

There are two problems we need to solve:
The first problem was the focus of the original description of the issue. It could, for example, occur when a company wants to send their customers a thank-you email on Christmas: throughout the year, timers are collected, which then all fire simultaneously on Christmas Day. Here we see the following issues:
The second problem was observed by Roman in benchmarks. Here it is just a steady load which leads to non-deterministic blocking of the processing thread. Here we see the following issues:
Just want to share my thoughts 😄 @Zelldon, maybe a comment on your #8991 (comment) 😄
In the described scenario, with millions of timers being due simultaneously, I think the problem is not with how the actor schedules the jobs. In my opinion, the problem is the checker's loop itself: it will remain in this loop as long as there is a next timer event due. With millions of timers due simultaneously, it will execute that loop until it has visited all of the millions of timer events... and this may take some time and block processing on that partition in the meantime.
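The loop described here can be reduced to something like the following. All names are illustrative: the real checker iterates a RocksDB column family and writes a timer `TRIGGER` command per entry, rather than polling a `TreeMap`:

```java
import java.util.TreeMap;

// Reduced model of the due-timer drain loop: it only stops once no timer is
// due anymore, so with millions of due timers it runs millions of iterations
// on the actor thread before giving control back.
final class DueTimerDrain {
    static int drain(final TreeMap<Long, String> dueTimers, final long now) {
        int written = 0;
        while (!dueTimers.isEmpty() && dueTimers.firstKey() <= now) {
            dueTimers.pollFirstEntry(); // stands in for writing a TRIGGER command
            written++;
        }
        return written;
    }
}
```

The number of iterations is bounded only by how many timers are due, not by any time or batch budget - which is exactly why processing on the partition stalls.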
I don't think my scenario is much different from Chris's scenario in the gameday. The only difference is that there are more than 20k timer events on one partition, with the timers' due dates distributed over the next 20 minutes, so not all of them are due at the same time. What can happen (and does happen) is that it will remain in the loop for a long time.
Hey @romansmirnov, yes, this is what I wrote in the sentence before 😅
I agree with that point. I even tried that, but it failed with the exception in #9187 - I haven't investigated it further. Besides that, my only concern is the case where only one CPU thread is configured 😉
@romansmirnov you mean because it would block the thread again and the other actor can't run? Maybe yielding or back pressure would help here? 🤔
@Zelldon, yes, exactly.
I think yielding would help, but I haven't thought it through yet.
Update before handover: three problems were found:
With the two fixes alone, I was not able to reliably recover an unresponsive cluster. I recovered it once, by also implementing a quick fix for the third issue, but I was unable to reproduce that success. There were also a couple of unrelated problems found:
Another issue, not needed for the fix but useful to make the problem easier to diagnose: from what I could see in the server logs, it also seems to get stuck in some other place. It makes good progress, interrupted by long breaks, and I don't know what is causing the breaks. The data set is linked in one of the comments above.
However, I have not observed any commands that exceeded 1 s. Another thing to consider is whether the same fixes as for the DueDateTimerChecker would also need to be applied to job timeouts (in the logs I saw a lot of jobs timing out).
Summary

A partition gets unhealthy if many timers are scheduled for the same time (or within a short time frame). @romansmirnov ran into this issue with a process that has two timers one after the other (including some service tasks in between); the timers are scheduled for 20 minutes. The partition got unhealthy, but it recovered after a while (20 minutes - 1 hour). We will not work on this issue immediately because:
Updated severity to reflect the current team opinion described here. The issue is moved into the backlog.
## Description

<!-- Please explain the changes you made here. -->

This pull request avoids duplicate timer `TRIGGER` commands, and with them the pathological case where these would deterministically result in #17128. An integration test reproducing that scenario is provided in `TimerTriggerSchedulingTest`.

This pull request achieves this by introducing two main changes:

- It cancels existing future execution runs of the `DueDateChecker` when scheduling another, to avoid many scheduled executions of the checker at any time
- It provides a thread-safe way to schedule, execute, and cancel the executions of the `DueDateChecker` - this ensures that the bug is also fixed when the async due date checker is enabled

> [!NOTE]
> Cancellation in concurrent (or async) scheduling/execution does not guarantee to prevent execution.

This pull request also:

- Implements the `onClose` and `onFailure` lifecycle events for the `DueDateChecker`
- Refactors the `DueDateChecker` for understandability and adds a lot of documentation
- Introduces an `AtomicUtil` class to allow thread-safe, lock-free replacement of an `AtomicReference`'s value
- Schedules the initial execution of the `DueDateChecker` after 100ms instead of 0ms

## Related issues

<!-- Which issues are closed by this PR or are related -->

This PR:

resolves #17128
resolves #10541

Not sure exactly, but could:

resolves #16884
resolves #8991
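The `AtomicUtil` class itself is not shown in this thread. A common shape for such a lock-free replace helper is a compare-and-set retry loop; the following is an assumption about its design, not the actual class from the PR:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Assumed sketch of a lock-free "replace" helper: compute a replacement from
// the current value and retry the compare-and-set until no concurrent writer
// interferes in between. Returns the value that was replaced.
final class AtomicUtilSketch {
    static <T> T replace(final AtomicReference<T> ref,
                         final UnaryOperator<T> replacer) {
        while (true) {
            final T current = ref.get();
            final T replacement = replacer.apply(current);
            if (ref.compareAndSet(current, replacement)) {
                // e.g. the previously scheduled checker run, which the caller
                // can now cancel without racing another scheduling thread.
                return current;
            }
            // Another thread changed the reference in between; retry.
        }
    }
}
```

For the checker, such a reference could hold the currently scheduled run: scheduling a new one swaps it in atomically, and whichever run was swapped out is the one to cancel. As the PR's note says, cancellation still does not guarantee the old run never executes.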
Describe the bug
We deployed the following process:
There were two bugs in the process: an endless recursion that keeps creating new process instances, and a timer with the expression

```
= date and time(today(), time("16:00:00@Europe/Berlin")) - now()
```

which causes a negative duration after the given time is reached.

When we deploy the model, we endlessly create new process instances and fill the timer column. When the specific time is reached, we have to trigger a lot of timers - possibly millions, depending on how long the recursion was running and how many instances were initially started. The recursion continues and adds more and more due timers to the timer column.
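The negative-duration effect of the expression above can be sketched with plain `java.time` (names here are illustrative, not taken from the process model):

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Sketch of the second bug: "today at 16:00 minus now" flips sign once the
// clock passes 16:00, so the timer becomes immediately due and the recursion
// spins without any delay between iterations.
public class NegativeDurationSketch {
    static Duration delayUntilFourPm(final ZonedDateTime now) {
        final ZonedDateTime target = now.toLocalDate()
            .atTime(LocalTime.of(16, 0))
            .atZone(now.getZone());
        return Duration.between(now, target);
    }

    public static void main(String[] args) {
        final ZoneId berlin = ZoneId.of("Europe/Berlin");
        // Before 16:00 Berlin time the delay is positive...
        System.out.println(delayUntilFourPm(
            ZonedDateTime.now(berlin).with(LocalTime.of(10, 0))).isNegative());
        // ...after 16:00 it is negative, i.e. the timer fires right away.
        System.out.println(delayUntilFourPm(
            ZonedDateTime.now(berlin).with(LocalTime.of(17, 0))).isNegative());
    }
}
```

Combined with the recursion, this means every new instance created after 16:00 schedules a timer that is already due, which is how the timer column fills without bound.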
The problem here is that processing is completely blocked, since the timer checker runs in the same actor. Furthermore, it produces an endless backlog which the processor is not able to catch up with. The processor becomes unhealthy, and back pressure goes up to 100% since requests are not answered.
To Reproduce
Deploy this model
gamedayprocess.bpmn.txt and start new instances. Watch out: you can throw your cluster away afterwards.
Expected behavior
As a user, I should not be able to completely f** up my cluster, or at least there should be some way to recover from this.