Writing TaskResult in ProcessScheduleService can block complete processing #10526
Comments
Right now I feel it is related to the LogStreamAppender not emptying the dispatcher fast enough, and the ProcessingScheduler not stepping back (it keeps hammering the ActorScheduler with new jobs). I think it is also related to our RetryStrategy changes. In 8.0.6 we still use the submit (queue), see https://github.com/camunda/zeebe/blob/8.0.6/util/src/main/java/io/camunda/zeebe/util/retry/AbortableRetryStrategy.java. Previously we actually yielded, see feefb5f?diff=split#diff-7bc91aefc7fc3a1e2119ce6f34a09e16461f456143a7e2ade9e8d2de3cdfeaa6L43
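To illustrate what "stepping back" could mean for a retrying writer, here is a minimal sketch of a capped exponential backoff. It is not what Zeebe does: `BackoffSketch`, `nextDelayMillis`, and all parameters are hypothetical names, and the sketch assumes a small positive base delay.

```java
/** Hypothetical sketch of a capped exponential backoff; not taken from the Zeebe code base. */
final class BackoffSketch {

  /**
   * Returns the delay to wait before the next retry.
   * The delay doubles with every failed attempt and never exceeds maxMillis.
   */
  static long nextDelayMillis(int failedAttempts, long baseMillis, long maxMillis) {
    // Cap the shift so the multiplication cannot overflow for small base delays.
    final int shift = Math.min(Math.max(failedAttempts, 0), 20);
    final long delay = baseMillis << shift;
    return Math.min(delay, maxMillis);
  }

  public static void main(String[] args) {
    // Print the delays for the first few failed attempts (base 10 ms, cap 1000 ms).
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.println(attempt + " -> " + nextDelayMillis(attempt, 10L, 1_000L) + " ms");
    }
  }
}
```

A retry loop that re-submits its job only after such a delay leaves room for other jobs between attempts, at the cost of a slower recovery once the writer has capacity again.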
I think there are multiple problems right now:
I had a look at the DueDateChecker together with @saig0, because I was wondering why so many tasks are scheduled and why the same batches are written. We found out that a timer creation can cause a new task to be scheduled, even if a task is already scheduled. See https://github.com/camunda/zeebe/blob/main/engine/src/main/java/io/camunda/zeebe/engine/processing/scheduled/DueDateChecker.java#L53, here the [...]. Previously we used the [...]. This issue caused a lot of scheduled timers, especially if we have many timer creations. This in turn caused many identical timer triggers per second, which caused a big backlog and high backpressure. I will run a benchmark to see how it behaves, but I think it can explain the high backpressure in the mixed benchmarks.
10534: Fix timer scheduling r=saig0 a=Zelldon ## Description If we have already scheduled a Task for a created Timer, we shouldn't schedule another Task on the next timer creation. This PR fixes that. The previous behavior could cause severe issues, like an exponentially increasing backlog if a lot of timers are created per second. See #10532 and #10526. ## Related issues closes #10532 Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
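The idea behind that fix can be sketched roughly as follows. This is a simplified illustration under assumptions, not the actual DueDateChecker code: `DueDateGuardSketch`, `RecordingScheduler`, `DueDateGuard`, and their methods are hypothetical names, and the scheduler is a stand-in that only records the requested delays.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual Zeebe DueDateChecker. */
final class DueDateGuardSketch {

  /** Stand-in for a scheduling service: it only records the requested delays. */
  static final class RecordingScheduler {
    final List<Long> scheduledDelays = new ArrayList<>();

    void runDelayed(long delayMillis) {
      scheduledDelays.add(delayMillis);
    }
  }

  /** Schedules a new task only when no task is pending or the new due date is earlier. */
  static final class DueDateGuard {
    private static final long NONE = -1L;

    private final RecordingScheduler scheduler;
    private long nextDueDate = NONE;

    DueDateGuard(RecordingScheduler scheduler) {
      this.scheduler = scheduler;
    }

    /** Called on every timer creation. */
    void schedule(long dueDate, long now) {
      if (nextDueDate == NONE || dueDate < nextDueDate) {
        nextDueDate = dueDate;
        scheduler.runDelayed(Math.max(0L, dueDate - now));
      }
      // Otherwise a task that fires early enough is already pending: do nothing.
    }

    /** Called once the pending task has run, so the next timer creation schedules again. */
    void onTaskExecuted() {
      nextDueDate = NONE;
    }
  }

  public static void main(String[] args) {
    RecordingScheduler scheduler = new RecordingScheduler();
    DueDateGuard guard = new DueDateGuard(scheduler);
    guard.schedule(1_000L, 0L); // nothing pending: schedules
    guard.schedule(2_000L, 0L); // later than the pending due date: skipped
    guard.schedule(1_000L, 0L); // same due date: skipped
    guard.schedule(500L, 0L);   // earlier due date: schedules
    System.out.println(scheduler.scheduledDelays); // prints [1000, 500]
  }
}
```

Without the guard, each of the four timer creations would schedule its own task. A fuller version would probably also cancel the superseded task when an earlier due date arrives; the sketch leaves that out.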
The main problem was fixed with #10534. I created follow-up issues:
I will close this issue then.
10527: deps(maven): bump software.amazon.awssdk:bom from 2.17.282 to 2.17.283 r=oleschoenburg a=dependabot[bot] Bumps [software.amazon.awssdk:bom](https://github.com/aws/aws-sdk-java-v2) from 2.17.282 to 2.17.283.
10557: Do not copy ResultBatch everytime r=Zelldon a=Zelldon ## Description When running shouldPreserveOrderingOfWritesEvenWithRetries, I have seen that the writing (retry loop) can sometimes take up to 3 seconds, where the timeout is 2 seconds. I think this is related to the fact that we always copy again in the loop (which is not necessary). I have also observed this in #10526 (comment). Related to #10458 ## Related issues related to #10458 related to #10526 Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
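The copy-per-retry observation from 10557 can be illustrated with a small sketch that copies the batch once, before the retry loop, and reuses that copy on every attempt. All names here (`RetryCopySketch`, `CopyOnceWriter`, `writeWithRetry`) are hypothetical, and a plain byte array stands in for the ResultBatch.

```java
import java.util.Arrays;
import java.util.function.Predicate;

/** Hypothetical sketch; a byte array stands in for the real ResultBatch. */
final class RetryCopySketch {

  static final class CopyOnceWriter {
    /** Number of copies made so far, tracked only to make the effect visible. */
    int copies = 0;

    private byte[] copy(byte[] batch) {
      copies++;
      return Arrays.copyOf(batch, batch.length);
    }

    /** Copies the batch once, then retries the write with the same copy. */
    boolean writeWithRetry(byte[] batch, Predicate<byte[]> tryWrite, int maxAttempts) {
      final byte[] snapshot = copy(batch); // outside the loop: one copy per batch
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        if (tryWrite.test(snapshot)) {
          return true;
        }
      }
      return false;
    }
  }

  public static void main(String[] args) {
    CopyOnceWriter writer = new CopyOnceWriter();
    int[] attempts = {0};
    // Simulated writer that rejects the first two attempts.
    boolean written = writer.writeWithRetry(new byte[] {1, 2, 3}, b -> ++attempts[0] > 2, 10);
    System.out.println(written + " after " + attempts[0] + " attempts, copies=" + writer.copies);
  }
}
```

Copying inside the loop would cost one copy per attempt, which adds up when the writer is under backpressure and many attempts are needed.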
Describe the bug
It looks like it got worse than what we had before with #8991. We ran our e2e tests against our recent release candidate, which completely destroyed our cluster. It only recovered after the e2e load stopped.
The backpressure went up to 100%, and we see warnings in our logs because the partition became unhealthy. Requests can't be answered and it seems nothing can be processed.
Relatedly, we see in #10519 that the exporter has a really big fastLane queue.
Note that we ran the same test again with 8.0.6 and it looked much better.
The following process is executed (besides some other process).
See https://github.com/camunda/zeebe-e2e-test
What we can see is that the overall processing latency is quite high (which indicates that the time between the dispatcher and reading again is high).
Furthermore, the processing queue is unexpectedly high.

To Reproduce
I was able to reproduce this locally with a StandaloneBroker and a starter, see https://github.com/camunda/zeebe/tree/zell-investigate-timer-loop
When running the branch with the added logs, we can see that it loops while writing the TaskResults.
Expected behavior
We expect that we can handle the scheduled timers without blocking our processing.
Log/Stacktrace
In the log we can see that it is really looping here: https://github.com/camunda/zeebe/blob/zell-investigate-timer-loop/engine/src/main/java/io/camunda/zeebe/streamprocessor/ProcessingScheduleServiceImpl.java#L153
LOG
Environment: