Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in deploying large payloads #13233

Closed
Tracked by #11456
korthout opened this issue Jun 29, 2023 · 8 comments · Fixed by #13505
Closed
Tracked by #11456

Regression in deploying large payloads #13233

korthout opened this issue Jun 29, 2023 · 8 comments · Fixed by #13505
Assignees
Labels
component/engine good first issue Marks an issue as simple enough for first time contributors kind/bug Categorizes an issue or PR as a bug onboarding regression This issue is a recession from previous functionality scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.1.14 Marks an issue as being completely or in parts released in 8.1.14 version:8.2.8 Marks an issue as being completely or in parts released in 8.2.8 version:8.3.0-alpha4 Marks an issue as being completely or in parts released in 8.3.0-alpha4 version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0

Comments

@korthout
Copy link
Member

korthout commented Jun 29, 2023

We've introduced a regression on deploying large payloads on multi-partition clusters with

Before that pull request, a deployed resource was written in two follow-up events:

  • Deployment:CREATED
  • Process:CREATED (or DecisionRequirements:CREATED depending on resource)

The specialized DeploymentDistribution would take the resource from the Deployment:CREATED to distribute it to the other partitions.

With #11456, a new event CommandDistribution:STARTED was introduced to store the command for distribution. For deployments, this contains the entire Deployment incl. the resource. This event is only appended on multi-partition clusters. But, when it is appended, it further reduces the maximum payload size because the follow-up events reach the MAX_BATCH_SIZE restriction with a lower payload size.

Note The MAX_BATCH_SIZE is a limitation that originated from the MAX_MESSAGE_SIZE configuration setting but nowadays is only defined by the LogStreamBuilder's MAX_FRAGMENT_SIZE: 4MB.

This regression lowers the maximum payload size of deployments from ~2MB down to ~1.4MB.

The regression does not exist on single partition clusters as these do not require the distribution of the deployment.


Suggested solution:

  • don't write the resource in the Deployment:CREATED event, only in the Process:CREATED and the DecisionRequirements:CREATED events (estimate: x-small)

Of course, we'd need to inform users that we're no longer writing the resource for that event in the Update Guide. As this would be breaking user space. The documentation should clarify that the resource is available in the Process:CREATED and the DecisionRequirements:CREATED instead.

@korthout korthout added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/high Marks a bug as having a noticeable impact on the user with no known workaround component/engine regression This issue is a recession from previous functionality labels Jun 29, 2023
@korthout korthout self-assigned this Jun 29, 2023
@korthout
Copy link
Member Author

korthout commented Jun 29, 2023

There are several solution ideas:

  • Allow writing arbitrarily large Raft entries/batches #11513
    • Does not seem trivial to implement
    • Mostly a ZDP effort
  • don't write the resource in the Deployment:CREATED event, only in the Process:CREATED and the DecisionRequirements:CREATED events (estimate: x-small)
    • would require alignment with Operate and Optimize to make sure they consume the resource from Process:CREATED

@korthout
Copy link
Member Author

@sdorokhova Can you verify for me that Operate consumes the Process:CREATED event to get the BPMN model and not the Deployment:CREATED event? Likewise, does it consume the DecisionRequirements:CREATED event for DMN models?

@RomanJRW Can you verify the same as above, but for Optimize?

@sdorokhova
Copy link
Contributor

Hi @korthout ,
we read processes from process index and decision requirements from decision-requirements index. You can check here all the indices we're reading from. Is this what you were asking about?

@korthout
Copy link
Member Author

@sdorokhova Perfect! That's exactly what I mean. I should've mentioned that it concerns the ES indices zeebe-record-process and zeebe-record-decision-requirements.

I was hoping you weren't reading the BPMN/DMN resources from the zeebe-record-deployment index.

ghost pushed a commit that referenced this issue Jun 30, 2023
13239: Deployment payload size regression tests r=korthout a=korthout

## Description

<!-- Please explain the changes you made here. -->
Adds a regression test for the maximum deployment payload size currently possible on a 3-partition cluster.

This also cleans up the other tests related to the size of the deployment's payload in an effort to clarify what they truly verify:
- `CreateLargeDeploymentTest.shouldRejectDeployIfResourceIsTooLarge`
  - tests that the incoming request's maxMessageSize is tested
  - is a regression test against #12591
- `DeploymentRejectionTest.shouldRejectDeploymentIfResourceIsTooLarge`
  - verifies that the deployment command is rejected when its resources fit the max message size but still exceed the batch record size (due to follow-up records)
- `CreateDeploymentTest.shouldRejectDeployIfResourceIsTooLarge`
  - verifies that the client's request is rejected when its resources fit the max message size but still exceed the batch record size (due to follow-up records)

> **Note** This pull request explicitly targets `stable/8.2` as all these tests succeed there, but some will not succeed on `main` due to #13233.

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #13233 



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@megglos
Copy link
Contributor

megglos commented Jul 3, 2023

@korthout do I understand this correctly that this regression is only present in 8.3? And the changes that happened in 8.2 like #11660 have not caused a regression?

@korthout
Copy link
Member Author

korthout commented Jul 3, 2023

@megglos Correct, the regression only exists on main and if left unfixed will only be present in the upcoming 8.3. The reason is that deployment distribution has been switched over only for 8.3(-alpha*), not for 8.2.

@ChrisKujawa ChrisKujawa added the version:8.2.8 Marks an issue as being completely or in parts released in 8.2.8 label Jul 3, 2023
korthout added a commit that referenced this issue Jul 3, 2023
This test should no longer be ignored when the regression is fixed.

See: #13233
korthout added a commit that referenced this issue Jul 4, 2023
This test should no longer be ignored when the regression is fixed.

See: #13233
ghost pushed a commit that referenced this issue Jul 4, 2023
13291: [Backport stable/8.1] Deployment payload size regression tests r=remcowesterhoud a=korthout

## Description

<!-- Link to the PR that is back ported -->

Backport of 
- #13239

There were conflicts, please see the commit messages.
I also had to adjust a test case for this specific branch.

## Related issues

<!-- Link to the related issues of the origin PR -->
relates to #13233


Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
ghost pushed a commit that referenced this issue Jul 4, 2023
13292: [Backport stable/8.0] Deployment payload size regression tests r=remcowesterhoud a=korthout

## Description

<!-- Link to the PR that is back ported -->

Backport of 
- #13239 

There were conflicts, please see the commit messages.

This backport misses the commit bdcb0d9 because the file `too_large_process.bpmn` did not exist on this branch.

I also had to ignore / remove some test cases because the related bug was not fixed / feature was not implemented on this branch.

And I had to adjust a test case for this specific branch.

## Related issues

<!-- Link to the related issues of the origin PR -->

relates to #13233

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
ghost pushed a commit that referenced this issue Jul 4, 2023
13290: [Forwardport main] Deployment payload size regression tests r=korthout a=korthout

## Description

<!-- Link to the PR that is back ported -->

Forwardport of 
- #13239

There were conflicts, please see the commit messages.

## Related issues

<!-- Link to the related issues of the origin PR -->

relates to #13233


Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@ChrisKujawa ChrisKujawa added the version:8.1.14 Marks an issue as being completely or in parts released in 8.1.14 label Jul 6, 2023
@RomanJRW
Copy link
Contributor

Hey @korthout - apologies for slow response, I have had FTO. I can confirm the same for Optimize, that we read from zeebe-record-process and not the deployment index

@korthout
Copy link
Member Author

Now that both Operate and Optimize have confirmed that they don't access the resource from the zeebe-record-deployment index, I think it's fine to take this solution:

  • don't write the resource in the Deployment:CREATED event, only in the Process:CREATED and the DecisionRequirements:CREATED events (estimate: x-small)

Of course, we'd need to inform users that we're no longer writing the resource for that event in the Update Guide. As this would be breaking user space. The documentation should clarify that the resource is available in the Process:CREATED and the DecisionRequirements:CREATED instead.

@korthout korthout added good first issue Marks an issue as simple enough for first time contributors onboarding labels Jul 11, 2023
@nicpuppa nicpuppa self-assigned this Jul 13, 2023
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/engine good first issue Marks an issue as simple enough for first time contributors kind/bug Categorizes an issue or PR as a bug onboarding regression This issue is a recession from previous functionality scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.1.14 Marks an issue as being completely or in parts released in 8.1.14 version:8.2.8 Marks an issue as being completely or in parts released in 8.2.8 version:8.3.0-alpha4 Marks an issue as being completely or in parts released in 8.3.0-alpha4 version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants