Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Generalized command distribution #11456

Closed
15 of 16 tasks
korthout opened this issue Jan 20, 2023 · 6 comments
Closed
15 of 16 tasks

[EPIC] Generalized command distribution #11456

korthout opened this issue Jan 20, 2023 · 6 comments
Assignees
Labels
area/project Marks an issue as related to project management (e.g. PR templates, editor config, etc.) component/engine kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues version:8.3.0-alpha3 Marks an issue as being completely or in parts released in 8.3.0-alpha3 version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0

Comments

@korthout
Copy link
Member

korthout commented Jan 20, 2023

Description

We know how to generalize the deployment distribution concept to use it for all record value types. The next step is to break this down into technical tasks.

Open questions

  • how can we deal with reaching the maxMessageSize twice as fast?

For deployment, the proposal requires us to write the deployed resources twice in the same record batch, i.e. once in the Process:Created (or DecisionRequirements:Created) events, and once in the RecordDistribution:Started event. That means that we reach the maxMessageSize twice as fast. In other words, users will no longer be able to deploy some of the resources they could before. The team proposes to solve this by doubling the current maxMessageSize. We will discuss this with the Zeebe Distributed Platform team, as they may challenge this.

ZDP has suggested to:

  • introduce a soft limit (MAX_MESSAGE_SIZE) and a hard limit (🤷 arbitrary, perhaps half the segment size) for record batches, where some processors can choose to ignore the soft limit
  • increase the limit for all record batches, and instead produce warnings when record batches exceed the MAX_MESSAGE_SIZE

Which it will be is not yet decided.

  • Can we treat the distributed command as a user command and acknowledge it as a response?
Although this question is unanswered, we've decided to move forward without using this

This question came up while deciding 'can we de-deprecate a method in the InterPartitionCommandSender?': Instead of processing distributed commands differently by adding a post-commit task to Acknowledge the command, we could treat the distributed command like other user commands and respond to it. Instead of sending the response to the gateway, it would write a command to another partition.

Yes, we will de-deprecate the method

Another topic that should be discussed with the Zeebe Distributed Platform (ZDP) team is de-deprecating the method to send a command to another partition where the sending partition determines the record key. This method was deprecated immediately when it was introduced with the new InterPartitionCommandSender API to avoid us from using it for other cases than Deployment Distribution. However, we could not figure out the reason why this method should not be used. Actually, the team sees reasonable use cases where a partition can decide the record key when it sends it to another partition. As the key encodes the partition id it was generated on, it clearly identifies an entity to exist on that specific partition. We should discuss this with the Zeebe Distributed Platform team to understand why this method was deprecated in the first place and push to revise that stance.

After discussing with ZDP, we agreed that it is necessary for Deployments to re-use the key. The deployment must be available on all partitions with the same key. Otherwise, resource deletion would not be possible because we somehow need to refer to a single resource (Deployment). Initially, we thought it must also be available for process creation, but the process key is already part of the ProcessMetadata which is part of the Deployment command that is distributed to the other partitions.

Also part of the decision is that we clearly document the usage of this method. When is it allowed to use it and when not. It must also be clear why the method must exist.

For full details of the discussion, please find the associated (archived) Slack channel #top-zeebe-dedeprecate-send-command-with-key.

Concept and goal

Some entities in Zeebe are cross-cutting partitions. For example, when you deploy a new Process, you want to be able to start Process Instances of that Process on any of the partitions. But the gateway writes the Deployment:Create command only on a single partition when it handles a call to the DeployResource RPC. The engine has to communicate this command to the other partitions (i.e. each partition has its own engine). This communication is unreliable because of the nature of distributed systems. We need a way to communicate commands between the partitions reliably. We call this mechanism command distribution.

Several features require this same mechanism. Instead of building the same logic multiple times, we should generalize the concept to reuse it for all of these. This maintenance task will unblock:

We can achieve this with the following proposed solution:
Introduce a new value type called CommandDistribution, which wraps the command we want to communicate reliably to other partitions.

An initial STARTED intent ensures we store that command in the state (for lookups at retry), and a final FINISHED event can be applied to remove this command from the state.

Then, for every partition, we have intents DISTRIBUTING, ACKKNOWLEDGE, and ACKNOWLEDGED to keep track of the distribution to that specific partition.

On the receiving partition, we write the encapsulated command directly. Its key contains the information that it was distributed (and from which partition), so the engine knows the send the ACKNOWLEDGE back to the distributing partition.

Task breakdown

Tasks

Preview Give feedback
  1. component/engine kind/toil version:8.2.0 version:8.2.0-alpha5
    remcowesterhoud
  2. component/engine kind/feature version:8.2.0
    remcowesterhoud
  3. component/engine kind/toil version:8.2.0
    remcowesterhoud
  4. component/engine kind/toil version:8.2.0
    remcowesterhoud
  5. component/engine kind/toil version:8.2.0
    remcowesterhoud
  6. component/engine kind/toil scope/broker version:8.3.0 version:8.3.0-alpha3
    korthout
  7. component/engine kind/toil version:8.3.0 version:8.3.0-alpha3
    korthout
  8. component/engine kind/toil version:8.3.0 version:8.3.0-alpha3
    korthout remcowesterhoud
  9. component/engine kind/toil version:8.3.0 version:8.3.0-alpha3
    remcowesterhoud
  10. component/engine kind/documentation version:8.3.0 version:8.3.0-alpha3
    remcowesterhoud
  11. component/engine kind/toil version:8.3.0 version:8.3.0-alpha3
    remcowesterhoud
  12. area/ux component/engine component/zeebe kind/toil
  13. area/performance component/engine component/gateway kind/feature
  14. area/reliability component/db component/engine kind/bug severity/high version:8.1.14 version:8.2.8 version:8.3.0 version:8.3.0-alpha3
    korthout
  15. component/engine good first issue kind/bug onboarding regression scope/broker severity/high version:8.1.14 version:8.2.8 version:8.3.0 version:8.3.0-alpha4
    nicpuppa
  16. component/exporter kind/documentation target:8.3
    remcowesterhoud
@korthout korthout added area/project Marks an issue as related to project management (e.g. PR templates, editor config, etc.) component/engine labels Jan 20, 2023
@korthout korthout self-assigned this Jan 20, 2023
@korthout korthout added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Jan 20, 2023
@korthout
Copy link
Member Author

korthout commented Feb 9, 2023

Moving this to blocked while deciding on the open questions. These are expected to be answered at the start of the upcoming week.

@korthout
Copy link
Member Author

Moved this into in progress. Open questions have either been answered or don't block progress.

@korthout korthout changed the title Breakdown generalized record distribution [EPIC] Generalized record distribution Feb 14, 2023
@korthout korthout added kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues and removed kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. labels Feb 14, 2023
@korthout
Copy link
Member Author

Upgrading this issue to EPIC as the breakdown is now clear, and it would be silly to copy the open questions, concept, and tasks breakdown into a new issue just for that.

ghost pushed a commit that referenced this issue Jun 6, 2023
12971: Command distribution improvements r=korthout a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->

These improvements are necessary to distribute commands using the CommandDistributionBehavior.

The commits are extracted from the following pull request to reduce its size and unblock other topics:
- #11962 

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #11456 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@korthout korthout changed the title [EPIC] Generalized record distribution [EPIC] Generalized command distribution Jun 26, 2023
ghost pushed a commit that referenced this issue Jun 28, 2023
13207: Distribute DMN resource deletions r=remcowesterhoud a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
With the (almost) completion of #11456 we are now able to use this upon deleting a resource. As of this time only DMN resource deletion is implemented. This PR does some refactorings to this code and adds the distribution and acknowledgements.

I recommend reviewing this per commit.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #13204 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
ghost pushed a commit that referenced this issue Jun 28, 2023
13207: Distribute DMN resource deletions r=remcowesterhoud a=remcowesterhoud

## Description

<!-- Please explain the changes you made here. -->
With the (almost) completion of #11456 we are now able to use this upon deleting a resource. As of this time only DMN resource deletion is implemented. This PR does some refactorings to this code and adds the distribution and acknowledgements.

I recommend reviewing this per commit.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #13204 



Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@remcowesterhoud
Copy link
Contributor

@korthout do we want to close this epic and leave the remaining issue as a standalone improvement?

@korthout
Copy link
Member Author

I'm investigating a small regression (max deployable resource payload reduced from ~2MB to ~1.4MB on multi-partition clusters) due to

We can keep it separate but might be good just to track it here until completed.

@korthout
Copy link
Member Author

We can track #13233 separately. Let's close this epic! 🎉

@ChrisKujawa ChrisKujawa added the version:8.3.0-alpha3 Marks an issue as being completely or in parts released in 8.3.0-alpha3 label Jul 6, 2023
@megglos megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023
github-merge-queue bot pushed a commit that referenced this issue Aug 9, 2024
* refactor: remove unused event based process endpoints

relates to #11456

* chore: use event batch delete endpoint in e2e cleanup

* chore: fix PMD comments

---------

Co-authored-by: Michał Konopski <michal.konopski@camunda.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/project Marks an issue as related to project management (e.g. PR templates, editor config, etc.) component/engine kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues version:8.3.0-alpha3 Marks an issue as being completely or in parts released in 8.3.0-alpha3 version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0
Projects
None yet
Development

No branches or pull requests

4 participants