Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process instance no longer subscribed to message after unsuccessful process instance migration #25166

Closed
korthout opened this issue Nov 25, 2024 · 5 comments
Assignees
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug likelihood/high A recurring issue severity/high Marks a bug as having a noticeable impact on the user with no known workaround support Marks an issue as related to a customer support request version:8.3.18 version:8.4.14 version:8.5.10 version:8.6.6 version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2

Comments

@korthout
Copy link
Member

korthout commented Nov 25, 2024

Describe the bug

We've discovered a critical bug that can delete the ProcessMessageSubscription and MessageSubscription during Process Instance Migration, even though the migration is rejected. This should not have any affects and all changes are supposed to have been rolled back.

https://jira.camunda.com/browse/SUPPORT-24591

To Reproduce

  • Create an instance of a process that subscribes to a message catch event
  • Migrate the process instance to a process with the same message catch event, without mapping the message catch event
  • Notice that the process instance migration is rejected
    • "Expected to migrate process instance '2251799813685253' but active element with id 'Activity_04uuugc' attempts to subscribe to a message it is already subscribed to with name 'PutOnHold'. Migrating active elements that subscribe to a message they are already subscribed to is not possible yet. Please provide a mapping instruction to message catch event with id 'Event_0qc5a9d' to migrate the respective message subscription."
  • After at least 30 seconds, publish a message
  • Notice that the message is not correlated

To recover: retry the process instance migration after waiting at least 30 seconds. This will automatically recreate the deleted message subscription as documented:

Add catch events: if a catch event of the target process is not the target of a mapping instruction, then a new subscription is opened during migration.

Expected behavior

When the migration is rejected, the process instance should not have been changed.

Log/Stacktrace

Full Stacktrace

..
13:35:16 C PI_MIGRATION MIGRATE        - #27-> -1 K04 - ProcessInstanceMigrationRecord {"targetProcessDefinitionKey":2251799813685251,"mappingInstructions":[{"sourceElementId":"A","targetElementId":"A"}],"processInstanceKey":2251799813685252}
13:35:16 R PI_MIGRATION MIGRATE        - #28->#27 K04 - ProcessInstanceMigrationRecord {"targetProcessDefinitionKey":2251799813685251,"mappingInstructions":[{"sourceElementId":"A","targetElementId":"A"}],"processInstanceKey":2251799813685252} !INVALID_STATE (Expected to migrate process instance '2251799813685252' but active element with id 'A' attempts to subscribe to a message it is already subscribed to with name 'message'. Migrating active elements that subscribe to a message they are already subscribed to is not possible yet. Please provide a mapping instruction to message catch event with id 'boundary' to migrate the respective message subscription.)
13:35:46 C MSG_SUB      DELETE         - #29-> -1  -1 - "message" (inter.) @[K09] in <process ?[K04]> (no vars)
13:35:46 E MSG_SUB      DELETED        - #30->#29 K12 - "message" (inter.) correlationKey: key @[K09] in <process "process"[K04]> (no vars)
13:35:46 C PROC_MSG_SUB DELETE         - #31->#29  -1 - "message" (inter.) @[K09] in <process ?[K04]> (no vars) (tenant: <default>)
13:35:46 E PROC_MSG_SUB DELETED        - #32->#29 K10 - "message" (inter.) correlationKey: key @[K09] in <process "process"[K04]> (no vars) (tenant: <default>)

Environment:

@korthout korthout added kind/bug Categorizes an issue or PR as a bug severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround component/zeebe Related to the Zeebe component/team labels Nov 25, 2024
@npepinpe npepinpe added the support Marks an issue as related to a customer support request label Nov 25, 2024
@korthout korthout assigned korthout and berkaycanbc and unassigned korthout Nov 25, 2024
@korthout
Copy link
Member Author

@berkaycanbc and I investigated this issue and have the following findings:

  • the message subscription is DELETE command is written to the log some 30 seconds after the migration was rejected
  • the PendingMessageSubscriptionChecker sends the DELETE command and runs every 30 seconds
  • it sends the DELETE command because it finds the message subscription in the pending state
  • it finds it in the transient state because it is not reverted on rejection

One solution could be to remove the pending subscription from this transient state when rejecting the instance migration (and reverting changes). However, that would mean that the checker would be able to encounter the pending subscription before it may still be reverted.

Instead, it would be better to avoid adding the pending subscription to the transient state until the entire command has been processed successfully.

We could try to achieve this using the stream processor's side effects, which are only applied after a command was successfully processed.

@korthout
Copy link
Member Author

korthout commented Nov 25, 2024

@berkaycanbc and I have been making progress on this issue (branch: korthout-berkay-25166-transient-message-state-not-transactional). We've found a solution to the problem by using a side-effect to make the transient state change.

We've verified that this works with our reproducing test case, however this test is slow and not maintainable, so we will open a PR as soon as we have one that is. We also need to resolve the issue on subscribing to new messages as this could also produce wrong results (although less impactful).

@lenaschoenburg
Copy link
Member

@korthout Since it's possible to recover from this (at least that's how I understood your "To recover" instructions), I'd argue that this doesn't qualify as critical.

@korthout korthout added severity/high Marks a bug as having a noticeable impact on the user with no known workaround likelihood/high A recurring issue and removed severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround labels Nov 25, 2024
@korthout
Copy link
Member Author

Good point @lenaschoenburg. I considered it critical because of the impact, but the workaround should be reasonable. Lowered to high.

github-merge-queue bot pushed a commit that referenced this issue Dec 2, 2024
## Description

<!-- Describe the goal and purpose of this PR. -->

This PR will enable the transient state in
`ProcessMessageSubscriptionState` to be updated as a side effect.

### Why do we need this change?

This resolves a critical bug, where a process instance is unsubscribed
from a message catch event during instance migration even though the
migration was rejected by throwing an exception.

When the exception is thrown, we expect all state changes to be rolled
back. However, the transient state for pending process message
subscriptions was not rolled back. This led to the
`PendingSubscriptionChecker` to visit this pending subscription, even
though the rest of the engine considered it to be rolled back.

The transient state was updated by the
`DbProcessMessageSubscriptionState.updateToClosingState()` which in turn
is called by the `ProcessMessageSubscriptionDeletingApplier`.

By moving the transient state update from the dbstate (applier) to a
side-effect in the `CatchEventBehavior`, we're able to ensure that the
state change only takes effect after the command was successfully
processed (completely).

## Related issues

closes #25166
github-merge-queue bot pushed a commit that referenced this issue Dec 2, 2024
# Description
Backport of #25298 to `main`.

relates to #25166
original author: @berkaycanbc
berkaycanbc added a commit that referenced this issue Dec 2, 2024
…age state as side-effect (#25495)

# Description
Backport of #25476 to `release-8.7.0-alpha2`.

relates to #25298 #25166
original author: @backport-action
github-merge-queue bot pushed a commit that referenced this issue Dec 2, 2024
…as side-effect (#25497)

# Description
Backport of #25476 to `stable/8.5`.

relates to #25298 #25166
original author: @backport-action
@korthout korthout closed this as completed Dec 4, 2024
@korthout korthout added version:8.3.18 version:8.4.14 version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2 labels Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug likelihood/high A recurring issue severity/high Marks a bug as having a noticeable impact on the user with no known workaround support Marks an issue as related to a customer support request version:8.3.18 version:8.4.14 version:8.5.10 version:8.6.6 version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2
Projects
None yet
Development

No branches or pull requests

4 participants