
fix(cluster): do not use channel created for a different address #12305

Merged
5 commits merged into main from dd-12173-channel-pool
Apr 12, 2023

Conversation

deepthidevaki
Contributor

Description

When using only the IP to find a channel pool, we observed that a channel pool created for one node was re-used by another node that got its old IP. This could lead to unexpected situations where messages are sent to the wrong node.

To fix this, we now use both the address (hostname) and the IP to find a channel pool. In a setup where restarts and IP re-assignment are common, this is the safer approach.
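A minimal sketch of the idea, using hypothetical `PoolKey` and `ChannelPools` names (the real Zeebe types differ): the pool lookup key combines the configured hostname with the resolved IP instead of using the IP alone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the actual Zeebe code): key the channel pool by
// both the configured address (hostname) and the resolved IP, so a pool
// created for one node is never reused by another node that inherits its IP.
record PoolKey(String hostname, String resolvedIp) {}

class ChannelPools {
  private final Map<PoolKey, String> pools = new ConcurrentHashMap<>();

  // Returns the pool for this (hostname, IP) pair, creating it on first use.
  String getOrCreate(final String hostname, final String resolvedIp) {
    return pools.computeIfAbsent(
        new PoolKey(hostname, resolvedIp),
        key -> "pool-for-" + key.hostname() + "@" + key.resolvedIp());
  }
}
```

With IP-only keying, `node-2` inheriting `node-1`'s old IP would silently receive `node-1`'s pooled channels; with the composite key, the two nodes get distinct pools even while sharing an IP.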

Related issues

closes #12173

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Please refer to our review guidelines.

@github-actions
Contributor

github-actions bot commented Apr 6, 2023

Test Results

952 files (-101), 952 suites (-101), duration 1h 7m 12s ⏱️ (-44m 52s)
7 758 tests (-764): 7 742 passed ✔️ (-758), 16 skipped 💤 (-6), 0 failed (±0)
7 962 runs (-768): 7 946 passed ✔️ (-762), 16 skipped 💤 (-6), 0 failed (±0)

Results for commit 7c7dccc. ± Comparison against base commit 8b3bdc9.

This pull request removes 1213 and adds 449 tests. Note that renamed tests count towards both.
DmnEvaluationTest If successfully evaluated, the output ‑ Should return a message pack output[6] value={y=true, z=[1, 2, 3], x=1}
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[56] collection=[Left[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[57] collection=[Right[[B@2c04fea6], Left[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[58] collection=[Left[[B@2c04fea6], Right[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[59] collection=[Left[[B@2c04fea6], Left[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[60] collection=[Right[[B@2c04fea6], Right[[B@2c04fea6], Left[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[61] collection=[Right[[B@2c04fea6], Left[[B@2c04fea6], Right[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[62] collection=[Right[[B@2c04fea6], Left[[B@2c04fea6], Left[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[63] collection=[Left[[B@2c04fea6], Right[[B@2c04fea6], Right[[B@2c04fea6]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[64] collection=[Left[[B@2c04fea6], Right[[B@2c04fea6], Left[[B@2c04fea6]]
…
DmnEvaluationTest If successfully evaluated, the output ‑ Should return a message pack output[6] value={x=1, z=[1, 2, 3], y=true}
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[56] collection=[Left[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[57] collection=[Right[[B@2b383625], Left[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[58] collection=[Left[[B@2b383625], Right[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[59] collection=[Left[[B@2b383625], Left[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[60] collection=[Right[[B@2b383625], Right[[B@2b383625], Left[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[61] collection=[Right[[B@2b383625], Left[[B@2b383625], Right[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[62] collection=[Right[[B@2b383625], Left[[B@2b383625], Left[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[63] collection=[Left[[B@2b383625], Right[[B@2b383625], Right[[B@2b383625]]
EitherTest Streams of Eithers can be collected using .collector() ‑ Only Streams with Lefts are collected into a Left[64] collection=[Left[[B@2b383625], Right[[B@2b383625], Left[[B@2b383625]]
…

♻️ This comment has been updated with latest results.

If a new message is sent while the channel is being closed, then it
will use the existing channel and the test fails. So wait until it is
closed before sending the new message.
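The wait-for-close fix in that commit can be sketched as follows; `CloseAwareSender` and its return value are hypothetical stand-ins for the real test code, which waits on the channel's close future before sending the next message.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the test fix: wait for the old channel's close
// future before sending, so the send cannot race with a channel that is
// still being closed and silently reuse it.
class CloseAwareSender {
  static String sendAfterClose(final CompletableFuture<Void> closeFuture, final String message) {
    closeFuture.join(); // blocks until the channel is fully closed
    return "sent-on-new-channel:" + message;
  }
}
```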
@deepthidevaki deepthidevaki force-pushed the dd-12173-channel-pool branch from a36cd30 to 675b006 Compare April 6, 2023 08:13
@npepinpe
Member

npepinpe commented Apr 6, 2023

Weird that the backport flagged a checkstyle issue 🤔

@deepthidevaki deepthidevaki marked this pull request as ready for review April 6, 2023 09:54
@deepthidevaki deepthidevaki requested a review from npepinpe April 6, 2023 09:54
Member

@npepinpe npepinpe left a comment


👍 Sorry for the slow review

  • ❓ I'm a little bit out of the topic again, but is there a risk we leave some garbage in the channel pool now? I know we remove the channel from the pool once it's closed, but I honestly can't tell if we can guarantee that channels are always closed, or that the listener is always called. I think this is why Netty has the IdleStateHandler you can add to the pipeline 🤷 Not a blocker, as this may not be a new problem anyway 😄
  • 🔧 I'm not super keen on having to create a tuple every time we want to get a channel, so on every request. At the same time, I don't know if that really has some impact, so 🤷 Maybe interesting to look at TLAB allocs one day via Pyroscope.

final List<CompletableFuture<Channel>> channelPool = getChannelPool(address);
final InetAddress inetAddress = address.address();
if (inetAddress == null) {
final CompletableFuture<Channel> failedFuture = new OrderedFuture<>();
Member


❓ I always wonder why we use OrderedFuture 🤔

Contributor Author


I'm not sure. This part of the code used OrderedFuture, so I used it for consistency.

Member


Is it maybe because we want to execute all the "listeners" which listen for the future completion in the order in which they were registered? 🤔 And the normal CompletableFuture doesn't guarantee this? 🤔

Member


Technically yes, but I never understood why this is such a big advantage 🤷 We could likely write everything without this constraint
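The ordering guarantee under discussion can be sketched roughly like this; `OrderedFutureSketch` is a hypothetical illustration, not Atomix's actual `OrderedFuture`. Plain `CompletableFuture` does not document an invocation order for callbacks attached to the same stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of an "ordered" future: completion listeners are
// invoked strictly in the order they were registered.
class OrderedFutureSketch<T> {
  private final List<Consumer<T>> listeners = new ArrayList<>();
  private T value;
  private boolean completed;

  synchronized void onComplete(final Consumer<T> listener) {
    if (completed) {
      listener.accept(value); // already done: call back immediately
    } else {
      listeners.add(listener); // remember in registration order
    }
  }

  synchronized void complete(final T result) {
    value = result;
    completed = true;
    for (final Consumer<T> listener : listeners) {
      listener.accept(result); // fired in the order they were registered
    }
  }
}
```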

@github-actions
Contributor

Setup

Deployed to measurement-4671156218

camunda-platform:
  zeebe:
    image:
      repository: gcr.io/zeebe-io/zeebe
      tag: dd-12173-channel-pool-benchmark-7c7dccc
  zeebe-gateway:
    image:
      repository: gcr.io/zeebe-io/zeebe
      tag: dd-12173-channel-pool-benchmark-7c7dccc
global:
  image:
    tag: dd-12173-channel-pool-benchmark-7c7dccc

Measurement before

Process Instance Execution Time: p99=1.421 p90=0.317 p50=0.081
Throughput: 150.246 PI/s
Grafana

Chaos injection

Deployed chaos network-latency-5

Measurement after

Process Instance Execution Time: p99=4.078 p90=2.180 p50=0.857
Throughput: 74.906 PI/s
Grafana

Details

See https://github.com/camunda/zeebe/actions/runs/4671156218

@deepthidevaki
Contributor Author

  • ❓ I'm a little bit out of the topic again, but is there a risk we leave some garbage in the channel pool now? I know we remove the channel from the pool once it's closed, but I honestly can't tell if we can guarantee that channels are always closed, or that the listener is always called. I think this is why Netty has the IdleStateHandler you can add to the pipeline 🤷 Not a blocker, as this may not be a new problem anyway 😄

If the channel is never closed, it is not removed from the pool. But I don't think that will happen very often, right?

  • 🔧 I'm not super keen on having to create a tuple every time we want to get a channel, so on every request. At the same time, I don't know if that really has some impact, so 🤷 Maybe interesting to look at TLAB allocs one day via Pyroscope.

I would assume it doesn't have a significant impact. The benchmark also doesn't show any impact.

@deepthidevaki
Contributor Author

bors merge

@ghost

ghost commented Apr 12, 2023

Build succeeded:

@ghost ghost merged commit 82d4d91 into main Apr 12, 2023
@ghost ghost deleted the dd-12173-channel-pool branch April 12, 2023 08:14
@backport-action
Collaborator

Successfully created backport PR for stable/8.0:

@backport-action
Collaborator

Successfully created backport PR for stable/8.1:

@backport-action
Collaborator

Successfully created backport PR for stable/8.2:

ghost pushed a commit that referenced this pull request Apr 12, 2023
12378: [Backport stable/8.2] fix(cluster): do not use channel created for a different address r=deepthidevaki a=backport-action

# Description
Backport of #12305 to `stable/8.2`.

relates to #12173

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
ghost pushed a commit that referenced this pull request Apr 13, 2023
12377: [Backport stable/8.1] fix(cluster): do not use channel created for a different address r=deepthidevaki a=backport-action

# Description
Backport of #12305 to `stable/8.1`.

relates to #12173

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
ghost pushed a commit that referenced this pull request Apr 14, 2023
12376: [Backport stable/8.0] fix(cluster): do not use channel created for a different address r=deepthidevaki a=backport-action

# Description
Backport of #12305 to `stable/8.0`.

relates to #12173

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@megglos megglos added the version:8.2.2 Marks an issue as being completely or in parts released in 8.2.2 label Apr 17, 2023
@lenaschoenburg lenaschoenburg added the version:8.1.11 Marks an issue as being completely or in parts released in 8.1.11 label Apr 21, 2023
This pull request was closed.
Labels
version:8.1.11 Marks an issue as being completely or in parts released in 8.1.11 version:8.2.2 Marks an issue as being completely or in parts released in 8.2.2
Development

Successfully merging this pull request may close these issues.

Zeebe node sends messages to wrong node
7 participants