fix(cluster): do not use channel created for a different address #12305
Conversation
Test Results

952 files (-101)   952 suites (-101)   1h 7m 12s ⏱️ (-44m 52s)

Results for commit 7c7dccc. ± Comparison against base commit 8b3bdc9.

This pull request removes 1213 and adds 449 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
When only the IP is used to find a channel pool, we observed that a channel pool created for one node can be re-used by another node that got its old IP. This could lead to unexpected situations where messages are sent to the wrong node. To fix this, we now use both the address and the IP to find a channel pool. In a setup where restarts and IP re-assignment are common, this is safer.
If a new message is sent while the channel is being closed, it will use the existing channel and the test fails. So wait until the channel is closed before sending the new message.
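For context, a minimal sketch of what waiting for the close could look like in a test, assuming a Netty Channel is at hand; the names and structure are illustrative, not the PR's actual test code:

```java
import io.netty.channel.Channel;

final class ChannelTestUtil {
  // Hypothetical helper: block until the channel is fully closed before the
  // test sends the next message, so the pool cannot hand back the closing channel.
  static void awaitClosed(final Channel channel) throws InterruptedException {
    channel.close().syncUninterruptibly(); // request the close
    channel.closeFuture().await();         // returns only once the close has completed
  }
}
```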
Force-pushed from a36cd30 to 675b006
Weird that the backport flagged a checkstyle issue 🤔
👍 Sorry for the slow review
- ❓ I'm a little bit off topic again, but is there a risk we leave some garbage in the channel pool now? I know we remove the channel from the pool once it's closed, but I honestly can't tell if we can guarantee that channels are always closed, or that the listener is always called. I think this is why Netty has the IdleStateHandler you can add to the pipeline 🤷 Not a blocker, as this may not be a new problem anyway 😄
- 🔧 I'm not super keen on having to create a tuple every time we want to get a channel, so on every request. At the same time, I don't know if that really has an impact, so 🤷 Maybe interesting to check TLAB allocations one day via Pyroscope.
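For reference, this is roughly how Netty's IdleStateHandler is usually wired into a pipeline to reap idle channels; the initializer, handler, and timeout values below are illustrative examples, not code from this PR:

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

final class IdleAwareInitializer extends ChannelInitializer<SocketChannel> {
  @Override
  protected void initChannel(final SocketChannel ch) {
    // Fire an IdleStateEvent if nothing was read or written for 60 seconds
    ch.pipeline().addLast(new IdleStateHandler(0, 0, 60));
    ch.pipeline().addLast(new ChannelDuplexHandler() {
      @Override
      public void userEventTriggered(final ChannelHandlerContext ctx, final Object evt)
          throws Exception {
        if (evt instanceof IdleStateEvent && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
          // Closing the channel lets the pool's close listener evict the stale entry
          ctx.close();
        } else {
          super.userEventTriggered(ctx, evt);
        }
      }
    });
  }
}
```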
final List<CompletableFuture<Channel>> channelPool = getChannelPool(address);
final InetAddress inetAddress = address.address();
if (inetAddress == null) {
  final CompletableFuture<Channel> failedFuture = new OrderedFuture<>();
❓ I always wonder why we use OrderedFuture 🤔
I'm not sure. This part of the code used OrderedFuture, so I used it for consistency.
Is it maybe because we want to execute all the "listeners" which listen for the future completion in the order in which they have been registered? 🤔 And the normal CompletableFuture doesn't guarantee this? 🤔
Technically yes, but I never understood why this is such a big advantage 🤷 We could likely write everything without this constraint
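To make the ordering point concrete: with a plain CompletableFuture, dependent actions registered before completion are not guaranteed to run in registration order (current JDKs tend to run them in reverse), which is presumably the guarantee OrderedFuture adds. A minimal, self-contained illustration:

```java
import java.util.concurrent.CompletableFuture;

public final class CallbackOrderDemo {
  public static void main(final String[] args) {
    final CompletableFuture<String> future = new CompletableFuture<>();
    // Register three dependent actions before the future is completed.
    future.thenAccept(v -> System.out.println("listener 1: " + v));
    future.thenAccept(v -> System.out.println("listener 2: " + v));
    future.thenAccept(v -> System.out.println("listener 3: " + v));
    // CompletableFuture makes no guarantee about the order in which the
    // listeners above run; in practice this typically prints 3, 2, 1.
    future.complete("done");
  }
}
```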
Setup

Deployed to measurement-4671156218

camunda-platform:
  zeebe:
    image:
      repository: gcr.io/zeebe-io/zeebe
      tag: dd-12173-channel-pool-benchmark-7c7dccc
  zeebe-gateway:
    image:
      repository: gcr.io/zeebe-io/zeebe
      tag: dd-12173-channel-pool-benchmark-7c7dccc
  global:
    image:
      tag: dd-12173-channel-pool-benchmark-7c7dccc

Measurement before

Process Instance Execution Time: p99=1.421 p90=0.317 p50=0.081

Chaos injection

Deployed chaos network-latency-5

Measurement after

Process Instance Execution Time: p99=4.078 p90=2.180 p50=0.857

Details

See https://github.com/camunda/zeebe/actions/runs/4671156218
If the channel is never closed, it is not removed from the pool. But I don't think that happens very often, right?
I would assume it doesn't have a significant impact. The benchmark also doesn't show any impact.
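As a sketch of the mechanism being referred to above, i.e. removing a pooled channel only once its close future completes; the names here are hypothetical, not the actual Zeebe/Atomix pool code:

```java
import io.netty.channel.Channel;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class PoolEviction {
  // Hypothetical helper: once the channel closes, drop its future from the pool
  // so the slot can be repopulated on the next request.
  static void evictOnClose(
      final List<CompletableFuture<Channel>> channelPool,
      final CompletableFuture<Channel> entry,
      final Channel channel) {
    channel.closeFuture().addListener(ignored -> {
      synchronized (channelPool) {
        channelPool.remove(entry);
      }
    });
  }
}
```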
bors merge
Build succeeded:
Successfully created backport PR for |
Successfully created backport PR for |
Successfully created backport PR for |
Description
When only the IP is used to find a channel pool, we observed that a channel pool created for one node can be re-used by another node that got its old IP. This could lead to unexpected situations where messages are sent to the wrong node.
To fix this, we now use both the address and the IP to find a channel pool. In a setup where restarts and IP re-assignment are common, this is safer.
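Conceptually, the fix amounts to keying the pool on both the configured address and the resolved IP rather than on the IP alone. A minimal sketch of such a composite key, using hypothetical names (the PR's actual types and pool layout differ):

```java
import java.net.InetAddress;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class ChannelPools<C> {
  // Hypothetical composite key (Java 16+ record): the node's configured address
  // plus the concrete IP it currently resolves to.
  record PoolKey(String host, int port, InetAddress resolvedIp) {}

  private final Map<PoolKey, List<CompletableFuture<C>>> pools = new ConcurrentHashMap<>();

  List<CompletableFuture<C>> getChannelPool(final String host, final int port, final InetAddress ip) {
    return pools.computeIfAbsent(new PoolKey(host, port, ip), key -> new CopyOnWriteArrayList<>());
  }
}
```

With such a key, a node that inherits another node's old IP no longer resolves to the old pool, because its configured address differs.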
Related issues
closes #12173
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
Backports are created by adding the corresponding labels (e.g. backport stable/1.3) to the PR; in case that fails you need to create backports manually.
Testing:
Documentation:
Other teams:
If the change impacts another team, an issue has been created for this team, explaining what they need to do to support this change.
Please refer to our review guidelines.