Automating with zbchaos fails with timeouts on the gateway when deploying processes #294
Did we try checking network traffic with Wireshark/tshark? What does it look like?
What we (@npepinpe and I) did so far:
🕵️ What I tried the last days:

**Investigate trace logs**

I had a look at it together with @deepthidevaki. We enabled trace logging everywhere, and I was able to see that old IP addresses have been in use for ~20 min. This also explains why other requests on the gateway fail, like job activation for connectors.

Details: The gateway keeps old IP addresses and it takes a while to renew them (BTW, the DNS TTL is 30 min). When the experiment failed I checked the IPs of the broker pods:
We can see in the logs that the gateway is connecting to the wrong IPs for Zeebe 1. Furthermore, I can see in the heap dump of the gateway several old IPs lurking around. I think it is related to that, and if the IP is updated again at around

**Disable some experiments**

I tried to disable the multiple-leader experiment to see whether it runs through, but it is still failing. Interesting is also that I have introduced randomness in choosing the gateway, and even that doesn't resolve the issue, but I guess this makes sense since both gateways will have the same issue.

**Integration tests**

I created a new cluster in SaaS (967418c4-dd62-4230-939b-0597897d8685) and tried to run my integration test against that. In order to do that I need to switch my k8s context and namespace to the created cluster; then I can run the experiment in that context, automated with the integration setup I have created. It was reproduced 💪 Keep in mind that I was not able to reproduce this in self-managed after fixing the issue with the stress experiment. Looks like an issue with the SaaS infra setup 🤔

**Next**

I tried to rerun the experiment, but it looks like my cluster is unhealthy; I need to check whether this is related to my executions. Try to run it again to see whether it is really reproducible, and ping SRE if this is the case. Potentially we could work around this by not always picking the leader of partition one in the chaos experiments for now.
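Since the stale addresses point at JVM-level DNS caching, here is a minimal, hypothetical sketch (not Zeebe code) of how the positive DNS cache TTL can be inspected and capped via the standard `networkaddress.cache.ttl` security property. The hostname is a placeholder.

```java
import java.net.InetAddress;
import java.security.Security;
import java.util.Arrays;

public class DnsCacheTtlCheck {

  public static void main(String[] args) throws Exception {
    // Current JVM-wide positive DNS cache TTL in seconds; unset usually means ~30s,
    // -1 means "cache forever" (the historical default with a SecurityManager).
    System.out.println(
        "networkaddress.cache.ttl = " + Security.getProperty("networkaddress.cache.ttl"));

    // Cap the positive cache at 30 seconds; must be set before the first lookup.
    Security.setProperty("networkaddress.cache.ttl", "30");

    // Placeholder broker hostname: resolve twice with a pause in between; with a
    // short TTL the second call re-queries DNS and picks up a changed pod IP.
    final String host =
        args.length > 0 ? args[0] : "zeebe-1.zeebe-broker-service.svc.cluster.local";
    System.out.println(Arrays.toString(InetAddress.getAllByName(host)));
    Thread.sleep(31_000);
    System.out.println(Arrays.toString(InetAddress.getAllByName(host)));
  }
}
```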
Impressive investigation 🚀 I'm surprised we've never had issues with this in prod, where the gateway fails to contact certain brokers because of this 🤔 One workaround could be to use IP addresses instead of hostnames. We want hostnames for configuration purposes (so users don't have to reconfigure everything every time an IP changes), but perhaps nodes could advertise their current IP address in gossip, such that the latest IP is always picked up eventually regardless of DNS issues? 🤔 This might not work with nodes behind reverse proxies though :(
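To make the gossip idea above concrete, a hypothetical sketch follows; the class and method names (`AdvertisedAddressDirectory`, `onGossipUpdate`) are invented for illustration and are not Zeebe or Atomix APIs. The idea is that each node gossips its currently bound IP, and the gateway prefers that advertised address over a DNS lookup when opening a connection.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical directory of broker addresses fed by gossip, falling back to DNS. */
public final class AdvertisedAddressDirectory {

  private final Map<String, InetSocketAddress> advertised = new ConcurrentHashMap<>();

  /** Called whenever a gossip update for a member arrives carrying its current IP. */
  public void onGossipUpdate(String memberId, InetSocketAddress currentAddress) {
    advertised.put(memberId, currentAddress);
  }

  /** Prefer the gossiped IP; fall back to the configured hostname. */
  public InetSocketAddress resolve(String memberId, String configuredHost, int port) {
    return Optional.ofNullable(advertised.get(memberId))
        // createUnresolved defers the DNS lookup to connect time
        .orElseGet(() -> InetSocketAddress.createUnresolved(configuredHost, port));
  }
}
```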
A similar issue is camunda/camunda#8264, which @oleschoenburg mentioned in Slack.
**Investigation**

What I found out so far is that it looks like we have two separate issues. One is that our infrastructure has some issues with setting up pods via Calico (?). I have observed during some experiments that a broker was not able to come up, because it had no

Secondly, we have the issue with the gateways described above not being able to resolve the correct IP. I have investigated this further as well.

**Details: Broker doesn't come up**

When running some experiments against my cluster, which looked healthy before, it failed because one pod didn't come up and it looked like it couldn't reach the others, so I did the following checks.

```
$ kubectl debug -it --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --target=zeebe zeebe-2
Targeting container "zeebe". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-jb6mn.
If you don't see a command prompt, try pressing enter.
root@zeebe-2:/#
root@zeebe-2:/# nslookup zeebe-0.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local:26501
;; connection timed out; no servers could be reached
root@zeebe-2:/# nslook up zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local
bash: nslook: command not found
root@zeebe-2:/# nslookup zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local
;; connection timed out; no servers could be reached
```

I checked whether the nameserver is reachable.
It was not:
When checking the network interfaces, I saw that
SRE was able to fix this by restarting the Calico nodes. Still, it is unclear why this happened. If we see that again we should ping SRE.

**Details: Gateway can't talk with brokers**

After the issue from above was resolved, I ran our experiments against the cluster again. This time I saw the gateway issues again: it can't talk with the other brokers, and the client was not able to deploy any processes. I took another heap dump and checked whether I can resolve the addresses on the pod. Name resolving was in general possible, but the gateway still had old IPs in the heap.

I have investigated a bit further how the DNS TTL works in Java and checked our code a bit regarding this issue. I found several resources, like https://stackoverflow.com/questions/1256556/how-to-make-java-honor-the-dns-caching-timeout, which point out that you can configure the DNS TTL if you want, but if it is not set it should be around 30 seconds. I was able to prove that based on the code and heap dump as well. Taking a look at how the addresses are cached in InetAddress and how the expiry is calculated, we can see the following:

Furthermore, I was able to find the CachedAddresses in the heap dump I took earlier. Since they use nanoseconds as the expiry time, we can't just use the normal Unix timestamp to calculate the time; we need to use the same reference as the System.nanoTime() used when creating the timestamp. I checked the ActorClock nanoseconds field to have a reference and to verify how far in the future this expiry would be. As we can see based on the values, the difference is ~12 seconds (see the sketch after this comment for how to interpret these values). I have also further checked whether we reuse any IPs or addresses, but we seem to generously create new Address objects every time, and also try to resolve the addresses when connecting.

**Details: Chaos experiment changes**

I tried to overcome the issue by not always choosing the same leader. Previously in our experiments we often just used the leader of partition one; I thought it might be an issue to restart that node too often. It is likely that the same node becomes leader again if it has the longest log 🤷 This didn't help, unfortunately.

**Current State - Next**

I guess I need to dig a bit deeper. I have also improved our current toString from

Right now I feel it is an infrastructure issue and not a code issue, since it is also easy to reproduce in SaaS (not sure whether I ever saw it in self-managed; have to check again).
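For reference, a small sketch of the expiry calculation mentioned above, assuming you have extracted two raw values from the same JVM's heap dump: the `InetAddress$CachedAddresses` expiry and a `System.nanoTime()`-based reference such as the ActorClock nanosecond field. The concrete values are placeholders passed as arguments.

```java
import java.time.Duration;

public class CachedAddressExpiry {

  public static void main(String[] args) {
    // Values read from the gateway heap dump (placeholders for illustration):
    // the expiry stored in InetAddress$CachedAddresses and a nanoTime reference
    // taken from the same JVM (e.g. the ActorClock nanoseconds field).
    final long cachedExpiryNanos = Long.parseLong(args[0]);
    final long referenceNanos = Long.parseLong(args[1]);

    // System.nanoTime() has an arbitrary origin, so only the difference between
    // two readings from the same JVM is meaningful; never compare it to epoch millis.
    final Duration remaining = Duration.ofNanos(cachedExpiryNanos - referenceNanos);
    System.out.println("cache entry expires in " + remaining.toSeconds() + "s");
  }
}
```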
Just a heads up: I wrote a runbook regarding the DNS troubleshooting: https://github.com/camunda-cloud/runbook/pull/154

I investigated further why addresses seem to be reused. I might have an idea: it looks like we open channels and reuse them for a longer period of time. Interesting is that we do not close the channel on timeout, which might cause the issue we currently see. I tried to fix it with camunda/camunda@2fcf80e. Unfortunately it doesn't resolve the issue yet, but I might have some more insights, since I also added some more details to the logging, as described above.

What I was able to see is:

When the deploy is sent: `Deploy file bpmn/one_task.bpmn (size: 2526 bytes).` We direct it to zeebe-1: `Send request 808630568 to zeebe-1.zeebe-broker-service.ef47e566-7c38-42d1-b815-91efbb7ff317-zeebe.svc.cluster.local/:26501 with topic command-api-1`

The request seems to fail later: `Request 808630568 failed, will not retry!`

We seem to resolve the correct address, see here: `Activating server protocol version V2 for connection to /10.56.21.41:35654` But somehow the port is wrong? @npepinpe do you know why this could happen? Or is this not relevant?

We are also able to resolve the correct address here: `Resolved Address zeebe-broker-service/10.56.21.41:26502 to bootstrap client.` but somehow it doesn't work.

Would be happy for any input.
Just for posterity: Hey team, small update from my side. Last week I tried to further investigate why the integration between zbchaos and Testbench fails. TL;DR: as far as I know now, it seems to be related to the setup in SaaS, but something within the gateway doesn't work well.

**Recap: What is the problem**
🕵️ **Investigation summary**

**No persisted commands**

**Old IPs**

**Tried workarounds:**

**Other issues:**

**Configuration:**

**Code:**

❓ **Open questions:**
Thanks to @deepthi and @nicolas for their input. 🙇 Would be happy if someone has some time to spare to go through the NettyMessaging code with me.
Yesterday I worked on adding some new metrics to the NettyMessagingService: https://github.com/camunda/zeebe/tree/zell-zbchaos-investigation

I have rerun the chaos experiments, and we can see clearly in the metrics that for quite a while it is sending requests to the wrong IP! The executed experiment failed with: 'Should be able to create process instances on partition two'. zbchaos failed the first time around 2022-12-20 08:30:10.171 CET with:

From 8:28 until 8:40 the gateway is sending to
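For illustration, here is a sketch of the kind of counter added on the branch; the actual metric names in `zell-zbchaos-investigation` may differ. It assumes the Prometheus `simpleclient` library and labels outgoing requests by the resolved remote address, so that traffic to stale IPs becomes visible in the dashboards.

```java
import io.prometheus.client.Counter;
import java.net.InetSocketAddress;

/** Illustrative only: the metric names on the investigation branch may differ. */
public final class MessagingMetrics {

  private static final Counter REQUESTS_TO_ADDRESS =
      Counter.build()
          .namespace("zeebe")
          .name("messaging_requests_to_address_total")
          .help("Outgoing messaging requests, labeled by resolved remote address")
          .labelNames("address")
          .register();

  /** Call right before a request is written to a channel. */
  public static void countRequest(InetSocketAddress remote) {
    // Labeling by the concrete IP:port makes it visible when the gateway keeps
    // sending to an IP that no longer belongs to any broker pod.
    REQUESTS_TO_ADDRESS.labels(remote.toString()).inc();
  }
}
```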
Update: We can see in the gateway metrics that around this time the failure rate goes up. /cc @npepinpe
11307: Close channels on Timeout r=Zelldon a=Zelldon

## Description

We have seen in several experiments that channels have been reused even after a long period of unavailability of the other side, meaning that the brokers were already restarted and had new IPs, but gateways still sent requests to the old IPs without realizing the issue. The problem here is that we have implemented the timeout on top, because as far as we can see right now the channels don't provide that feature. The channel response handling was never called, which means no action could be taken there.

Our fix is now to listen for the timeout exception in order to close the channel. This will cause us to create a new channel on the next request, which allows us to reconnect to the correct node/IP. Please see the related investigation camunda/zeebe-chaos#294

I tried to add two new test cases: one which verifies that the channel is closed on timeout, and another which verifies that a new channel is opened after the timeout and the next request. The NettyMessagingService is right now hard to test and has a lot of internals which we need for our tests, which makes the tests not as clean as I would like them to be. I hope we can improve this code part soon or replace it with gRPC. I will run some benchmarks with this branch to see whether it really solves this issue, but a previous QA run succeeded 🤞

## Related issues

closes camunda/zeebe-chaos#294

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
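A minimal sketch of the idea behind this fix, not the actual `NettyMessagingService` code: the request timeout is implemented on top of the channel with a future, and when it fires we close the channel so that the next request opens a fresh connection and re-resolves the broker address. The inbound handler that correlates responses to requests is omitted.

```java
import io.netty.channel.Channel;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of the fix idea: drop the channel when a request times out. */
final class TimeoutAwareChannel {

  private final Channel channel;

  TimeoutAwareChannel(Channel channel) {
    this.channel = channel;
  }

  CompletableFuture<byte[]> send(byte[] request, long timeoutMillis) {
    final CompletableFuture<byte[]> response = new CompletableFuture<>();
    // The request/response timeout lives on top of the channel, so the channel
    // itself never notices that the peer is gone.
    response.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    response.whenComplete(
        (ok, error) -> {
          if (error instanceof TimeoutException) {
            // Closing forces the next request onto a new channel, which
            // reconnects and resolves the (possibly new) broker IP again.
            channel.close();
          }
        });
    // writeAndFlush is fire-and-forget here; the response future is completed by
    // the inbound handler that matches replies to request ids (omitted).
    channel.writeAndFlush(request);
    return response;
  }
}
```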
11311: [Backport stable/8.1] Close channels on Timeout r=Zelldon a=backport-action

# Description

Backport of #11307 to `stable/8.1`.

relates to camunda/zeebe-chaos#294

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
This issue was fixed in Zeebe via camunda/camunda#11307
The fix was merged on main and the scheduled QA run succeeded. After backporting to 8.1 the QA also succeeded again here

Self-managed is also running through when executing the IT locally. There is still a question why it never failed with self-managed. My thought was that maybe the gateway might have been restarted as well in between, since we have seen that restarting the gateway helps. @npepinpe mentioned that it might be related to the IP assignments: in self-managed we get the same IP again, which works in this case, while in SaaS we have many more clusters and it is likely to not get the same address. At this point it is not 100% clear, but I feel it is good for now and we can close this issue.
11312: [Backport 8.0]: Close channels on Timeout r=Zelldon a=Zelldon

## Description

Backports #11307

I had to manually backport #11307, since the `@VisibleForTesting` annotation doesn't exist there. I dropped these commits; I don't think it is worth backporting them.

## Related issues

closes camunda/zeebe-chaos#294

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
After restructuring zbchaos and running it against our clusters, it often fails with timeout errors when deploying a process model.
This happened when automating with Testbench, running against SaaS, but also when executing locally and running against a self-managed cluster.
Interesting is that when this happens with Testbench and the deployed zbchaos worker, the worker retries every 5 minutes (after the job timeout) and fails again, until the experiment fails due to timeout. I was not able to deploy by hand with zbctl either, which is a bit suspicious to me. It looks like there is an issue in the gateway or broker when running such experiments.
**Details**
As mentioned above, I have run the experiment locally against a self-managed cluster and increased the log level (Zeebe trace, Atomix debug).
When the experiment fails, we can see in the gateway that the request is sent to the broker:
(At least I hope this is the request?!)
In the broker I see several issues with polling etc. between leader and followers, and later different leaders.
Furthermore, I see that a response can't be sent back (from broker 0, which is our leader at this time):