[fastdp] can get stuck in 'sleeve' mode even though fastdp is possible #1737
Comments
Any plans to implement a return from sleeve to fastdp? I opened a ticket that might be related: #3252
Sorry, no plans yet. However, a PR is more than welcome!
@brb regarding help with this, do you have any design documents or ideas on how it should work correctly? Any WIP?
Sorry, but AFAIK there is no WIP or a design doc. Also, I haven't thought about it much. https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay to use (in most cases it is either fastdp or sleeve). A possible fix is to re-try establishing a connection (with a backoff timer) after it failed due to a missing heartbeat. After it got established, the existing sleeve forwarder could be swapped back out for fastdp.
So basically add a timer to the overlay_switch and re-try establishing fastdp?
The above sounds like a way to do it. But the open question is why fastdp is being dropped on almost every single cluster with weave-net I have seen.
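A minimal sketch of that retry-with-backoff idea, assuming hypothetical names (`attemptFastdp`, `retryFastdp`) rather than weave's actual overlay_switch API; the point is only that a failed fastdp attempt schedules another attempt instead of leaving the connection on sleeve permanently:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// attempts counts calls to attemptFastdp so the demo can fail twice and
// then succeed, simulating a transient heartbeat problem clearing up.
var attempts int

// attemptFastdp stands in for (re-)establishing a fastdp forwarder; it is
// a hypothetical placeholder, not weave's real API.
func attemptFastdp() error {
	attempts++
	if attempts < 3 {
		return errors.New("heartbeat timeout")
	}
	return nil
}

// retryFastdp keeps retrying fastdp with exponential backoff after a
// heartbeat failure, instead of giving up and staying on sleeve forever.
func retryFastdp() {
	backoff := time.Second // short for the demo; a real timer would start larger
	const maxBackoff = 30 * time.Second
	for {
		time.Sleep(backoff)
		if err := attemptFastdp(); err == nil {
			log.Println("fastdp re-established; re-run forwarder selection")
			return
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
		log.Printf("fastdp attempt %d failed, retrying in %s", attempts, backoff)
	}
}

func main() {
	retryFastdp()
}
```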
@Cryptophobia I'd recommend opening your own issue with the specifics.
thx @brb
Any update on this? This ticket is almost 3 years old and there is still no fix. Any node reboot / temporary connectivity issue switches weave to sleeve with no way to return without a full restart.
I am working on the fix for this issue. |
@murali-reddy Thanks for working on it. If you need any help testing please let me know.
We prioritise requests from paying customers. Would you like information about a support contract?
@murali-reddy I can help also. We also need this fix out soon.
It would be nice to finally get this fixed, @murali-reddy. I can also help with testing on a few clusters/environments.
@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (got it from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric/log info which can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)
There is a request to provide the status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment #3376 (comment).
I believe that CPU or kernel (offloading ipsec) are the most likely culprits, but I'm not sure what the best way to verify would be. Running a docker image and doing tests using the methods outlined here, #3252 (comment), in order to isolate.
Changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In the case of transient errors, the OverlayForwarder is marked as unhealthy. While an overlay forwarder is unhealthy it is kept on hold, so chooseBest() will skip selecting that forwarder. Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder. Fixes #1737
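As a rough illustration of the approach described in that commit message (not the actual weave code), a health flag can gate forwarder selection; the `forwarder` struct and `chooseBest` below are simplified stand-ins for the types in router/overlay_switch.go:

```go
package main

import "fmt"

// forwarder is an illustrative stand-in for an overlay forwarder
// (e.g. fastdp or sleeve) with a priority and a health flag.
type forwarder struct {
	name     string
	priority int  // lower is better, e.g. fastdp preferred over sleeve
	healthy  bool // set false on transient errors such as heartbeat misses
}

// chooseBest picks the highest-priority healthy forwarder, skipping any
// that are currently marked unhealthy (on hold).
func chooseBest(fwds []*forwarder) *forwarder {
	var best *forwarder
	for _, f := range fwds {
		if !f.healthy {
			continue
		}
		if best == nil || f.priority < best.priority {
			best = f
		}
	}
	return best
}

func main() {
	fastdp := &forwarder{name: "fastdp", priority: 0, healthy: false} // heartbeat missed
	sleeve := &forwarder{name: "sleeve", priority: 1, healthy: true}
	fwds := []*forwarder{fastdp, sleeve}

	fmt.Println("selected:", chooseBest(fwds).name) // sleeve while fastdp is on hold

	// Once fastdp is considered healthy again, re-running chooseBest
	// lets the connection upgrade back from sleeve to fastdp.
	fastdp.healthy = true
	fmt.Println("selected:", chooseBest(fwds).name) // fastdp
}
```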
So I made a fix (#3385) for this issue and tested it out. If anyone wishes to help out with testing, please use the image I shared. I am not sure how to create a situation where we have a real handshake timeout which will trigger the fallback to sleeve mode. I injected the failure in the code and tested out both the fallback (fastdp->sleeve) and the retry (sleeve->fastdp).
@murali-reddy we will roll out your image in our staging cluster tonight and get back to you. I am also not sure how to trigger the timeout issue.
@murali-reddy I have also deployed this custom image with the fix to one of our staging clusters.
I just saw a connection where this happened. @murali-reddy, some connections seem to start with sleeve and then upgrade to fastdp.
@Cryptophobia @alok87 thanks for testing out the image
Yes, during the initial connections between peers you may see this; this is the current behaviour. The changes in PR #3385 and the corresponding image I shared apply after connection establishment between the peers. I have added a little more verbose logging to heartbeat handling and forwarder selection, so you should see log messages indicating when heartbeats are missed and, once the forwarder is healthy again, when the connection is selected again.
Yes, I see that. How do I enable the debug logging?
Yes, you can enable it by adding an env variable to the weave daemon set.
Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now. Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp.
Will run for longer and monitor the outputs in the debug logs. Do these messages look good to you, i.e. the HeartBeat is acknowledged and the connection is switched?
@Cryptophobia When there is an issue (due to which the connection gets dropped from fastdp), you should see messages to that effect, and then, after the retry, you should see recovery messages.
Please see if you have a pattern like this which confirms connections are getting upgraded back to fastdp.
@murali-reddy Great, thank you for letting me know what to look for. The network seems to be more stable than months ago when we first noticed these sleeve-downgraded connections with @notmaxx. On one of the clusters the weave-net pods have been running for 20+ hours and we don't see downgraded connections at all. On the other staging cluster we had to do a new AMI roll-out, so they have only been up for 1hr and I do not see any of those downgraded connection messages. Will continue to monitor and let you know if I see those messages @murali-reddy. Any way to force the heartbeat to fail inside the pods? @notmaxx Alexey, you should test this too as you are the one who initially brought my attention to the downgraded connections from fastdp to sleeve.
We can try some iptables rules to block heartbeats to simulate the switch, but that does not really help; I have already tested by doing so. What would be interesting is to know what is causing these downgrades in real deployments. So far most of the reports are coming from AWS users.
Thank you very much for the work on this. I think I'll get to this next week, test, and provide feedback. In my case the switch usually happened within 1-3 days (depending on machine load).
I am still not seeing the downgrade messages.
From the reported issues so far there is no pattern or conditions under which it can happen, except that it is mostly reported by AWS users.
Well, we will continue to monitor and let's see if we can reproduce the missed heartbeats. Still do not see anything on my clusters running for 2+ days now with debug mode.
As things stand, fastdp can only be selected once, during the first `HeartbeatTimeout` seconds of a connection's life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in 'sleeve' mode.