This repository was archived by the owner on Jun 20, 2024. It is now read-only.

[fastdp] can get stuck in 'sleeve' mode even though fastdp is possible #1737

Closed
rade opened this issue Nov 30, 2015 · 32 comments

@rade
Member

rade commented Nov 30, 2015

As things stand, fastdp can only be selected once, during the first HeartbeatTimeout seconds of a connection's life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in 'sleeve' mode.

@notmaxx

notmaxx commented Mar 2, 2018

Any plans to implement a return from sleeve to fastdp?

I opened a ticket that might be related: #3252

@brb
Contributor

brb commented Mar 2, 2018

Sorry, no plans yet. However, a PR is more than welcome!

@brb added the help wanted label Mar 2, 2018
@notmaxx

notmaxx commented Mar 2, 2018

@brb regarding help, do you have any design documents or ideas on how it should work correctly? Any WIP?

@brb
Contributor

brb commented Mar 5, 2018

Sorry, but AFAIK there is no WIP or a design doc. Also, I haven't thought about it much.

https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay (in most cases either sleeve or fastdp) to use between two peers.
Each overlay sends periodic heartbeat messages to detect failures in an overlay connection. E.g. when fastdp times out waiting for a heartbeat message from a peer (https://github.com/weaveworks/weave/blob/v2.2.0/router/fastdp.go#L713), it notifies overlay_switch, which consequently chooses the next best overlay for the connection (https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go#L318).

A possible fix is to retry establishing the connection (with a backoff timer) after it has failed due to a missing heartbeat. Once it is re-established, the existing overlay_switch implementation will handle the rest.
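A minimal sketch of that retry idea (hypothetical names only; establish and rechooseBest stand in for whatever the forwarder and overlay_switch actually expose, this is not the real weave code):

package overlay

import "time"

// retryForwarder keeps trying to re-establish a failed forwarder with an
// exponential backoff, then asks the switch to re-run its selection so
// fastdp can be preferred again once it is back.
func retryForwarder(establish func() error, rechooseBest func()) {
	backoff := 10 * time.Second
	const maxBackoff = 10 * time.Minute
	for {
		if err := establish(); err == nil {
			rechooseBest() // let the existing overlay_switch logic pick the best forwarder
			return
		}
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}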

@Cryptophobia

So basically: add a timer to the overlay_switch and re-run chooseBest() in order to prefer fastdp after a failure, until fastdp takes over again as the preferred overlay connection.

The above sounds like a way to do it. But the open question is why fastdp is being dropped on almost every single weave-net cluster I have seen.

@bboreham
Contributor

@Cryptophobia I'd recommend opening your own issue with the specifics

@Cryptophobia

@bboreham: There are just too many clusters that have this problem with weave-net. They are all on AWS. Should I be opening an issue for each cluster? That's not sustainable and will quickly become annoying.

Also, issues like the one below are being closed as duplicates:

Example issue here: #3252

@notmaxx

notmaxx commented Jun 17, 2018

thx @brb

@davidhiendl

Any update on this? This ticket is almost 3 years old and there is still no fix. Any node reboot / temporary connectivity issue switches weave to sleeve with no way to return without a full restart.

@murali-reddy
Contributor

I am working on a fix for this issue.

@davidhiendl

@murali-reddy Thanks for working on it. If you need any help testing please let me know.

@bboreham
Contributor

This ticket is almost 3 years old and there is still no fix.

We prioritise requests from paying customers. Would you like information about a support contract?

@alok87
Contributor

alok87 commented Aug 21, 2018

@murali-reddy I can help too. We also need this fix out soon.

@Cryptophobia

It would be nice to finally get this fixed, @murali-reddy. I can also help with testing on a few clusters/environments.

@alok87
Contributor

alok87 commented Aug 25, 2018

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (got it from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric / log info which can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

@murali-reddy self-assigned this Aug 27, 2018
@murali-reddy
Contributor

murali-reddy commented Aug 27, 2018

Is there a way to measure when the switch to sleeve can happen

There is a request to provide status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment #3376 (comment).

@Cryptophobia

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (got it from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric / log info which can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

I believe that CPU or kernel (ipsec offloading) issues are the most likely culprits, but I'm not sure what the best way to verify would be. Running a Docker image and testing using the methods outlined in #3252 (comment) might help to isolate it.

murali-reddy added a commit that referenced this issue Aug 28, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737
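Roughly, the mechanism described in that commit message could be sketched as follows (a hypothetical illustration with made-up types and names, not the actual PR code):

package overlay

// forwarder is a stand-in for an overlay forwarder (fastdp or sleeve).
type forwarder struct {
	name    string
	healthy bool // cleared on a missed heartbeat, set again when one arrives
}

// chooseBest picks the first healthy forwarder from a preference-ordered
// list: an unhealthy fastdp is skipped and sleeve gets used, and fastdp
// is selected again once it reports healthy.
func chooseBest(ordered []*forwarder) *forwarder {
	for _, f := range ordered {
		if f.healthy {
			return f
		}
	}
	return nil
}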
murali-reddy added a commit that referenced this issue Aug 29, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737
@murali-reddy
Contributor

murali-reddy commented Aug 29, 2018

So I made a fix (#3385) for this issue and tested it out.

If anyone wishes to help out with testing, please use the image muralireddy/weave-kube:retry-fastdp

I am not sure how to create a situation with a real handshake timeout that will trigger the fallback to sleeve mode. I injected the failure in the code and tested both the fallback (fastdp->sleeve) and the retry (sleeve->fastdp).

@alok87
Contributor

alok87 commented Aug 29, 2018

@murali-reddy we will roll out your image in our staging cluster tonight and get back. I am also not sure how to trigger the timeout issue.

@Cryptophobia

@murali-reddy I have also deployed this custom image with the fix to one of our staging clusters.

@Cryptophobia

I just saw a connection logging INFO: 2018/08/29 15:10:03.487707 overlay_switch ->[02:db:1d:f2:a8:c1(ip-10-8-55-37.us-west-2.compute.internal)] using sleeve, and it was not upgraded to fastdp.

@murali-reddy Some connections seem to start with sleeve and then upgrade to fastdp via the overlay_switch function, and some start off as fastdp. Is this the intent?

@murali-reddy
Contributor

murali-reddy commented Aug 30, 2018

@Cryptophobia @alok87 thanks for testing out the image

Some connections seem to start with sleeve and then upgrade to fastdp via the overlay_switch function, and some start off as fastdp. Is this the intent?

Yes, during the initial connection between peers you may see this. This is the current behaviour. The changes in PR #3385 (and the corresponding image I shared) apply after the connection between the peers has been established.

I have added a little more verbose logging to heartbeat handling and forwarder selection. You should see the DEBUG logs below (note: you need to enable debug logging).

These logs indicate sleeve is being used after a missed heartbeat in the fastdp forwarder:

DEBU: 2018/08/30 09:32:00.043314 fastdp ->[192.168.56.102:6784|62:e8:01:2a:9d:cc(192.168.56.102)]: sending Heartbeat to peer
DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders
INFO: 2018/08/30 09:32:00.781674 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using sleeve

and once fastdp is in a healthy state you should see the switch back to fastdp, and weave status connections should indicate the same:

DEBU: 2018/08/30 09:32:40.784561 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: handleVxlanSpecialPacket
DEBU: 2018/08/30 09:32:40.784580 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: Got Heartbeat Ack from peer
DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

@Cryptophobia

Cryptophobia commented Aug 31, 2018

and once fastdp is in a healthy state you should see the switch back to fastdp, and weave status connections should indicate the same

Yes, I see that weave status connections is indicating healthy fastdp connections.

How do I enable --log-level=debug in the k8s weave-net daemonsets? It says that it is passed to the weave launch command. Is there an ENV var that can be set?

@murali-reddy
Contributor

Is there an ENV var that can be set?

Yes, add the env variable below to the weave container in the weave-net daemon set.

        - name: EXTRA_ARGS
          value: --log-level=debug
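(In case it helps, assuming the usual setup where the weave-net DaemonSet runs in kube-system: the variable can be added with kubectl edit daemonset weave-net -n kube-system; depending on the DaemonSet's update strategy the weave pods then either roll automatically or need to be deleted to pick up the change.)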

@Cryptophobia

Cryptophobia commented Sep 4, 2018

Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now.

Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp:

DEBU: 2018/09/04 18:59:49.749160 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sendHeartbeat
DEBU: 2018/09/04 18:59:49.749792 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sending Heartbeat to peer
DEBU: 2018/09/04 18:59:49.754691 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleHeartbeat
DEBU: 2018/09/04 18:59:49.759046 ODP miss: map[3:InPortFlowKey{vport: 2} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 22788, tpdst: 6784} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 4:EthernetFlowKey{src: 00:00:00:00:00:00, dst: 00:00:00:00:00:00}] on port 2
DEBU: 2018/09/04 18:59:49.759122 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleVxlanSpecialPacket
DEBU: 2018/09/04 18:59:49.759149 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: Got Heartbeat Ack from peer
DEBU: 2018/09/04 18:59:50.619606 ODP miss: map[19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 18:BlobFlowKey{type: 18, key: 0002, mask: ffff} 3:InPortFlowKey{vport: 2} 9:BlobFlowKey{type: 9, key: cf4618eb, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 36937, tpdst: 6784} 6:BlobFlowKey{type: 6, key: 0800, mask: ffff} 4:EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 7:BlobFlowKey{type: 7, key: 646c00026460000d06004000, mask: ffffffffffffffffffffffff}] on port 2
DEBU: 2018/09/04 18:59:50.619913 Creating ODP flow FlowSpec{keys: [InPortFlowKey{vport: 2} TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28} EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39}], actions: [OutputAction{vport: 1}]}
DEBU: 2018/09/04 18:59:50.620116 Creating ODP flow FlowSpec{keys: [EthernetFlowKey{src: 32:ez:32:25:5f:39, dst: c5:d2:f2:ef:e9:5f} InPortFlowKey{vport: 1}], actions: [SetTunnelAction{id: 000000000075bfbc, ipv4src: 172.33.140.28, ipv4dst: 172.33.88.133, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]}

Will run for longer and monitor the output in the debug logs. Do these messages look good to you, i.e. the heartbeat is acknowledged and the connection is switched?

@murali-reddy
Contributor

Do these messages look good to you, i.e. the heartbeat is acknowledged and the connection is switched?

@Cryptophobia When there is an issue (due to which the connection gets dropped from fastdp to sleeve) you should see log messages like the ones below.

DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders

and then, after the retry, you should see recovery messages like the ones below:

DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

Please check whether you see a pattern like this, which confirms connections are getting upgraded back to fastdp.

@Cryptophobia

@murali-reddy Great, thank you for letting me know what to look for. The network seems to be more stable than months ago, when we first noticed these sleeve-downgraded connections with @notmaxx.

On one of the clusters the weave-net pods have been running for 20+ hours and I don't see downgraded connections at all. On the other staging cluster we had to do a new AMI roll-out, so they have only been up for 1hr, and I do not see any of those downgraded-connection messages. Will continue to monitor and let you know if I see those messages, @murali-reddy. Any way to force the heartbeat to fail inside the pods?

@notmaxx Alexey, you should test this too, as you are the one who initially brought my attention to the downgraded connections from fastdp to sleeve.

@murali-reddy
Contributor

murali-reddy commented Sep 6, 2018

Any way to force the heartbeat to fail inside the pods?

We can use some iptables rules to block heartbeats to simulate the switch, but that does not really help; I have already tested by doing so. What would be interesting is to know what is causing these downgrades in real deployments. So far most of the reports are coming from AWS users.
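(For illustration, one way to simulate the missed heartbeats: fastdp traffic goes over 6784/UDP, as seen in the logs above, so temporarily dropping it on one node with iptables -I INPUT -p udp --dport 6784 -j DROP should make the peer miss heartbeats and fall back to sleeve; removing the rule again with iptables -D INPUT -p udp --dport 6784 -j DROP should let the forwarder recover.)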

@notmaxx

notmaxx commented Sep 6, 2018

Thank you very much for the work on this. I think I'll get to this next week, test and provide feedback. In my case the switch usually happened within 1-3 days (depending on machine load).

@Cryptophobia

So far most of the reports are coming from AWS users.

In my case the switch usually happened within 1-3 days (depending on machine load).

I am still not seeing the "missed Heartbeat from peer, marking fastdp forwarder as un-healthy" messages in my AWS staging clusters, which are under light load. I wonder if this problem only appears under heavy machine load.

@murali-reddy
Contributor

I wonder if this problem only appears under heavy machine load.

From the issues reported so far there is no pattern or conditions under which it can happen, except that it's mostly reported by AWS users.

Also, heartbeats are sent every 10 seconds, and the fallback from fastdp to sleeve mode happens only after a minute, i.e. 6 consecutive missed heartbeats. So it's puzzling under what conditions this can happen.

@Cryptophobia

Cryptophobia commented Sep 7, 2018

Well, we will continue to monitor; let's see if we can reproduce the missed heartbeats.

Still don't see anything on my clusters, which have now been running for 2+ days with debug mode.

murali-reddy added a commit that referenced this issue Sep 24, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737