This repository was archived by the owner on Jun 20, 2024. It is now read-only.

[fastdp] can get stuck in 'sleeve' mode even though fastdp is possible #1737

Closed
rade opened this issue Nov 30, 2015 · 32 comments

@rade
Member

rade commented Nov 30, 2015

As things stand, fastdp can only be selected once, during the first HeartbeatTimeout seconds of a connection's life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in 'sleeve' mode.

@notmaxx

notmaxx commented Mar 2, 2018

Any plans to implement a return from sleeve to fastdp?

I opened a ticket that might be related: #3252

@brb
Contributor

brb commented Mar 2, 2018

Sorry, no plans yet. However, a PR is more than welcome!

@brb added the help wanted label Mar 2, 2018
@notmaxx

notmaxx commented Mar 2, 2018

@brb regarding help, do you have any design documents or ideas on how it should work correctly? Any WIP?

@brb
Contributor

brb commented Mar 5, 2018

Sorry, but AFAIK there is no WIP or a design doc. Also, I haven't thought about it much.

https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay (in most cases either sleeve or fastdp) to use between two peers.
Each overlay sends periodic heartbeat messages to detect failures in an overlay connection. E.g. when fastdp times out waiting for a heartbeat message from a peer (https://github.com/weaveworks/weave/blob/v2.2.0/router/fastdp.go#L713), it notifies overlay_switch, which consequently chooses the next best overlay for the connection (https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go#L318).

A possible fix is to retry establishing the connection (with a backoff timer) after it has failed due to a missing heartbeat. Once it is re-established, the existing overlay_switch implementation will handle the rest.
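A minimal sketch of that retry idea (hypothetical names only; establish and rechooseBest stand in for whatever the forwarder and overlay_switch actually expose, this is not the real weave code):

package overlay

import "time"

// retryForwarder keeps trying to re-establish a failed forwarder with an
// exponential backoff, then asks the switch to re-run its selection so
// fastdp can be preferred again once it is back.
func retryForwarder(establish func() error, rechooseBest func()) {
	backoff := 10 * time.Second
	const maxBackoff = 10 * time.Minute
	for {
		if err := establish(); err == nil {
			rechooseBest() // let the existing overlay_switch logic pick the best forwarder
			return
		}
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}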

@Cryptophobia

So basically: add a timer to the overlay_switch and re-run chooseBest() in order to prefer fastdp after a failure, until fastdp takes over again as the preferred overlay connection.

The above sounds like a way to do it. But the open question is why fastdp is being dropped on almost every single weave-net cluster I have seen.

@bboreham
Contributor

@Cryptophobia I'd recommend opening your own issue with the specifics

@Cryptophobia

@bboreham: There are just too many clusters that have this problem with weave-net. They are all on AWS. Should I be opening an issue for each cluster? That's not sustainable and will quickly become annoying.

Also, issues like the one below are being closed as duplicates:

Example issue here: #3252

@notmaxx

notmaxx commented Jun 17, 2018

thx @brb

@davidhiendl

Any update on this? This ticket is almost 3 years old and there is still no fix. Any node reboot / temporary connectivity issue switches weave to sleeve with no way to return without a full restart.

@murali-reddy
Contributor

I am working on a fix for this issue.

@davidhiendl

@murali-reddy Thanks for working on it. If you need any help testing please let me know.

@bboreham
Contributor

This ticket is almost 3 years old and there is still no fix.

We prioritise requests from paying customers. Would you like information about a support contract?

@alok87
Contributor

alok87 commented Aug 21, 2018

@murali-reddy I can help too. We also need this fix out soon.

@Cryptophobia

It would be nice to finally get this fixed, @murali-reddy. I can also help with testing on a few clusters/environments.

@alok87
Contributor

alok87 commented Aug 25, 2018

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (got it from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric / log info which can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

@murali-reddy self-assigned this Aug 27, 2018
@murali-reddy
Contributor

murali-reddy commented Aug 27, 2018

Is there a way to measure when the switch to sleeve can happen

There is a request to provide status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment #3376 (comment).

@Cryptophobia

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (got it from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric / log info which can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

I believe that CPU or kernel (ipsec offloading) issues are the most likely culprits, but I'm not sure what the best way to verify would be. Running a Docker image and testing using the methods outlined in #3252 (comment) might help to isolate it.

murali-reddy added a commit that referenced this issue Aug 28, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737
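Roughly, the mechanism described in that commit message could be sketched as follows (a hypothetical illustration with made-up types and names, not the actual PR code):

package overlay

// forwarder is a stand-in for an overlay forwarder (fastdp or sleeve).
type forwarder struct {
	name    string
	healthy bool // cleared on a missed heartbeat, set again when one arrives
}

// chooseBest picks the first healthy forwarder from a preference-ordered
// list: an unhealthy fastdp is skipped and sleeve gets used, and fastdp
// is selected again once it reports healthy.
func chooseBest(ordered []*forwarder) *forwarder {
	for _, f := range ordered {
		if f.healthy {
			return f
		}
	}
	return nil
}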
murali-reddy added a commit that referenced this issue Aug 29, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737
@murali-reddy
Contributor

murali-reddy commented Aug 29, 2018

So I made a fix (#3385) for this issue and tested it out.

If anyone wishes to help out with testing, please use the image muralireddy/weave-kube:retry-fastdp

I am not sure how to create a situation with a real handshake timeout that will trigger the fallback to sleeve mode. I injected the failure in the code and tested both the fallback (fastdp->sleeve) and the retry (sleeve->fastdp).

@alok87
Contributor

alok87 commented Aug 29, 2018

@murali-reddy we will roll out your image in our staging cluster tonight and get back. I am also not sure how to trigger the timeout issue.

@Cryptophobia

@murali-reddy I have also deployed this custom image with the fix to one of our staging clusters.

@Cryptophobia

I just saw a connection logging INFO: 2018/08/29 15:10:03.487707 overlay_switch ->[02:db:1d:f2:a8:c1(ip-10-8-55-37.us-west-2.compute.internal)] using sleeve, and it was not upgraded to fastdp.

@murali-reddy Some connections seem to start with sleeve and then upgrade to fastdp via the overlay_switch function, and some start off as fastdp. Is this the intent?

@murali-reddy
Contributor

murali-reddy commented Aug 30, 2018

@Cryptophobia @alok87 thanks for testing out the image

Some connections seem to start with sleeve and then upgrade to fastdp via the overlay_switch function, and some start off as fastdp. Is this the intent?

Yes, during the initial connection between peers you may see this. This is the current behaviour. The changes in PR #3385 (and the corresponding image I shared) apply after the connection between the peers has been established.

I have added a little more verbose logging to heartbeat handling and forwarder selection. You should see the DEBUG logs below (note: you need to enable debug logging).

These logs indicate sleeve is being used after a missed heartbeat in the fastdp forwarder:

DEBU: 2018/08/30 09:32:00.043314 fastdp ->[192.168.56.102:6784|62:e8:01:2a:9d:cc(192.168.56.102)]: sending Heartbeat to peer
DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders
INFO: 2018/08/30 09:32:00.781674 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using sleeve

and once fastdp is in a healthy state you should see the switch back to fastdp, and weave status connections should indicate the same:

DEBU: 2018/08/30 09:32:40.784561 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: handleVxlanSpecialPacket
DEBU: 2018/08/30 09:32:40.784580 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: Got Heartbeat Ack from peer
DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

@Cryptophobia

Cryptophobia commented Aug 31, 2018

and once fastdp is in a healthy state you should see the switch back to fastdp, and weave status connections should indicate the same

Yes, I see that weave status connections is indicating healthy fastdp connections.

How do I enable --log-level=debug in the k8s weave-net daemonsets? It says that it is passed to the weave launch command. Is there an ENV var that can be set?

@murali-reddy
Contributor

Is there an ENV var that can be set?

Yes, add the env variable below to the weave container in the weave-net daemon set.

        - name: EXTRA_ARGS
          value: --log-level=debug
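(In case it helps, assuming the usual setup where the weave-net DaemonSet runs in kube-system: the variable can be added with kubectl edit daemonset weave-net -n kube-system; depending on the DaemonSet's update strategy the weave pods then either roll automatically or need to be deleted to pick up the change.)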

@Cryptophobia

Cryptophobia commented Sep 4, 2018

Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now.

Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp:

DEBU: 2018/09/04 18:59:49.749160 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sendHeartbeat
DEBU: 2018/09/04 18:59:49.749792 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sending Heartbeat to peer
DEBU: 2018/09/04 18:59:49.754691 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleHeartbeat
DEBU: 2018/09/04 18:59:49.759046 ODP miss: map[3:InPortFlowKey{vport: 2} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 22788, tpdst: 6784} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 4:EthernetFlowKey{src: 00:00:00:00:00:00, dst: 00:00:00:00:00:00}] on port 2
DEBU: 2018/09/04 18:59:49.759122 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleVxlanSpecialPacket
DEBU: 2018/09/04 18:59:49.759149 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: Got Heartbeat Ack from peer
DEBU: 2018/09/04 18:59:50.619606 ODP miss: map[19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 18:BlobFlowKey{type: 18, key: 0002, mask: ffff} 3:InPortFlowKey{vport: 2} 9:BlobFlowKey{type: 9, key: cf4618eb, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 36937, tpdst: 6784} 6:BlobFlowKey{type: 6, key: 0800, mask: ffff} 4:EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 7:BlobFlowKey{type: 7, key: 646c00026460000d06004000, mask: ffffffffffffffffffffffff}] on port 2
DEBU: 2018/09/04 18:59:50.619913 Creating ODP flow FlowSpec{keys: [InPortFlowKey{vport: 2} TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28} EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39}], actions: [OutputAction{vport: 1}]}
DEBU: 2018/09/04 18:59:50.620116 Creating ODP flow FlowSpec{keys: [EthernetFlowKey{src: 32:ez:32:25:5f:39, dst: c5:d2:f2:ef:e9:5f} InPortFlowKey{vport: 1}], actions: [SetTunnelAction{id: 000000000075bfbc, ipv4src: 172.33.140.28, ipv4dst: 172.33.88.133, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]}

Will run for longer and monitor the output in the debug logs. Do these messages look good to you, i.e. the heartbeat is acknowledged and the connection is switched?

@murali-reddy
Contributor

Do these messages look good to you, i.e. the heartbeat is acknowledged and the connection is switched?

@Cryptophobia When there is an issue (due to which the connection gets dropped from fastdp to sleeve) you should see log messages like the ones below.

DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders

and then, after the retry, you should see recovery messages like the ones below:

DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

Please check whether you see a pattern like this, which confirms connections are getting upgraded back to fastdp.

@Cryptophobia

@murali-reddy Great, thank you for letting me know what to look for. The network seems to be more stable than months ago, when we first noticed these sleeve-downgraded connections with @notmaxx.

On one of the clusters the weave-net pods have been running for 20+ hours and I don't see downgraded connections at all. On the other staging cluster we had to do a new AMI roll-out, so they have only been up for 1hr, and I do not see any of those downgraded-connection messages. Will continue to monitor and let you know if I see those messages, @murali-reddy. Any way to force the heartbeat to fail inside the pods?

@notmaxx Alexey, you should test this too, as you are the one who initially brought my attention to the downgraded connections from fastdp to sleeve.

@murali-reddy
Contributor

murali-reddy commented Sep 6, 2018

Any way to force the heartbeat to fail inside the pods?

We can use some iptables rules to block heartbeats to simulate the switch, but that does not really help; I have already tested by doing so. What would be interesting is to know what is causing these downgrades in real deployments. So far most of the reports are coming from AWS users.
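(For illustration, one way to simulate the missed heartbeats: fastdp traffic goes over 6784/UDP, as seen in the logs above, so temporarily dropping it on one node with iptables -I INPUT -p udp --dport 6784 -j DROP should make the peer miss heartbeats and fall back to sleeve; removing the rule again with iptables -D INPUT -p udp --dport 6784 -j DROP should let the forwarder recover.)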

@notmaxx

notmaxx commented Sep 6, 2018

Thank you very much for the work on this. I think I'll get to this next week, test and provide feedback. In my case the switch usually happened within 1-3 days (depending on machine load).

@Cryptophobia

So far most of the reports are coming from AWS users.

In my case the switch usually happened within 1-3 days (depending on machine load).

I am still not seeing the "missed Heartbeat from peer, marking fastdp forwarder as un-healthy" messages in my AWS staging clusters, which are under light load. I wonder if this problem only appears under heavy machine load.

@murali-reddy
Contributor

I wonder if this problem only appears under heavy machine load.

From the issues reported so far there is no pattern or conditions under which it can happen, except that it's mostly reported by AWS users.

Also, heartbeats are sent every 10 seconds, and the fallback from fastdp to sleeve mode happens only after a minute, i.e. 6 consecutive missed heartbeats. So it's puzzling under what conditions this can happen.

@Cryptophobia

Cryptophobia commented Sep 7, 2018

Well, we will continue to monitor; let's see if we can reproduce the missed heartbeats.

Still don't see anything on my clusters, which have now been running for 2+ days with debug mode.

murali-reddy added a commit that referenced this issue Sep 24, 2018
These changes differentiate between fatal errors (e.g. ipsec init) and transient errors (heartbeat misses). In case of transient errors the OverlayForwarder is marked as unhealthy. When the overlay forwarder is unhealthy it is kept on hold so chooseBest() will skip selecting that forwarder.

Once the OverlayForwarder is considered healthy, chooseBest() is run again so the forwarder can be selected if it is the best forwarder.

Fixes #1737