
VXLAN tests are not working on packet #68

Closed
Bolodya1997 opened this issue May 24, 2021 · 11 comments
Labels: ASAP (as soon as possible)


@Bolodya1997

https://github.com/networkservicemesh/integration-k8s-packet/actions/runs/856582486

@Bolodya1997
Author

So the cause is somehow related to hostNetwork: true, because commenting out this line makes the VXLAN tests pass.

@denis-tingaikin
Member

denis-tingaikin commented May 31, 2021

/cc @edwarnicke

@denis-tingaikin
Member

As a workaround, we could disable hostNetwork for packet clusters.

Currently, it seems to me that something is missing in the packet configuration, because packet works fine in the monorepo: https://github.com/networkservicemesh/networkservicemesh/blob/master/deployments/helm/nsm/templates/forwarding-plane.tpl#L15

@denis-tingaikin
Member

@Bolodya1997, @d-uzlov I think we also need to check whether it will work if we remove these lines in the forwarder:
https://github.com/networkservicemesh/cmd-forwarder-vpp/blob/main/internal/vppinit/vppinit.go#L205-L207

Note: we are not filtering ARPs in the monorepo.
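
For context, the lines in question appear to decide which kernel ARP (neighbor) entries get programmed into VPP. A minimal sketch of that kind of filter, assuming the github.com/vishvananda/netlink package; the skip condition below is hypothetical, the exact one lives in the linked vppinit.go lines:

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up the host interface the forwarder attaches to (bond0 on Packet).
	link, err := netlink.LinkByName("bond0")
	if err != nil {
		panic(err)
	}
	// List the kernel IPv4 neighbor (ARP) entries for that interface.
	neighs, err := netlink.NeighList(link.Attrs().Index, netlink.FAMILY_V4)
	if err != nil {
		panic(err)
	}
	for _, n := range neighs {
		// Hypothetical filter: skip entries that are neither reachable nor permanent.
		if n.State&(netlink.NUD_REACHABLE|netlink.NUD_PERMANENT) == 0 {
			continue
		}
		fmt.Printf("would program VPP neighbor %s -> %s\n", n.IP, n.HardwareAddr)
	}
}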

@denis-tingaikin
Member

@d-uzlov Could you please attach all the logs that we captured related to this issue?

@edwarnicke
Member

@DVEfremov It would be helpful to also have a trimmed-down summary of what looks like it might be going wrong.

I could spot a lot of potential issues from such a summary.

Things like:

Are the tests failing because the ping isn't working?
or
Are the tests failing because the Request is returning an error?
If the tests are failing because the Request is returning an error, what error? Can we trace that error back to a deeper error? In what component is it originating?

Are the tests failing because Close is returning an error?

Is some component panicking?

Etc

@denis-tingaikin
Member

/cc @d-uzlov

@Bolodya1997
Author

@edwarnicke

Are the tests failing because the ping isn't working?
or
Are the tests failing because the Request is returning an error?
If the tests are failing because the Request is returning an error, what error? Can we trace that error back to a deeper error? In what component is it originating?

Are the tests failing because Close is returning an error?

Is some component panicking?

  1. The NSM chain works as expected - the NSC receives a success response with all IPs/routes set. There are no panics.
  2. Kernel interfaces and routes are also set in both the NSC and the NSE.
  3. Ping is not working.

@edwarnicke
Member

I think I've traced this back to a cause (not yet the root cause).

On Packet, interfaces have multiple IPv4 addresses (136.144.51.109/31, the Pod IP, and 10.99.35.131/31):

6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether e4:43:4b:5f:6d:50 brd ff:ff:ff:ff:ff:ff
    inet 136.144.51.109/31 brd 255.255.255.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet 10.99.35.131/31 brd 255.255.255.255 scope global bond0:0
       valid_lft forever preferred_lft forever
    inet6 2604:1380:0:2c00::3/127 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::e643:4bff:fe5f:6d50/64 scope link 
       valid_lft forever preferred_lft forever

and multiple routes:

# ip route
default via 136.144.51.108 dev bond0 onlink 
10.0.0.0/8 via 10.99.35.130 dev bond0 
10.99.35.130/31 dev bond0 proto kernel scope link src 10.99.35.131 
136.144.51.108/31 dev bond0 proto kernel scope link src 136.144.51.109 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/16 dev weave proto kernel scope link src 192.168.192.0 

(ignore the last two for docker and weave)

This is different from most of our other environments, which are much simpler, having a single IPv4 address on the main interface (the hostNetwork: true Pod IP).

None of this is intrinsically a problem. It's just a difference.
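
For reference, a minimal sketch of how the host interface state above can be enumerated programmatically, assuming the github.com/vishvananda/netlink package (the interface name bond0 is taken from the output above); this is an illustration, not the forwarder's actual code:

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	// bond0 is the Packet host interface shown above.
	link, err := netlink.LinkByName("bond0")
	if err != nil {
		panic(err)
	}
	// On Packet this returns several addresses (public IPv4 Pod IP, private IPv4, IPv6).
	addrs, err := netlink.AddrList(link, netlink.FAMILY_ALL)
	if err != nil {
		panic(err)
	}
	for _, a := range addrs {
		fmt.Println("addr:", a.IPNet)
	}
	// And several routes, including the default route.
	routes, err := netlink.RouteList(link, netlink.FAMILY_ALL)
	if err != nil {
		panic(err)
	}
	for _, r := range routes {
		fmt.Println("route:", r.Dst, "via", r.Gw)
	}
}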

VPP correctly picked up the Pod IP as the host-bond0 address:

# vppctl show int addr
host-bond0 (up):
  L3 136.144.51.109/31

and the MAC address:

# vppctl show hardware   
              Name                Idx   Link  Hardware
host-bond0                         1     up   host-bond0
  Link speed: unknown
  Ethernet address e4:43:4b:5f:6d:50
  Linux PACKET socket interface

which matches the bond0 interface above.

and also correctly picked up the neighbor for it:

# vppctl show ip neighbor
    Time                       IP                    Flags      Ethernet              Interface       
      2.2429             136.144.51.108                S    b0:33:a6:fe:79:d7 host-bond0

which matches the neighbor from the kernel:

# ip neighbor | grep 136.144.51.108
136.144.51.108 dev bond0 lladdr b0:33:a6:fe:79:d7 REACHABLE

Where things go wrong is on routes:

# vppctl show ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, nat-hi:2, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:12 to:[1:96]]
    [0] [@3]: arp-ipv4: via 38.4.19.128 host-bond0
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4
10.0.0.0/8
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:11 buckets:1 uRPF:11 to:[0:0]]
    [0] [@3]: arp-ipv4: via 10.99.35.130 host-bond0
10.99.35.130/31
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:11 to:[0:0]]
    [0] [@3]: arp-ipv4: via 10.99.35.130 host-bond0
136.144.51.108/31
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:13 buckets:1 uRPF:11 to:[0:0]]
    [0] [@3]: arp-ipv4: via 10.99.35.130 host-bond0
136.144.51.108/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:9 buckets:1 uRPF:9 to:[0:0]]
    [0] [@5]: ipv4 via 136.144.51.108 host-bond0: mtu:1500 next:4 b033a6fe79d7e4434b5f6d500800
136.144.51.109/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:10 buckets:1 uRPF:10 to:[1760:2611356]]
    [0] [@2]: dpo-receive: 136.144.51.109 on host-bond0
147.75.199.143/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:16 buckets:1 uRPF:12 to:[0:0]]
    [0] [@3]: arp-ipv4: via 38.4.19.128 host-bond0
224.0.0.0/4
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:4 buckets:1 uRPF:3 to:[0:0]]
    [0] [@0]: dpo-drop ip4
240.0.0.0/4
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:3 buckets:1 uRPF:2 to:[0:0]]
    [0] [@0]: dpo-drop ip4
255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4

Most of these simply match the routes from the kernel (as expected), but there is one that is really screwed up:

0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:12 to:[1:96]]
    [0] [@3]: arp-ipv4: via 38.4.19.128 host-bond0

It should be going via 136.144.51.108 host-bond0 but is instead going via 38.4.19.128 host-bond0.

I have no idea where via 38.4.19.128 host-bond0 came from. It's clearly wrong.

The code that sets it comes from here:
https://github.com/networkservicemesh/cmd-forwarder-vpp/blob/0eb8dcca85c0ba98beb5d8bb89c626c13fe9b5e7/internal/vppinit/vppinit.go#L160

It's correctly adding the other routes with the correct gateway addresses.

@d-uzlov

d-uzlov commented Jun 7, 2021

38.4.19.128 is what you get when you take the first 4 bytes of 2604:1380:0:2c00::2 and read them as an IPv4 address.
At line 160 we lose the metadata saying that the address is IPv6, so it gets interpreted as IPv4.
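
A small standalone Go snippet that reproduces the misinterpretation (an illustration of the byte truncation, not the forwarder code itself):

package main

import (
	"fmt"
	"net"
)

func main() {
	// The IPv6 gateway address mentioned above; net.ParseIP stores it as 16 bytes.
	gw := net.ParseIP("2604:1380:0:2c00::2")
	// Taking only the first 4 of its 16 bytes reads it as an IPv4 address.
	fmt.Println(net.IP(gw[:4])) // 38.4.19.128
	// To4() returns nil for a real IPv6 address, so it can be used as a guard
	// before programming the route into the IPv4 FIB.
	fmt.Println(gw.To4() == nil) // true
}

Checking To4() == nil (or tracking the address family explicitly) would be one way to keep the IPv6 gateway out of the IPv4 FIB.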

@d-uzlov

d-uzlov commented Jun 16, 2021

Now that the packet CI is working properly again and everything is green, we can finally close this.

@d-uzlov d-uzlov closed this as completed Jun 16, 2021