Cannot access coredevice across subnets after DHCP feature merge #1930

Closed
airwoodix opened this issue Jul 6, 2022 · 15 comments · Fixed by #2059

Comments

@airwoodix
Contributor

Bug Report

One-Line Summary

For gateware/firmware built against an ARTIQ-7 revision after (including) c60de48 (smoltcp update and DHCP feature), the coredevice cannot be accessed across subnets.

This is a regression compared to gateware built against 06ad76b.

Issue Details

The coredevice is configured with a static IP XX.YY.0.137. With gateware/firmware built against 06ad76b, pings from XX.YY.0.5 (frames 1 to 4) as well as from XX.YY.2.5 (frames 5 to 8) are successful:

No.     Time           Source                Destination           Protocol Length Info
      1 0.000000       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9f12, seq=0/0, ttl=64 (reply in 2)

No.     Time           Source                Destination           Protocol Length Info
      2 0.000260       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x9f12, seq=0/0, ttl=64 (request in 1)

No.     Time           Source                Destination           Protocol Length Info
      3 1.014239       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9f12, seq=1/256, ttl=64 (reply in 4)

No.     Time           Source                Destination           Protocol Length Info
      4 1.014473       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x9f12, seq=1/256, ttl=64 (request in 3)

No.     Time           Source                Destination           Protocol Length Info
      5 5.509846       XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0xe719, seq=0/0, ttl=64 (reply in 6)

No.     Time           Source                Destination           Protocol Length Info
      6 5.510131       XX.YY.0.137           XX.YY.2.5            ICMP     98     Echo (ping) reply    id=0xe719, seq=0/0, ttl=64 (request in 5)

No.     Time           Source                Destination           Protocol Length Info
      7 6.522022       XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0xe719, seq=1/256, ttl=64 (reply in 8)

No.     Time           Source                Destination           Protocol Length Info
      8 6.522258       XX.YY.0.137           XX.YY.2.5            ICMP     98     Echo (ping) reply    id=0xe719, seq=1/256, ttl=64 (request in 7)

With gateware/firmware built against d17675e (to this date, any revision after the DHCP feature), pings from the same subnet (frames 4 to 9) still succeed, with a small hiccup at the beginning (frames 1 and 2), but replies to pings from another subnet (frames 10 to 13) never reach the ping source:

No.     Time           Source                Destination           Protocol Length Info
      1 0.000000       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x336a, seq=0/0, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
      2 0.000216       Microchi_aa:bb:cc     Broadcast             ARP      60     Who has XX.YY.0.5? Tell XX.YY.0.137

Frame 2: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address:  XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address:  XX.YY.0.5

No.     Time           Source                Destination           Protocol Length Info
      3 0.000236                                                            42     <Ignored>

Frame 3: 42 bytes on wire (336 bits), 42 bytes captured (336 bits)
This frame is marked as ignored

No.     Time           Source                Destination           Protocol Length Info
      4 1.000802       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x336a, seq=1/256, ttl=64 (reply in 5)

No.     Time           Source                Destination           Protocol Length Info
      5 1.001035       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x336a, seq=1/256, ttl=64 (request in 4)

No.     Time           Source                Destination           Protocol Length Info
      6 4.704100       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x0d73, seq=0/0, ttl=64 (reply in 7)

No.     Time           Source                Destination           Protocol Length Info
      7 4.704337       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x0d73, seq=0/0, ttl=64 (request in 6)

No.     Time           Source                Destination           Protocol Length Info
      8 5.706916       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x0d73, seq=1/256, ttl=64 (reply in 9)

No.     Time           Source                Destination           Protocol Length Info
      9 5.707163       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x0d73, seq=1/256, ttl=64 (request in 8)

No.     Time           Source                Destination           Protocol Length Info
     10 15.979134      XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9613, seq=0/0, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
     11 15.979358      Microchi_aa:bb:cc     Broadcast             ARP      60     Who has  XX.YY.2.5? Tell  XX.YY.0.137

Frame 11: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address:  XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address:  XX.YY.2.5

No.     Time           Source                Destination           Protocol Length Info
     12 16.995054      XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9613, seq=1/256, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
     13 16.995268      Microchi_aa:bb:cc     Broadcast             ARP      60     Who has XX.YY.2.5? Tell XX.YY.0.137

Frame 13: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address: XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: XX.YY.2.5

The firewall configuration is unchanged between the two settings above. Only the coredevice gateware/firmware was updated. The faulty behavior persists when doing TCP requests instead of ICMP ones.

Given the captured ARP requests, it seems that the gateway is not configured properly on the coredevice. This faulty behavior is unchanged when using ip=use_dhcp.

If this is the issue, is there a way to set the gateway in the static IP case? For the DHCP case, I would expect the gateway to be broadcast by the DHCP server and set accordingly.

Your System (omit irrelevant parts)

  • Operating System: n/a
  • ARTIQ version: n/a
  • Version of the gateware and runtime loaded in the core device: 7.0.06ad76b.beta and 7.0.d17675e.beta
  • Hardware involved: Kasli v1.1
@airwoodix airwoodix changed the title from "Cannot access coredevice across subnets after c60de48" to "Cannot access coredevice across subnets after DHCP feature merge" on Jul 6, 2022
@sbourdeauducq
Member

@mbirtwell

@mbirtwell
Contributor

I'll try and take a look at this today or tomorrow.

@sbourdeauducq
Member

Reverted for release-7

@sbourdeauducq sbourdeauducq added this to the ARTIQ-8 milestone Jul 8, 2022
@mbirtwell
Contributor

So it seems like this wasn't really intended to be supported by smoltcp. smoltcp used to have a feature where it would fill the neighbour cache from any packet that it saw to try and avoid unnecessary ARPs. But that was removed because it caused problems if there were certain buggy devices also on the network. See commit and PR.

The artiq firmware configures the IP address with 0 prefix bits. This effectively claims that we're on the same subnet as the entire internet. Coupled with the above smoltcp feature, it meant that every packet received would add an entry to the neighbour cache, even if the sender wasn't strictly speaking a neighbour. So a packet that had been routed onto a subnet from another subnet would result in a neighbour cache entry mapping the origin's IP to the router's MAC address. Again, not strictly correct, but good enough to make this work in your case.
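The mechanism described above can be illustrated with a small Python model. This is only a sketch and not smoltcp or ARTIQ code: the `10.1.*` addresses stand in for the redacted `XX.YY.*` ones, and `NeighbourCache`, `is_on_link` and `ROUTER_MAC` are hypothetical names invented for the illustration.

```python
import ipaddress
from typing import Dict, Optional


def is_on_link(interface_cidr: str, destination: str) -> bool:
    """Return True if `destination` lies inside the interface's own subnet."""
    interface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(destination) in interface.network


class NeighbourCache:
    """Toy IP -> MAC table that learns from every received frame (the old behaviour)."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def learn_from_frame(self, src_ip: str, src_mac: str) -> None:
        # Unconditionally trust the Ethernet source of whatever arrived.
        self.entries[src_ip] = src_mac

    def lookup(self, ip: str) -> Optional[str]:
        return self.entries.get(ip)


# Placeholder MAC for the router interface on the coredevice's subnet (assumption).
ROUTER_MAC = "02:00:00:00:00:01"

# With 0 prefix bits the remote host is treated as a direct neighbour ...
zero_prefix_on_link = is_on_link("10.1.0.137/0", "10.1.2.5")
# ... whereas with a /24 prefix it is recognised as being off-link.
slash24_on_link = is_on_link("10.1.0.137/24", "10.1.2.5")

cache = NeighbourCache()
# A routed ping arrives: the IP source is the remote host,
# but the Ethernet source is the router that forwarded it.
cache.learn_from_frame(src_ip="10.1.2.5", src_mac=ROUTER_MAC)

# The reply is then sent to the router's MAC, which happens to be the right place.
reply_mac = cache.lookup("10.1.2.5")
```

In this model, once the cache no longer learns from received frames, `lookup("10.1.2.5")` returns `None` and the device has to ARP for an address that nobody on its segment will answer for, which matches frames 11 and 13 in the capture.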

So options are:

  • Ask smoltcp if we can have the automatic neighbour cache population back again. This might be possible with some extra filtering of the candidates, like requiring the packet to have a unicast destination address, or only doing it for packets that are addressed to us. I'll raise an issue on smoltcp.
  • Adding default route support. This'll still break for people on upgrade if they don't set a default route, but at least they can then do that to fix it. It should be easy to have the default route set from DHCP if you're using that.
  • Stay stuck on an old smoltcp version
  • Write this off as not supported. It seems like a bit of an accident that it worked in the first place to me.

That's roughly in order of my personal preference and there's a big gap between options 2 and 3.
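As a rough illustration of the extra filtering mentioned in the first option, the two candidate checks could look like the Python sketch below. It is a guess at what such a filter might test, not a proposal for smoltcp's actual API; the function names and the sample addresses other than the coredevice MAC from the capture are hypothetical.

```python
def is_unicast_mac(mac: str) -> bool:
    """An Ethernet address is unicast when the lowest bit of its first octet is 0."""
    first_octet = int(mac.split(":")[0], 16)
    return (first_octet & 0x01) == 0


def is_addressed_to_us(dst_mac: str, dst_ip: str, our_mac: str, our_ip: str) -> bool:
    """Stricter check: the frame targets both our MAC and our IP address."""
    return dst_mac.lower() == our_mac.lower() and dst_ip == our_ip


OUR_MAC = "80:1f:12:aa:bb:cc"  # coredevice MAC as shown in the capture
OUR_IP = "10.1.0.137"          # placeholder for XX.YY.0.137

# A broadcast frame (for example an ARP request) would be rejected by the first check.
broadcast_ok = is_unicast_mac("ff:ff:ff:ff:ff:ff")
# A routed ping sent to the coredevice would pass both checks.
ping_unicast_ok = is_unicast_mac(OUR_MAC)
ping_for_us = is_addressed_to_us(OUR_MAC, "10.1.0.137", OUR_MAC, OUR_IP)
```

Either check would only limit which frames are allowed to populate the cache; the entry learned from a routed packet would still map the remote IP to the router's MAC.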

@Dirbaio

Dirbaio commented Jul 8, 2022

> The artiq firmware configures the IP address with 0 prefix bits.

This is the issue. You should configure the smoltcp device with the correct prefix length: XX.YY.0.137/24. Then, when it wants to send packets to XX.YY.2.5, it'll see it's an out-of-subnet IP and send it to the default route instead (say, XX.YY.0.1), which knows how to route it.

If the default route is not in the ARP cache it'll find it with a Who has XX.YY.0.1? Tell XX.YY.0.137 ARP request. It should never do an ARP request for an out-of-subnet IP like the Who has XX.YY.2.5? Tell XX.YY.0.137 you're seeing now.
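The next-hop decision described here can be sketched in a few lines of Python. This is a simplified model, not smoltcp's implementation: `next_hop` is a hypothetical helper, and the `10.1.*` addresses (including the gateway `10.1.0.1`) are placeholders for the redacted `XX.YY.*` ones.

```python
import ipaddress
from typing import Optional


def next_hop(interface_cidr: str, destination: str,
             default_gateway: Optional[str]) -> Optional[str]:
    """Return the IP whose MAC must be resolved via ARP to reach `destination`.

    On-link destinations are resolved directly; everything else goes to the
    default gateway. None means there is no route to the destination.
    """
    interface = ipaddress.ip_interface(interface_cidr)
    dest = ipaddress.ip_address(destination)
    if dest in interface.network:
        return str(dest)
    return default_gateway


# Same subnet: ARP for the peer itself.
same_subnet = next_hop("10.1.0.137/24", "10.1.0.5", "10.1.0.1")
# Other subnet with a /24 prefix and a default route: ARP for the gateway.
other_subnet = next_hop("10.1.0.137/24", "10.1.2.5", "10.1.0.1")
# 0 prefix bits and no default route: ARP for the remote host directly,
# which is the "Who has XX.YY.2.5?" request seen in the capture.
zero_prefix = next_hop("10.1.0.137/0", "10.1.2.5", None)
# Correct prefix but no default route configured: nothing to send the packet to.
no_route = next_hop("10.1.0.137/24", "10.1.2.5", None)
```

The last case shows why fixing the prefix length alone is not enough: without a default route the off-link destination simply becomes unreachable.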

@mbirtwell
Contributor

Except that there is no default route, nor currently any support for setting one.

@Dirbaio

Dirbaio commented Jul 8, 2022

Then you should add it! :)

@airwoodix
Contributor Author

Thanks for the investigation! This looks promising.

It is quite a fortunate accident that cross-subnet access used to work, and it is now quite central to our workflow. @mbirtwell do you have capacity to work on a resolution?

@sbourdeauducq
Member

@airwoodix You can use release 7 in the meantime.

@jordens
Member

jordens commented Jul 26, 2022

The behavior was not at all an accident and smoltcp was explicitly written to support this. It just turned out to be too fragile. @airwoodix and @mbirtwell what are your plans here?

@mbirtwell
Contributor

I think doing the work to add the default gateway configuration is the best way forwards. I don't mind doing that, but it's not likely to be soon.

@airwoodix
Contributor Author

I also don't really have capacity to work on this at the moment. We track release-7 for now to work around it.

@mbirtwell
Contributor

I've managed to make a start on this: d0fe2c5, but it's not tested yet.

@thomasfire
Contributor

Hi @airwoodix, could you please describe (or better, draw) the network topology mentioned in this issue, including the router that connects the subnets?
I'm trying to reproduce the problem, but so far I have only managed to create an inner subnet, and the router I used doesn't seem to allow access to inner devices from outside, although I can access outer devices from the inner network.

@mbirtwell
Contributor

My setup for testing this was something like:

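# Let the host forward IPv4 packets between its interfaces
sudo sysctl -w net.ipv4.ip_forward=1
# Create a namespace and a veth pair with one end (veth6) inside it
sudo ip netns add client2
sudo ip link add veth5 type veth peer name veth6 netns client2
# Address both ends on a separate /24 and bring them up
sudo ip addr add 192.168.71.1/24 dev veth5
sudo ip netns exec client2 ip addr add 192.168.71.2/24 dev veth6
sudo ip netns exec client2 ip link set veth6 up
sudo ip link set veth5 up
# Route everything from the namespace through the host
sudo ip netns exec client2 ip route add default via 192.168.71.1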

This should set up a new network namespace on your computer, put a new network node in that namespace, and set up your computer as a router between that node and the rest of the network you are on.
