driver: ethernet to allow jumbo frames 9000 on rpi 4B #5419
Conversation
…ames 9000 on raspberry pi 4 Signed-off-by: Jing Luo <szmun.luoj@gmail.com>
fixup! gpio-fsm: Avoid truncation of delay jiffies
Changing values in a userspace API (uapi) header is definitely not the right solution. Increasing the buffer size in bcmgenet.c may be valid, but I'd then expect the MTU increase to come from a userspace call on the interface.
Same comments. What I suspect you intend is to increase
The dependency there on
Oops, sorry
So I have been pondering this for some time. I like the author's "push", and I hope I am correct to assume it was meant to trigger a discussion.
Still, I am assuming a lot here: that the Ethernet driver buffer size can be changed on the fly, that the generic infra I mention isn't something Intel-specific, and that the Linux net maintainer is open to such a contribution. But if we don't try, we will never know. So if (a big if) we get an upstream patch accepted by the maintainer, would that mean it could be considered for merging?
@delgh1 There's no point in updating this PR / your branch with merge commits from rpi-6.1.y. It is not going to be merged in its current form under any circumstances. Dare not accepted. @stevenhoving Can you point at the framework you're referring to that Intel use? I've already suggested changing
in bcmgenet.h and replacing the
IIRC @P33M has worked on bcmgenet in the past and may know of any issues with jumbo frames (or be able to point me at docs). We're almost always happy to backport patches that have been accepted upstream and add useful functionality. The main criterion is that it doesn't require loads of additional patches to add framework support that is missing in the earlier kernel revision.
My sum total of contributions to this driver is trivial edits to two lines. We don't have documentation other than register definitions (in particular, no architecture reference or programmer's model).
Ah, OK. Thanks. If the simplistic change works, then throwing it upstream to start the conversation sounds like the best plan.
Has anyone done this? Link? @delgh1 - have you generated any benchmark data to support the claim that a larger MTU adds any benefit? I tested this under the 5.4 series of kernels and found that simple file transfers with jumbo frames were much slower than with the default MTU value. Have you observed any stability regressions? The thread you linked has one report of NIC timeouts that are not automatically recovered (a reboot is required). Here is another recent thread showing regressions.
A larger MTU does have benefits for protocols that rely on UDP and disallow fragmentation. I did the simplistic changes on a 5.10.y kernel. It did not work for me:
As far as I can see, I am the only source of these regression reports, so I would really appreciate it if someone could repeat them on a recent kernel.
You've not given any details of what commit you reverted, which fixes you applied, or where you're getting a conflict.
How can we repeat something we have no details of? There is a proposal here of how jumbo frame support might potentially be implemented, or at least a suggestion of how the development might continue. Testing other unrelated suggestions should happen on the forums or an issue thread, not on a PR.
Well, here are more details:
I think that all working jumbo-frame solutions are based on 5.4 kernels or similar, with CRC offloading disabled by default.
@delgh1 Does your closing this PR mean you no longer have any interest in the issue?
I've had a quick hack around. The core validates whether the requested MTU is within the range defined by the card (https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L8672). The default is set to ETH_DATA_LEN at https://elixir.bootlin.com/linux/latest/source/net/ethernet/eth.c#L365, therefore the hack was bumping that up too.
to bcmgenet.c in addition to the mods for ENET_MAX_MTU_SIZE and RX_BUF_LENGTH previously discussed. That still doesn't work on 6.1 though, so it'd be interesting to try them on 5.15 if that is supposed to have worked with the hack previously.
What exactly does not work? Does an MTU of 1900 work? I would expect it to work up to nearly 2k.
I'm using a Netgear GS110TP with the max frame size on all ports set to 9000, and the Pi MTU set to 2000. I've got another Asix AX88179 based adapter around, so I'll confirm whether that behaves itself in this setup.
Port stats on the Netgear switch give combined TX & RX counts for packets of >1522 bytes. For an AX88179 on a Pi, the counter only increments by 1 per message when pinging the switch with a packet of 1528 bytes (1500 byte payload). A second Pi set up with the patches does respond to pings up to
That was on 5.15.92. Now retesting with the clean patches on 6.4.1.
So 6.4.1 with the diff
works up to
For >2004 byte packets, the switch port stats again show that the counters increase on TX, but not on RX.
Good job! This coincides with my findings when I ported the patch from 5.4 to 5.10. Does ping -s completely check the packet content? With hardware CRC enabled, it seemed to me that the hardware CRC engine puts the CRC in the wrong place in large enough outgoing packets. This could also be some misconfiguration; IIRC I saw this using UDP packets. I added a lot of logging code and found that larger packets get split into multiple DMA blocks upon reception. I could avoid this only by disabling receive status blocks. Again, maybe this is a misconfiguration. In the end I decided to roll back the commit introducing unconditional status blocks as a quick fix, rather than concatenating the DMA blocks again. I could not solve the hardware CRC issue on sending anyway; I could not guess the correct settings just from the source code.
side note: https://www.asix.com.tw/en/product/USBEthernet/Super-Speed_USB_Ethernet/AX88179 says
I know, although I was reading https://github.com/torvalds/linux/blob/master/drivers/net/usb/ax88179_178a.c#L1283
I'm generally using a ping payload size of 3000. |
So the driver is dropping what it sees as a fragmented packet. The error path is at https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/broadcom/genet/bcmgenet.c#L2325
For my 3000 byte payload ping, dma_length_status is 0x08403f80. Length 0x840 is 2112. Status 0x3f80 has DMA_SOP set (0x2000), but not DMA_EOP (0x4000). There is a register "GENET_0_RBUF_PKT_RDY_THLD - RX Buffer Packet Ready Threshold Register" at offset 0x08, which is the threshold for PKT_RDY being asserted, in units of 16 bytes. Default is 0x80, which * 16 bytes = 2048. Patch fragments
From the limited docs I have, that threshold applies whether or not the 64 byte status block is present, although the 64 byte header is meant to be taken into account if present (it should be a max of 0x7c when using 2 kB buffers: 1984 byte payload + 64 byte header = 2048 bytes total. The current driver doesn't do that, but it also has a max MTU of 1518). That makes me wonder about your report that it was working on 5.4 with the status buffers disabled. I'll create a PR with my cleaned-up diffs that give 3840 byte support, and then anyone can try them. I'm running out of ideas with regard to how to reassemble the fragmented packets for more than 3840 bytes. I'm sure the networking stack has a mechanism to deal with it, but I don't know enough about how it functions. There are the features NETIF_F_FRAGLIST and NETIF_F_SG (scatter-gather), but brief reading implies they affect transmit rather than receive. If we can chain the skbs for the received fragments, then that might work. An email to net-dev is probably warranted.
PR #5534. If you want to try it out, then once CI has built it,
Now I'm amazed that the original patch worked at all, too. Thankfully there are other reports besides mine. Are there other settings related to outgoing large packets that could explain my observation of packet corruption? Did you also test other protocols like xrdp or rcp? I think it is unlikely that ssh uses a lot of jumbo frames. I also considered reassembling incoming packets.
I was largely testing with ping, although I did check with iperf3 (in UDP mode), and I was assuming that one of them was checking the data was correct. I wasn't observing buffers being discarded due to corruption. Seeing as we can tell the hardware that it has 10 kB buffers, it feels like we shouldn't need to handle fragmented receives. However, it depends on how the hardware handles things when it hits the PKT_RDY threshold - does it switch to a new buffer, or just notify that data is ready and keep going with the same one? If the latter, then why does it support 16 kB buffers in the rings? Those are questions for those who support the hardware, as I don't have that information. e1000 is having to mess with passing page pointers into
Disabling the status block shouldn't be too difficult as a hack, but I need to move on to other things right now. It also loses all the hardware checksum support, so it may have a negative effect on performance overall.
Thank you so far, 6by9. I think it is a huge leap forward anyway. As for the status blocks, I agree, especially as long as it is unclear why status blocks make any difference. Could you still have a look at your documentation to see whether there is also a TX threshold?
There is also a TBUF_TXCHK_PKTRDY_THLD register that has almost the same description as RBUF_PKT_RDY_THLD.
Units of 16 bytes, default of 0x80. I wonder if reading the status register (instead of looking at the status block) is just a timing thing. Time for a quick hack to read the register when we're in the fragmented-packet case, and see if that magically makes things work...
Another wild guess: the threshold values and PKT_RDY trigger the DMA engine for incoming packets and the transmit logic for outgoing ones, assuming there are extra FIFOs to receive from or send to the wire. However, this does not explain why incoming packets get split into multiple DMA buffers at exactly that threshold. What does the documentation say about threshold == 0? Or isn't it allowed? I don't think I get what you mean by the timing thing. I already benchmarked the outgoing packet rate and saw more than 50k packets/s and also more than 50k interrupts/s, so less than 20 µs per packet. IIRC larger incoming packets consume multiple DMA blocks and the hardware also indicates that it has consumed more than one block.
My guess is that the EOP flag is set in the next DMA buffer, which holds the remaining bytes. Currently, the driver requires both flags in a single DMA block.
Forgot to hit the "comment" button before I left the office. Reading the register when DMA_EOP isn't set does give me a len_stat value with DMA_EOP set. With a
A status block length of 0xf40 would match an RX threshold of 0xf0 plus 64 (0x40) bytes of status block. So we're getting a packet in, and the driver is kicking it upwards, but it appears that something in the stack is dropping it before it gets to my client app (ping). Presumably a checksum is failing somewhere, as the offload isn't being performed (I believe I'm telling the stack that the hardware hasn't done it). Diff
50k 9000 byte packets/s would be 3.6 Gbit/s, based on 8 transmitted bits per useful byte (i.e. no sync bits) - that's not going to happen on a 1 Gbit/s link.
The hardware writes directly into RAM (in 256 byte bursts), and is given 10 kB per buffer in the ring. It has no need to move on to the next buffer when it hits the threshold trigger, but I don't see how it can write a valid status block to the start of the overall buffer if it writes linearly. Perhaps it does have an internal buffer for the entire frame in the network block.
I did the benchmark with a commercial application sending UDP packets of increasing size. For packets of 1500 bytes, the application cannot keep up with wire speed, the TX ring runs empty, and every packet gets an interrupt. As soon as the application can keep up with wire speed (with larger packets), the interrupt rate falls greatly because the hardware issues an interrupt only every 10th packet (IIRC). Your assumption that the status block is written before the end of the packet has been seen on the wire is convincing to me. If there is some limited FIFO in the system, then smaller packets can be received entirely before being transferred to RAM, but jumbo frames cannot. Anyway, from your logging of DMA blocks it seems that the incoming packet is not scattered; only the status block lacks the EOP flag. This does not match my memories, but it does match the assumption above. And it is even easier for the driver.
https://www.spinics.net/lists/netdev/msg919558.html I've created a new issue to allow tracking of any progress - #5561. Closed PRs will tend to be ignored. |
Known issue: can set the MTU to 9000, but cannot receive a packet larger than 2004 bytes.
The situation has not changed since last year (Mar 2022):
https://forums.raspberrypi.com/viewtopic.php?t=330976
Only tested on my rpi 4B rev 1.5. I dare you to merge this.