GSO severely degrades connection performance #4394
Minutes? That's incredibly long! How many pings per second are you sending? I can't really think of why, but there might also be a problem in quic-go's GSO code path (Line 1920 in d540f54).
We use a different code path to better make use of batching. Maybe there's a subtle bug somewhere there? A qlog might be helpful, depending on where the problem lies. It would definitely be a good starting point. Would you mind recording one and posting it here?
When I hit 'enter', it opens a single stream and writes the bytes. I'll gather logs and share them here shortly.
I ran the program to capture qlogs. Though one is technically the server that was listening for the incoming connection, they're really two peers. I initiated two pings from the "server".
I also initiated two pings from the "client".
Here are the logs: Unfortunately, my SSH connection to the two machines timed out while I walked away from the computer. Hopefully that didn't introduce garbage into the end of the logs. I'll try to instrument the code tomorrow to see which functions the program is spending its time in. If you have any other things you want me to try, please don't hesitate to ask.
If you load the qlogs into qvis, you can see that there's an absolutely massive amount of packet loss happening. No wonder things are slow! Now the question is what causes this packet loss. It would probably be helpful if you could capture a tcpdump on both machines, so we can see which packets are actually going over the wire. It might also be interesting to add some logging in Lines 85 to 92 in d540f54.
Hopefully I captured enough of what you needed. I modified `writePacket` to log every call:

```go
func (c *sconn) writePacket(p []byte, addr net.Addr, oob []byte, gsoSize uint16, ecn protocol.ECN) error {
	slog.Info("writePacket", "hexP", hex.EncodeToString(p), "addr", addr.String(), "hexOOB", hex.EncodeToString(oob), "gsoSize", gsoSize, "ecn", ecn)
	_, err := c.WritePacket(p, addr, oob, gsoSize, ecn)
	if err != nil && !c.wroteFirstPacket && isPermissionError(err) {
		slog.Info("writePacket error path", "err", err)
		_, err = c.WritePacket(p, addr, oob, gsoSize, ecn)
	}
	c.wroteFirstPacket = true
	return err
}
```

Then I re-ran the test with tcpdump running on each machine, in addition to the logging above. tcpdump on the client:
And its pcap file: https://soda.link/tcpdump-client.pcap
tcpdump on the server:
And its pcap file: https://soda.link/tcpdump-server.pcap
@arashpayan Sorry for the late response. I'm busy preparing the v0.43 release. I'll take a look at this once the release is out, hopefully next week.
Fixed by #4456, I guess?
Possibly. And that kernel was one of the versions that was buggy, if I remember correctly. @arashpayan Could you rerun your experiment on a more recent (>= 5) kernel version?
Kernel 5.x is only available on RHEL 9 as far as I'm aware. |
Unfortunately, I don't have any administrative control over the environment where I gathered those logs. I was allowed to gather them to debug it, but the system will be stuck with RHEL 8.1, kernel 4.18 until someone else decides to upgrade. I think it's probably safe to close this issue though. If I end up being able to reproduce it in the same environment with a kernel version >= 5, I'll comment here.
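The discussion above hinges on gating GSO by kernel version (>= 5). The sketch below shows how such a gate might look; `parseKernelRelease`, `gsoSupported`, and the ">= 5" cutoff are illustrative assumptions, not quic-go's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelRelease extracts major and minor version numbers from a
// uname-style release string such as "4.18.0-147.el8.x86_64".
// This helper is hypothetical, for illustration only.
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected release string: %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// gsoSupported applies the ">= 5" cutoff suggested in the discussion:
// on older kernels with buggy UDP GSO behavior, fall back to plain sends.
func gsoSupported(release string) bool {
	major, _, err := parseKernelRelease(release)
	if err != nil {
		return false // be conservative on parse failure
	}
	return major >= 5
}

func main() {
	fmt.Println(gsoSupported("4.18.0-147.el8.x86_64")) // the reporter's kernel
	fmt.Println(gsoSupported("5.14.0-362.el9.x86_64")) // a RHEL 9 era kernel
}
```

In a real library the release string would come from `uname(2)` (e.g. `golang.org/x/sys/unix.Uname`) rather than a literal.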
Thank you @arashpayan!
I found a bug in an environment where all communication between machines was taking hundreds of thousands of times longer than expected. After some investigation, I created a small program to reproduce the issue (https://github.com/cloudsoda/quic-go-gso-bug) and determined that the regression was introduced in v0.38.0 (the GSO release). The same program compiled with v0.37.6 of quic-go works just fine. I was able to confirm it's GSO related because with any release from v0.38.0 onward, the program works fine if you pass `QUIC_GO_DISABLE_GSO=true`.
To be clear, this doesn't happen everywhere. Here's the configuration of the machines in the environment where it does occur.
Configuration:
- Distribution: Red Hat Enterprise Linux
- Version: 8.1
- Kernel: 4.18.0-147.el8.x86_64
- CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Reproduction steps
1. Run the `build.sh` script to generate a statically linked binary named `flubber`
2. On the server: `./flubber server`
3. On the client: `./flubber client ip.address.of.server:4080`
Expected
On machines plugged into a switch right next to each other, you should see a roundtrip time on the order of hundreds of microseconds.
Actual
The roundtrip of the ping is many seconds to many minutes. Below is sample output from two machines, each running the program compiled with quic-go from the HEAD of master (d540f545b0b70217220eb0fbd5278ece436a7a20).
The server:
The client:
If the program is started with `QUIC_GO_DISABLE_GSO=true` on just one side of the connection, the ping times are drastically reduced to ~30ms. That's still slow, but better. If both sides are started with `QUIC_GO_DISABLE_GSO=true`, the ping times are in the expected range of hundreds of microseconds.
NOTE: I wasn't sure if qlogs would be helpful in this case, but if they would be, I am happy to generate them.
Thank you for your time.