Data and SEND_SHUTDOWN won't get sent to the peer #2063

ManickaP · 2021-10-07T16:19:56Z

Describe the bug

This is fairly reproducible in our HTTP/3 stress tests. At around 20 minutes mark (~300 000 requests), a request will fail with timeout. It doesn't seem like we do something wrong, but it still might be our fault. I also collected dump from about 5 seconds after the requests gets stuck and haven't seen anything suspicious in it.

Attached log:
client stream 0x7f18f4010150
server stream 0x7f6b80027090

Request gets send around 02:36:30 and then at 02:36:40 it gets cancelled on a timeout so the streams get shutdown and closed eventually.

This is the filtered msquic.log (full msquic.log):

  142: [1][7193.71d1][02:36:31.488590][strm][0x7f18f4010150] Received ack for 326 bytes, offset=102, FF=0x0
  143: [1][7193.71d1][02:36:31.488590][strm][0x7f18f4010150] SF:0 FC:32768 QS:428 MAX:428 UNA:428 NXT:428 RECOV:0-0
  146: [1][7193.71d1][02:36:31.488591][strm][0x7f18f4010150] Received ack for 0 bytes, offset=428, FF=0x2
  147: [1][7193.71d1][02:36:31.488592][strm][0x7f18f4010150] Send queue completely drained
  148: [1][7193.71d1][02:36:31.488592][strm][0x7f18f4010150] Send State: 5
  149: [1][7193.71d1][02:36:31.488593][strm][0x7f18f4010150] Indicating QUIC_STREAM_EVENT_SEND_SHUTDOWN_COMPLETE
  150: [1][7193.71d1][02:36:31.488621][strm][0x7f18f4010150] SF:0 FC:32768 QS:428 MAX:428 UNA:428 NXT:428 RECOV:0-0
  168: [8][7137.74ef][02:36:31.488652][ api] Enter 22 (0x7f6b80027090).
  175: [5][7137.716c][02:36:31.488666][strm][0x7f6b80027090] Send Request [0x7f6bc0011080] queued with 398 bytes at offset 6 (flags 0x0)
  176: [5][7137.716c][02:36:31.488667][strm][0x7f6b80027090] Indicating QUIC_STREAM_EVENT_SEND_COMPLETE [0x7f6bc0011080]
  177: [5][7137.716c][02:36:31.488678][strm][0x7f6b80027090] SF:16 FC:32768 QS:404 MAX:0 UNA:0 NXT:0 RECOV:0-0
  181: [9][7137.742d][02:36:31.488793][ api] Enter 21 (0x7f6b80027090).
  188: [1][7137.716c][02:36:31.488807][strm][0x7f6b80027090] Setting flags 0x64 (existing flags: 0x16)
  190: [1][7137.716c][02:36:31.488809][strm][0x7f6b80027090] SF:80 FC:32768 QS:404 MAX:0 UNA:0 NXT:0 RECOV:0-0
  191: [1][7137.716c][02:36:31.488810][strm][0x7f6b80027090] Send State: 4

Also I'm running debug build of msquic.

Affected OS

Additional OS information

Arch Linux kernel 5.14
But seeing the same timeout in our stress pipeline - Debian bullseye docker image - as well.

MsQuic version

main

Steps taken to reproduce bug

Run dotnet/runtime HTTP/3 stress tests for about 20 minutes and you'll see HTTP request failing with timeout.

Expected behavior

Client to receive the data and SEND_SHUTDOWN from the server.

Actual outcome

It seems like server tries to send the data and shutdown its sending side, but it never gets reported on the client side (no RECEIVE event, no PEER_SEND_SHUTDOWN).

Additional details

No response

The text was updated successfully, but these errors were encountered:

nibanks · 2021-10-07T16:34:42Z

I see the data queued onto the server (server side) but it's never put into a packet. This likely means the stream is blocked, but I don't see all the events for the stream back to when it was created. We should have at least one of these events:

        QuicTraceEvent(
            StreamOutFlowBlocked,
            "[strm][%p] Send Blocked Flags: %hhu",
            Stream,
            Stream->OutFlowBlockedReasons);

Do you still have all the logs?

ManickaP · 2021-10-07T16:38:04Z

msquic.1508.log
It starts with client side stream creation.

I still have them all, from the whole run, but it's 10s of GBs of data.

nibanks · 2021-10-07T16:43:08Z

Ok. So the stream isn't blocked, but the connection is, on FC:

170 [5][7137.716c][02:36:30.486719][conn][0x7f6b90000b60] OUT: BytesSent=933673264 InFlight=0 InFlightMax=14378 CWnd=28756 SSThresh=20129 ConnFC=0 ISB=131072 PostedBytes=0 SRtt=115

Notice the ConnFC=0 there. This means you must have a lot of data pending waiting for your receive path to drain.

ManickaP · 2021-10-07T17:46:40Z

This means you must have a lot of data pending waiting for your receive path to drain.

You probably mean in general on the connection level? Not for this particular stream?

nibanks · 2021-10-07T17:47:29Z

You probably mean in general on the connection level? Not for this particular stream?

Correct.

ManickaP · 2021-10-07T17:49:35Z

So I assume that msquic expect the app to drain all data from all streams before it shuts them down and closes them? Even if we abort reads?

nibanks · 2021-10-07T17:51:14Z

So I assume that msquic expect the app to drain all data from all streams before it shuts them down and closes them? Even if we abort reads?

No, if you abort the stream receive path, then you don't have to drain the receives.

ManickaP · 2021-10-07T18:00:20Z

We're aborting reads from stream Dispose in case we didn't reach RECEIVE event with FIN flag:
https://github.com/dotnet/runtime/blob/b36548cae1669486a24ddecf3e491d946989ead2/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs#L828

So we shouldn't have any data pending on the connection. We also don't have any pending (undisposed) streams on the connection AFAIK.

Do you know of any other scenario/behavior that might cause this?

nibanks · 2021-10-07T18:05:09Z

I'm double checking the code and will get back to you shortly.

nibanks · 2021-10-07T18:09:57Z

Well, uggg. It looks like you found another bug around receive abort. We're not correctly updating the connection-wide FC if the app aborts the stream receive without completing the receive first. I think I can quickly have a fix for you to test though.

nibanks added this to the Nickel milestone Oct 7, 2021

nibanks added the Bug: App A bug in the application above MsQuic label Oct 7, 2021

nibanks added Bug: Core A code bug in the Core MsQuic code and removed Bug: App A bug in the application above MsQuic labels Oct 7, 2021

ManickaP mentioned this issue Oct 7, 2021

[HTTP/3] Stress test regression dotnet/runtime#56310

Closed

nibanks mentioned this issue Oct 7, 2021

Fix Connection FC Handling on Stream Abort #2066

Merged

nibanks closed this as completed in #2066 Oct 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data and SEND_SHUTDOWN won't get sent to the peer #2063

Data and SEND_SHUTDOWN won't get sent to the peer #2063

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

nibanks commented Oct 7, 2021

Data and SEND_SHUTDOWN won't get sent to the peer #2063

Data and SEND_SHUTDOWN won't get sent to the peer #2063

Comments

ManickaP commented Oct 7, 2021

Describe the bug

Affected OS

Additional OS information

MsQuic version

Steps taken to reproduce bug

Expected behavior

Actual outcome

Additional details

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

ManickaP commented Oct 7, 2021

nibanks commented Oct 7, 2021

nibanks commented Oct 7, 2021