Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data and SEND_SHUTDOWN won't get sent to the peer #2063

Closed
1 of 6 tasks
ManickaP opened this issue Oct 7, 2021 · 10 comments · Fixed by #2066
Closed
1 of 6 tasks

Data and SEND_SHUTDOWN won't get sent to the peer #2063

ManickaP opened this issue Oct 7, 2021 · 10 comments · Fixed by #2066
Labels
Bug: Core A code bug in the Core MsQuic code
Milestone

Comments

@ManickaP
Copy link
Member

ManickaP commented Oct 7, 2021

Describe the bug

This is fairly reproducible in our HTTP/3 stress tests. At around 20 minutes mark (~300 000 requests), a request will fail with timeout. It doesn't seem like we do something wrong, but it still might be our fault. I also collected dump from about 5 seconds after the requests gets stuck and haven't seen anything suspicious in it.

Attached log:
client stream 0x7f18f4010150
server stream 0x7f6b80027090

Request gets send around 02:36:30 and then at 02:36:40 it gets cancelled on a timeout so the streams get shutdown and closed eventually.

This is the filtered msquic.log (full msquic.log):

  142: [1][7193.71d1][02:36:31.488590][strm][0x7f18f4010150] Received ack for 326 bytes, offset=102, FF=0x0
  143: [1][7193.71d1][02:36:31.488590][strm][0x7f18f4010150] SF:0 FC:32768 QS:428 MAX:428 UNA:428 NXT:428 RECOV:0-0
  146: [1][7193.71d1][02:36:31.488591][strm][0x7f18f4010150] Received ack for 0 bytes, offset=428, FF=0x2
  147: [1][7193.71d1][02:36:31.488592][strm][0x7f18f4010150] Send queue completely drained
  148: [1][7193.71d1][02:36:31.488592][strm][0x7f18f4010150] Send State: 5
  149: [1][7193.71d1][02:36:31.488593][strm][0x7f18f4010150] Indicating QUIC_STREAM_EVENT_SEND_SHUTDOWN_COMPLETE
  150: [1][7193.71d1][02:36:31.488621][strm][0x7f18f4010150] SF:0 FC:32768 QS:428 MAX:428 UNA:428 NXT:428 RECOV:0-0
  168: [8][7137.74ef][02:36:31.488652][ api] Enter 22 (0x7f6b80027090).
  175: [5][7137.716c][02:36:31.488666][strm][0x7f6b80027090] Send Request [0x7f6bc0011080] queued with 398 bytes at offset 6 (flags 0x0)
  176: [5][7137.716c][02:36:31.488667][strm][0x7f6b80027090] Indicating QUIC_STREAM_EVENT_SEND_COMPLETE [0x7f6bc0011080]
  177: [5][7137.716c][02:36:31.488678][strm][0x7f6b80027090] SF:16 FC:32768 QS:404 MAX:0 UNA:0 NXT:0 RECOV:0-0
  181: [9][7137.742d][02:36:31.488793][ api] Enter 21 (0x7f6b80027090).
  188: [1][7137.716c][02:36:31.488807][strm][0x7f6b80027090] Setting flags 0x64 (existing flags: 0x16)
  190: [1][7137.716c][02:36:31.488809][strm][0x7f6b80027090] SF:80 FC:32768 QS:404 MAX:0 UNA:0 NXT:0 RECOV:0-0
  191: [1][7137.716c][02:36:31.488810][strm][0x7f6b80027090] Send State: 4

Also I'm running debug build of msquic.

Affected OS

  • Windows Server 2022
  • Windows 11
  • Windows Insider Preview (specify affected build below)
  • Ubuntu
  • Debian
  • Other (specify below)

Additional OS information

Arch Linux kernel 5.14
But seeing the same timeout in our stress pipeline - Debian bullseye docker image - as well.

MsQuic version

main

Steps taken to reproduce bug

Run dotnet/runtime HTTP/3 stress tests for about 20 minutes and you'll see HTTP request failing with timeout.

Expected behavior

Client to receive the data and SEND_SHUTDOWN from the server.

Actual outcome

It seems like server tries to send the data and shutdown its sending side, but it never gets reported on the client side (no RECEIVE event, no PEER_SEND_SHUTDOWN).

Additional details

No response

@nibanks nibanks added this to the Nickel milestone Oct 7, 2021
@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

I see the data queued onto the server (server side) but it's never put into a packet. This likely means the stream is blocked, but I don't see all the events for the stream back to when it was created. We should have at least one of these events:

        QuicTraceEvent(
            StreamOutFlowBlocked,
            "[strm][%p] Send Blocked Flags: %hhu",
            Stream,
            Stream->OutFlowBlockedReasons);

Do you still have all the logs?

@ManickaP
Copy link
Member Author

ManickaP commented Oct 7, 2021

msquic.1508.log
It starts with client side stream creation.

I still have them all, from the whole run, but it's 10s of GBs of data.

@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

Ok. So the stream isn't blocked, but the connection is, on FC:

170 [5][7137.716c][02:36:30.486719][conn][0x7f6b90000b60] OUT: BytesSent=933673264 InFlight=0 InFlightMax=14378 CWnd=28756 SSThresh=20129 ConnFC=0 ISB=131072 PostedBytes=0 SRtt=115

Notice the ConnFC=0 there. This means you must have a lot of data pending waiting for your receive path to drain.

@nibanks nibanks added the Bug: App A bug in the application above MsQuic label Oct 7, 2021
@ManickaP
Copy link
Member Author

ManickaP commented Oct 7, 2021

This means you must have a lot of data pending waiting for your receive path to drain.

You probably mean in general on the connection level? Not for this particular stream?

@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

You probably mean in general on the connection level? Not for this particular stream?

Correct.

@ManickaP
Copy link
Member Author

ManickaP commented Oct 7, 2021

So I assume that msquic expect the app to drain all data from all streams before it shuts them down and closes them? Even if we abort reads?

@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

So I assume that msquic expect the app to drain all data from all streams before it shuts them down and closes them? Even if we abort reads?

No, if you abort the stream receive path, then you don't have to drain the receives.

@ManickaP
Copy link
Member Author

ManickaP commented Oct 7, 2021

We're aborting reads from stream Dispose in case we didn't reach RECEIVE event with FIN flag:
https://github.com/dotnet/runtime/blob/b36548cae1669486a24ddecf3e491d946989ead2/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs#L828

So we shouldn't have any data pending on the connection. We also don't have any pending (undisposed) streams on the connection AFAIK.

Do you know of any other scenario/behavior that might cause this?

@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

I'm double checking the code and will get back to you shortly.

@nibanks nibanks added Bug: Core A code bug in the Core MsQuic code and removed Bug: App A bug in the application above MsQuic labels Oct 7, 2021
@nibanks
Copy link
Member

nibanks commented Oct 7, 2021

Well, uggg. It looks like you found another bug around receive abort. We're not correctly updating the connection-wide FC if the app aborts the stream receive without completing the receive first. I think I can quickly have a fix for you to test though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug: Core A code bug in the Core MsQuic code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants