Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUIC: MsQuicListener can randomly fail to accept new connection #55979

Closed
wfurt opened this issue Jul 20, 2021 · 21 comments · Fixed by #100947
Closed

QUIC: MsQuicListener can randomly fail to accept new connection #55979

wfurt opened this issue Jul 20, 2021 · 21 comments · Fixed by #100947
Assignees
Labels
area-System.Net.Quic in-pr There is an active PR which will close this issue when it is merged test-enhancement Improvements of test source code
Milestone

Comments

@wfurt
Copy link
Member

wfurt commented Jul 20, 2021

That can be obvious with

System.Net.Quic.QuicException : Connection has been shutdown by transport. Error Code: CONNECTION_REFUSED

however there are more hidden implications:
#55642
we have test fragments like

            ValueTask clientTask = clientConnection.ConnectAsync();
            using QuicConnection serverConnection = await listener.AcceptConnectionAsync();
            await clientTask;

if the connect fails, the AcceptConnectionAsync() would hang forever since the clientTask is not awaited.
There are other reasons why we hang and @stephentoub is helping with the other patterns.

as @ManickaP pointed in #53224
https://github.com/microsoft/msquic/blob/3898b2b88085a478eeb844885ea78daa9b060d06/src/core/stream.c#L207-L211

    if (QuicConnIsClosed(Stream->Connection) ||
        Stream->Flags.Started) {
        Status = QUIC_STATUS_INVALID_STATE;
        goto Exit;
    }

That check is QUIC protocol state. If we get close from peer, there is race condition there.
I correlate this with packet capture and I can that the failed stream/connection got REFUSED message as well.
test-refused.pcapng.zip

This is unpleasant as it is difficult to hook any retry logic to INVALID_STATE.

There may be more to it but this is partially caused by (MS?)QUIC design.
I originally thought there may be some race condition starting the listener but clearly the message is well-formate QUIC.
When I debug the failures, there would be NO callback on the MsQuicListener and the logic happens inside msquic.
@nibanks pointed me to QuicWorkerIsOverloaded and when this is true, listener would refuse to take up more work e.g. accept new connections. There may be other reasons why msquic would refuse new connection.

This is also unpleasant as our tests do stress CPU and we are at point where any test can impact any test.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jul 20, 2021
@ghost
Copy link

ghost commented Jul 20, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

That can be obvious with

System.Net.Quic.QuicException : Connection has been shutdown by transport. Error Code: CONNECTION_REFUSED

however there are more hidden implications:
#55642
we have test fragments like

            ValueTask clientTask = clientConnection.ConnectAsync();
            using QuicConnection serverConnection = await listener.AcceptConnectionAsync();
            await clientTask;

if the connect fails, the AcceptConnectionAsync() would hang forever since the clientTask is not awaited.
There are other reasons why we hang and @stephentoub is helping with the other patterns.

as @ManickaP pointed in #53224
https://github.com/microsoft/msquic/blob/3898b2b88085a478eeb844885ea78daa9b060d06/src/core/stream.c#L207-L211

    if (QuicConnIsClosed(Stream->Connection) ||
        Stream->Flags.Started) {
        Status = QUIC_STATUS_INVALID_STATE;
        goto Exit;
    }

That check is QUIC protocol state. If we get close from peer, there is race condition there.
I correlate this with packet capture and I can that the failed stream/connection got REFUSED message as well.
test-refused.pcapng.zip

This is unpleasant as it is difficult to hook any retry logic to INVALID_STATE.

There may be more to it but this is partially caused by (MS?)QUIC design.
I originally thought there may be some race condition starting the listener but clearly the message is well-formate QUIC.
When I debug the failures, there would be NO callback on the MsQuicListener and the logic happens inside msquic.
@nibanks pointed me to QuicWorkerIsOverloaded and when this is true, listener would refuse to take up more work e.g. accept new connections. There may be other reasons why msquic would refuse new connection.

This is also unpleasant as our tests do stress CPU and we are at point where any test can impact any test.

Author: wfurt
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@karelz
Copy link
Member

karelz commented Jul 20, 2021

Triage: Crank up the threshold limit higher for testing.

@karelz karelz added test-bug Problem in test source code (most likely) and removed untriaged New issue has not been triaged by the area owner labels Jul 20, 2021
@karelz karelz added this to the 6.0.0 milestone Jul 20, 2021
@wfurt
Copy link
Member Author

wfurt commented Jul 25, 2021

Aside from the CPU check in QuicWorkerIsOverloaded, I see many cases when we hit

if (Connection->ReceiveQueueCount >= QUIC_MAX_RECEIVE_QUEUE_COUNT) {
..
QuicPacketLogDrop(Connection, CxPlatDataPathRecvDataToRecvPacket(Datagram), "Max queue limit reached");

this may be related to existing tests - they are weird in a sense that we send lot of 1-2 bytes chunks so it seems easy or overrun the 180 limit. I wish this is something we could tune up for the test runs @nibanks but it seems to be compile time constant.

@nibanks
Copy link

nibanks commented Jul 25, 2021

IMHO, this is not particularly interesting. When you run things for long enough at max CPU queues build up. You cannot queue unlimited. You will continue to hit issues like this. Dropped packets are not fatal. They will get retransmitted.

@wfurt
Copy link
Member Author

wfurt commented Jul 26, 2021

Point taken. We have quite a few randomly failing tests in CI - something not trivial to investigate.
I assume we can get to the counters, right? Is there something new can look at to know for sure why given connection failed? (assuming we have access to the Listener and client connection)

@wfurt
Copy link
Member Author

wfurt commented Jul 26, 2021

I put assert to QuicWorkerIsOverloaded and I can see it hit - took ~ 200 iterations in this case.

(lldb) target create "../../testhost/net6.0-Linux-Debug-x64/dotnet" --core "core"
Core file '/home/furt/github/wfurt-runtime/artifacts/bin/System.Net.Quic.Functional.Tests/net6.0-Unix-Debug/core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGABRT
  * frame #0: 0x00007fb949b5518b libc.so.6`__GI_raise(sig=2) at raise.c:51
    frame #1: 0x00007fb949b3492e libc.so.6`__GI_abort at abort.c:100
    frame #2: 0x00007fb8cdf6715b libmsquic.so`quic_bugcheck at platform_posix.c:61
    frame #3: 0x00007fb8cdf0a78d libmsquic.so`QuicRegistrationAcceptConnection at worker.h:118
    frame #4: 0x00007fb8cdf0a74c libmsquic.so`QuicRegistrationAcceptConnection(Registration=<unavailable>, Connection=<unavailable>) at registration.c:356
    frame #5: 0x00007fb8cdf07154 libmsquic.so`QuicListenerAcceptConnection(Listener=0x00007fb854006410, Connection=0x00007fb8840070a0, Info=0x00007fb8377fd9c0) at listener.c:557
    frame #6: 0x00007fb8cdf20566 libmsquic.so`QuicBindingAcceptConnection(Binding=<unavailable>, Connection=0x00007fb8840070a0, Info=0x00007fb8377fd9c0) at binding.c:555
    frame #7: 0x00007fb8cdf37f30 libmsquic.so`QuicCryptoProcessData(Crypto=0x00007fb8840078c0, IsClientInitial=<unavailable>) at crypto.c:1735
    frame #8: 0x00007fb8cdf383e6 libmsquic.so`QuicCryptoProcessFrame(Crypto=0x00007fb8840078c0, KeyType=<unavailable>, Frame=<unavailable>) at crypto.c:1250
    frame #9: 0x00007fb8cdf2e841 libmsquic.so`QuicConnRecvFrames(Connection=0x00007fb8840070a0, Path=<unavailable>, Packet=0x00007fb854053410, ECN=<unavailable>) at connection.c:4199
    frame #10: 0x00007fb8cdf313e7 libmsquic.so`QuicConnRecvDatagramBatch(Connection=0x00007fb8840070a0, Path=<unavailable>, BatchCount=<unavailable>, Datagrams=0x00007fb8377fdcc0, Cipher=<unavailable>, RecvState=<unavailable>) at connection.c:5102
    frame #11: 0x00007fb8cdf31dca libmsquic.so`QuicConnRecvDatagrams(Connection=0x00007fb8840070a0, DatagramChain=<unavailable>, DatagramChainCount=<unavailable>, IsDeferred=<unavailable>) at connection.c:5294
    frame #12: 0x00007fb8cdf322dd libmsquic.so`QuicConnFlushRecv(Connection=0x00007fb8840070a0) at connection.c:5438
    frame #13: 0x00007fb8cdf34c08 libmsquic.so`QuicConnDrainOperations(Connection=0x00007fb8840070a0) at connection.c:6819
    frame #14: 0x00007fb8cdf191a5 libmsquic.so`QuicWorkerProcessConnection(Worker=0x00007fb85c0380a0, Connection=0x00007fb8840070a0) at worker.c:488
    frame #15: 0x00007fb8cdf197d6 libmsquic.so`QuicWorkerThread(Context=0x00007fb85c0380a0) at worker.c:576
    frame #16: 0x00007fb94a05c609 libpthread.so.0`start_thread + 217
    frame #17: 0x00007fb949c31293 libc.so.6`__GI___clone at clone.S:95

@karelz
Copy link
Member

karelz commented Jul 27, 2021

The impact on our tests is larger and larger, affecting entire runtime repo. We need to either find a way to disable the behavior in msquic during our tests, or we have to find a workaround to not hit it in our tests (e.g. run msquic tests serially). We need some solution ASAP (in a day or two).

As an example: Even our BasicTest in #56269 is failing too often and we should disable it in CI due to the noise it is causing.

@wfurt
Copy link
Member Author

wfurt commented Aug 5, 2021

We talk about it more today @nibanks and just to confirm and clarify. When MsQuic feels it is overloaded it would send back packages with notification instead of just ignoring the request and waiting for re-transmit, right?

And is there any other reason why client may get CONNECTION_REFUSED error?
I'm primarily asking if this is something we can depend on for test connection retry or if there is any better way how detect this situation.

@nibanks
Copy link

nibanks commented Aug 5, 2021

In this case, MsQuic should inform the peer with the CONNECTION_REFUSED error. There shouldn't be any other reasons that you'd see this error from an MsQuic server. You can retry if you want. But you're just going to be adding more load to the system and there's no guarantee you'll succeed any time soon. If you have enough parallel tests all "retrying until they succeed" it's possible you never get there. Perhaps if you have "sleep and retry" model it's work better.

@wfurt
Copy link
Member Author

wfurt commented Aug 5, 2021

I see no evidence that the system would be generally overloaded. When I run it locally on my vm I see 20-80% cpu utilization. There may be spikes and with docker and virtualization there are no real-time guarantees.

At this moment this is biggest reason. for CI instability impacting all .NET contributors. There are other factors like occasional DNS failures or outages in external Azure services. But all the others are quite rare so we live with them.
Also we are yet to run tests on ARM and some of those machines are SLOW.

If we can move this to the occasional but rare category it would be still big improvement. That may be wait and retry, increasing the threshold or some other strategy. For example could MsQuic give us event so we know that particular connection is going to fail?

@ManickaP
Copy link
Member

Triage: partial fix in PR. We also want to disable parallelism of QUIC tests to minimize the probability of this issue, after all that's what msquic does in their test suite.
Merge the PR in flight, get in the PR with disabled parallelism and close this.
If it comes back, we'd need to disable tests more aggressively since there's no easy remedy to this problem.

@karelz karelz added test-enhancement Improvements of test source code and removed enhancement Product code improvement that does NOT require public API changes/additions labels Aug 2, 2022
@karelz karelz modified the milestones: 7.0.0, 8.0.0 Aug 2, 2022
@karelz karelz unassigned wfurt Aug 2, 2022
@ManickaP
Copy link
Member

Triage: there's nothing we can do about the root cause on our side. We could expose counters with the values of msquic statistics (either private or public). Either way, this is diagnostics and not critical.

@rzikm
Copy link
Member

rzikm commented Feb 28, 2024

Now that #98361 is merged, this issue should not be an issue anymore (unless the machine is truly overloaded). Let's keep this issue open until we can validate this when MsQuic 2.4 gets released and the CI images get updated.

@wfurt
Copy link
Member Author

wfurt commented Feb 28, 2024

Are we brave enough to run tests in parallel again?

@rzikm
Copy link
Member

rzikm commented Feb 28, 2024

Are we brave enough to run tests in parallel again?

I would, try it, once all runners have MsQuic 2.4 (the cert validation does not run async on lower versions because of microsoft/msquic#4132)

@nibanks
Copy link

nibanks commented Feb 28, 2024

Are we brave enough to run tests in parallel again?

I would, try it, once all runners have MsQuic 2.4 (the cert validation does not run async on lower versions because of microsoft/msquic#4132)

Should we take that fix to v2.3? Or do you not care, because you're not actually using v2.3?

@rzikm
Copy link
Member

rzikm commented Feb 28, 2024

We are running 2.3 on CI, so porting it would be nice. Thanks in advance.

@nibanks
Copy link

nibanks commented Feb 28, 2024

We are running 2.3 on CI, so porting it would be nice. Thanks in advance.

Ok. I will try to get around to it this week. If you want, feel free to make a PR to release/2.3 to cherry-pick the PR.

@rzikm
Copy link
Member

rzikm commented Feb 29, 2024

microsoft/msquic#4164

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 15, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Quic in-pr There is an active PR which will close this issue when it is merged test-enhancement Improvements of test source code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants