[HTTP/3] SIGABRT in stress tests #72696

Closed
CarnaViire opened this issue Jul 22, 2022 · 13 comments · Fixed by #74669
Labels
area-System.Net.Http bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

Comments

@CarnaViire
Member

Today's stress test run crashed with segmentation fault after 28 mins https://dev.azure.com/dnceng/public/_build/results?buildId=1897576&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=1570

Funny enough, this run was scheduled with this ed5aa3d commit on top 😄 @rzikm (might be totally unrelated)

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 22, 2022
@ghost

ghost commented Jul 22, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@ManickaP ManickaP self-assigned this Jul 25, 2022
@ManickaP ManickaP removed the untriaged New issue has not been triaged by the area owner label Jul 25, 2022
@ManickaP ManickaP added this to the 7.0.0 milestone Jul 25, 2022
@ManickaP
Member

I'm not able to reproduce it, and the crash dumps from the pipeline are useless since the native code is a release build.
I'm keeping the draft PR open in case anyone wants to have a go at this. It builds msquic in Debug.

@ManickaP ManickaP removed their assignment Jul 27, 2022
@karelz
Member

karelz commented Aug 9, 2022

@CarnaViire can you please write down when the first hit was, and how often it happens / happened?

@CarnaViire
Member Author

CarnaViire commented Aug 9, 2022

It turned out that only the first occurrence was a segfault (exit code 139); all the others are SIGABRT (exit code 134). So it is still happening and it was not fixed by Mana's copying.

Between 7/18 and 8/9 we've had ~32 non-crashing runs (30 min) and 6 crashing runs, most of them clustered around 7/22 (see below).

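| Date | Link | Exit code | Crashing after |
|------|------|-----------|----------------|
| 7/22 | Run #20220722.1 | 139 | 28 min |
| 7/22 | Run #20220722.3 | 134 | 1 min |
| 7/23 | Run #20220723.2 | 134 | 2 min |
| 8/3 | Run #20220803.6 | 134 | 8 min |
| 8/5 | Run #20220805.1 | 134 | 13 min |
| 8/9 | Run #20220809.3 | 134 | 7 min |

UPD: new occurrences since 8/9

| Date | Link | Exit code | Crashing after |
|------|------|-----------|----------------|
| 8/12 | Run #20220812.4 | 134 | 23 min |
| 8/12 | Run #20220812.7 | 134 | 13 min |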
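
The exit codes in the tables above follow the common shell convention that a process killed by signal N is reported as 128 + N. A minimal sketch of that decoding, assuming a POSIX system where Python's `signal` module knows the signal numbers; the helper name `describe_exit_code` is made up for illustration and is not part of the stress test tooling:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name, if it encodes one."""
    if code > 128:
        try:
            # Shells report "killed by signal N" as 128 + N.
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # Not a signal number known on this platform.
    return f"exit status {code}"


for code in (139, 134):
    print(code, describe_exit_code(code))
```

On Linux this maps 139 to SIGSEGV and 134 to SIGABRT, which matches the segfault/abort split described in the comment.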

@wfurt
Member

wfurt commented Aug 9, 2022

SIGABRT is likely coming from an assert or an unhandled exception. That should be visible from the dump, IMHO.

@CarnaViire CarnaViire self-assigned this Aug 10, 2022
@CarnaViire CarnaViire changed the title from "[HTTP/3] Segmentation fault in stress tests" to "[HTTP/3] SIGABRT in stress tests" Aug 12, 2022
@CarnaViire
Member Author

Update: SIGABRT is most likely a result of native heap corruption. pthread_mutex_lock returns EINVAL, meaning "The value specified by mutex does not refer to an initialized mutex object."

AddressSanitizer has caught a heap-use-after-free on send buffers handled by .NET threads (the buffers are allocated in native memory).

==47659==ERROR: AddressSanitizer: heap-use-after-free on address 0x60200267b510 at pc 0x7ebf82297fd3 bp 0x7ebf7e6aa9f0 sp 0x7ebf7e6aa9e0
READ of size 4 at 0x60200267b510 thread T16
    #0 0x7ebf82297fd2 in QuicStreamSendBufferRequest /home/lia/dev/git/msquic/src/core/stream_send.c:449
    #1 0x7ebf82391c33 in QuicSendBufferFill /home/lia/dev/git/msquic/src/core/send_buffer.c:181
    #2 0x7ebf8229bf67 in QuicStreamSendFlush /home/lia/dev/git/msquic/src/core/stream_send.c:594
    #3 0x7ebf82307502 in QuicConnProcessApiOperation /home/lia/dev/git/msquic/src/core/connection.c:7205
    #4 0x7ebf82307fe2 in QuicConnDrainOperations /home/lia/dev/git/msquic/src/core/connection.c:7340
    #5 0x7ebf822afd82 in QuicWorkerProcessConnection /home/lia/dev/git/msquic/src/core/worker.c:510
    #6 0x7ebf822b1342 in QuicWorkerLoop /home/lia/dev/git/msquic/src/core/worker.c:668
    #7 0x7ebf822b1d5e in QuicWorkerThread /home/lia/dev/git/msquic/src/core/worker.c:733
    #8 0x7f00a0862608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477
    #9 0x7f00a0433132 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f132)

0x60200267b510 is located 0 bytes inside of 16-byte region [0x60200267b510,0x60200267b520)
freed by thread T194 (.NET ThreadPool) here:
    #0 0x7f00a099d40f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:122
    #1 0x7f001d7d3433  (/usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.0-rc.1.22403.8/System.Private.CoreLib.dll+0xf3433)
    ........

previously allocated by thread T212 (.NET ThreadPool) here:
    #0 0x7f00a099d808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x7f001fe394c6  (/memfd:doublemapper (deleted)+0x96a4c6)
    ........

The same heap corruption most likely manifests as INVALID_PARAMETER in #73688.

I am investigating further.
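
To make the roles in the report above explicit, here is a small sketch that pulls the accessing, freeing, and allocating thread ids out of the three AddressSanitizer header lines quoted in this comment. The regular expressions and the function name `thread_roles` are illustrative assumptions, written only against the line shapes visible here, not a general ASan parser:

```python
import re

# Header lines copied from the AddressSanitizer report above.
REPORT = """\
READ of size 4 at 0x60200267b510 thread T16
freed by thread T194 (.NET ThreadPool) here:
previously allocated by thread T212 (.NET ThreadPool) here:
"""

PATTERNS = {
    "access": re.compile(r"^(READ|WRITE) of size \d+ at 0x[0-9a-f]+ thread (T\d+)"),
    "free": re.compile(r"^freed by thread (T\d+)"),
    "alloc": re.compile(r"^previously allocated by thread (T\d+)"),
}


def thread_roles(report: str) -> dict:
    """Return which thread performed the access, the free, and the allocation."""
    roles = {}
    for line in report.splitlines():
        for role, pattern in PATTERNS.items():
            match = pattern.match(line.strip())
            if match:
                # The thread id is always the last capture group.
                roles[role] = match.groups()[-1]
    return roles


roles = thread_roles(REPORT)
print(roles)
```

Three different threads are involved: the 16-byte region was allocated and freed on .NET ThreadPool threads (T212 and T194), while the read came from T16, whose stack runs through QuicWorkerThread.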

@carlossanlop
Member

Does this fix meet the bar to get backported to RC1? One of the backport PRs hit this failure there.

@CarnaViire
Member Author

I believe it does @carlossanlop -- this is a significant reliability issue

@CarnaViire
Member Author

While I think I have caught all the send-buffer-related problems (I'll put up a PR shortly), some problems remain that also result in crashes. AddressSanitizer catches this:

/home/lia/dev/git/msquic/src/inc/quic_platform.h:395:36: runtime error: member access within misaligned address 0x000000000005 for type 'struct CXPLAT_SLIST_ENTRY', which requires 8 byte alignment
0x000000000005: note: pointer points here
<memory cannot be printed>
AddressSanitizer:DEADLYSIGNAL
=================================================================
==19205==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000005 (pc 0x7f1c6792a3f4 bp 0x7f1c5de95010 sp 0x7f1c5de94f30 T61)
==19205==The signal is caused by a READ memory access.
==19205==Hint: address points to the zero page.
    #0 0x7f1c6792a3f3 in CxPlatListPopEntry /home/lia/dev/git/msquic/src/inc/quic_platform.h:395
    #1 0x7f1c6792a3f3 in CxPlatPoolAlloc /home/lia/dev/git/msquic/src/inc/quic_platform_posix.h:521
    #2 0x7f1c6792a3f3 in QuicStreamInitialize /home/lia/dev/git/msquic/src/core/stream.c:35
    #3 0x7f1c6795a2ca in MsQuicStreamOpen /home/lia/dev/git/msquic/src/core/api.c:661
    #4 0x7f5d04298a5e  (/memfd:doublemapper (deleted)+0x219a5e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/lia/dev/git/msquic/src/inc/quic_platform.h:395 in CxPlatListPopEntry

@nibanks do you possibly have any insights/hints on what could have caused this?

@nibanks

nibanks commented Aug 24, 2022

My initial guess is that you're calling StreamOpen after ConnectionClose.
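
If that guess is right, the hazard is an ordering race between closing a connection and opening a stream on it. Below is a minimal sketch of one way to rule that ordering out, written in Python for brevity. `Connection`, `open_stream`, and `close` are hypothetical names, not the MsQuic or System.Net.Quic API, and this is not necessarily how the actual fix works:

```python
import threading


class Connection:
    """Toy connection that refuses to open streams once closed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._closed = False
        self._next_stream_id = 0

    def open_stream(self) -> int:
        # Check and allocate under the same lock that close() takes, so an
        # open can never interleave with a close on another thread.
        with self._lock:
            if self._closed:
                raise RuntimeError("connection is closed")
            stream_id = self._next_stream_id
            self._next_stream_id += 1
            return stream_id

    def close(self) -> None:
        with self._lock:
            self._closed = True


conn = Connection()
first = conn.open_stream()
conn.close()
try:
    conn.open_stream()
    rejected = False
except RuntimeError:
    rejected = True
print(first, rejected)
```

The point of the design is that the closed check and the stream allocation happen atomically with respect to `close`, so a late open fails with a managed error instead of reaching freed native state.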

@karelz karelz assigned ManickaP and unassigned CarnaViire Aug 25, 2022
@karelz karelz added tenet-reliability Reliability/stability related issue (stress, load problems, etc.) bug labels Aug 25, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 25, 2022
@karelz karelz assigned CarnaViire and unassigned ManickaP Sep 6, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 6, 2022
@CarnaViire
Member Author

Reopening for 7.0 backport

@CarnaViire CarnaViire reopened this Sep 6, 2022
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Sep 7, 2022
@karelz
Member

karelz commented Sep 8, 2022

Fixed in 8.0 (main) in PR #74669 and in 7.0 (RC2) in PR #75192.

@karelz karelz closed this as completed Sep 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022
6 participants