Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/7.0] [QUIC] Fix native crashes and heap corruption via "generated-like" interop #75192

Merged

Conversation

CarnaViire
Copy link
Member

@CarnaViire CarnaViire commented Sep 7, 2022

Backport of #74669 to release/7.0
Fixes #72696

/cc @CarnaViire

Customer Impact

HTTP/3 or QUIC application crashed with either "Aborted" or "Segmentation Fault" due to native heap corruption. The native crash happened in a time frame from several minutes to several hours, depending on how common was the race between Dispose and other QUIC calls (e.g. cancelling/disposing the stream while sending the data) in the user scenario.

The root cause of the native heap corruption was incorrect and unsynchronized usage of native pointers and arrays, which in case of multithreaded access led to use-after-free and other native memory access issues. This eventually led to native heap corruption which manifested as a crash after some time.

There are 2 main parts of the fix:

  1. Wrap the use of the native MsQuic pointers into "generated-like" SafeHandle pattern, which will ensure the disposed/closed handles are not used and handles are not disposed/closed while in use (during 7.0, interop changed from marshalling to unmanaged function pointers, at which point we have lost automatic SafeHandle "safety").
  2. In QuicStream, wrap SendBuffers native array usages into lock to ensure it will not get freed while being used by native code.

Discovered in HTTP/3 stress runs.

Testing

Multiple 10+ hours of general HTTP/3 stress test runs, multiple ~3h runs for targeted stress scenario with high race probability (POST Duplex Dispose with cancel rate 100%).
Before the fix, the issue would almost always manifest in ~1h timeframe for general HTTP/3 stress test run, and in ~5min for targeted stress scenario.

Risk

Low, System.Net.Quic is still in preview.

…terop (dotnet#74669)

* Send buffers and handles crash fixes

* Add generated-like interop

* Apply PR feedback from dotnet#74611

* Change asserts

* Feedback + moved native methods to their own file

* PR feedback

Co-authored-by: ManickaP <mapichov@microsoft.com>
@ghost
Copy link

ghost commented Sep 7, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Backport of #74669 to release/7.0
Fixes #72696

/cc @CarnaViire

Customer Impact

HTTP/3 or QUIC application crashed with either "Aborted" or "Segmentation Fault" due to native heap corruption. The native crash happened in a time frame from several minutes to several hours, depending on how common was the race between Dispose and other QUIC calls (e.g. cancelling/disposing the stream while sending the data) in the user scenario.

The root cause of the native heap corruption was incorrect and unsynchronized usage of native pointers and arrays, which in case of multithreaded access led to use-after-free and other native memory access issues. This eventually led to native heap corruption which manifested as a crash after some time.

There are 2 main parts of the fix:

  1. Wrap the use of the native MsQuic pointers into "generated-like" SafeHandle pattern, which will ensure the disposed/closed handles are not used and handles are not disposed/closed while in use (during 7.0, interop changed from marshalling to unmanaged function pointers, at which point we have lost automatic SafeHandle "safety").
  2. In QuicStream, wrap SendBuffers native array usages into lock to ensure it will not get freed while being used by native code.

Discovered in HTTP/3 stress runs.

Testing

Multiple 10+ hours of stress test runs.

Risk

Low, System.Net.Quic is still in preview.

Author: CarnaViire
Assignees: -
Labels:

area-System.Net.Http

Milestone: -

@carlossanlop
Copy link
Member

@danmoseley need an approval please.

@karelz karelz added this to the 7.0.0 milestone Sep 7, 2022
@danmoseley
Copy link
Member

approved - basic stability of new feature

@danmoseley
Copy link
Member

BTW @CarnaViire great to see us using stress testing to validate in this way. Nice

@CarnaViire
Copy link
Member Author

runtime-libraries stress-http (Docker Linux) hit some infra issue

/d/a/_work/_temp/44bbd754-bdb1-4f13-9fcb-1d90dc1caf14.sh: line 1: D:a_work1s/eng/docker/build-docker-sdk.sh: No such file or directory

Happens on release/7.0 branch as well: https://dev.azure.com/dnceng-public/public/_build/results?buildId=7638&view=results

@CarnaViire
Copy link
Member Author

/azp run runtime-libraries stress-http

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@carlossanlop
Copy link
Member

Approved, signed off, and CI is green. Ready to merge. :shipit:

@carlossanlop carlossanlop merged commit 69ddb97 into dotnet:release/7.0 Sep 8, 2022
@karelz karelz requested review from rzikm and wfurt September 8, 2022 05:48
@CarnaViire CarnaViire deleted the backport/pr-74669-to-release/7.0 branch September 8, 2022 10:07
@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants