
SocketsHttpHandler.SendAsync reads request Content stream twice #53914

Open
davidni opened this issue Jun 9, 2021 · 30 comments
Labels
area-System.Net.Http bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

@davidni
Contributor

davidni commented Jun 9, 2021

Description

Under very specific conditions, SocketsHttpHandler.SendAsync when using HTTP/2 attempts to copy the HttpRequestMessage.Content twice.

In Island Gateway (AKA what eventually became YARP), we have code similar to the following, and the invariant that this best-effort check enforces is being violated:

https://github.com/microsoft/reverse-proxy/blob/961608d01adc9ab07d4cd3f5adf7d4d3dfff3f0f/src/ReverseProxy/Proxy/StreamCopyHttpContent.cs#L118-L125

The following criteria seem to be necessary to trigger this bug, though it may be broader than this:

  • Request specifies Expect: 100-continue
  • Destination does not produce a 100 response within the first 1 second
  • HttpConnectionPool.SendWithRetryAsync decides to retry (just a hypothesis)

It seems like the SocketsHttpHandler machinery decides it is safe to retry because the destination service provably did not start processing the first attempt, but it then wrongly assumes that it is okay to read the HttpRequestMessage.Content again on the next attempt. This is not a valid assumption, and it leads to issues in applications such as YARP, where the Content is a wrapper around Kestrel's incoming request body stream, which cannot be read twice.
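
For reference, here is a minimal sketch (hypothetical type and member names, not the actual YARP code) of the kind of best-effort "read once" content wrapper described above; it throws if the handler attempts a second serialization, because the underlying stream cannot be rewound:

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical sketch: wraps a one-shot source stream (e.g. Kestrel's request body).
internal sealed class OneShotStreamContent : HttpContent
{
    private readonly Stream _source;
    private bool _started; // best-effort flag; note that it is not thread safe

    public OneShotStreamContent(Stream source) => _source = source;

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        if (_started)
        {
            // The source stream has already been consumed; a second send cannot succeed.
            throw new InvalidOperationException("Stream was already consumed.");
        }
        _started = true;

        await _source.CopyToAsync(stream).ConfigureAwait(false);
    }

    protected override bool TryComputeLength(out long length)
    {
        length = -1;
        return false; // length unknown; the body is streamed
    }
}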

Configuration

  • .NET Core 3.1.15
  • Windows Server Datacenter 2019

Other information

Filing this as a result of an offline discussion with @halter73 and @davidfowl. Thanks folks for helping investigate.

@karelz FYI.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net.Http untriaged New issue has not been triaged by the area owner labels Jun 9, 2021
@ghost

ghost commented Jun 9, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@geoffkizer
Contributor

Can you try with a recent .NET 6 build?

There are a bunch of situations that can cause us to try to read the content stream twice. If the content doesn't support this, it can throw and we should fail the request. There's not really anything else we can do at that point.

@geoffkizer
Contributor

More info:

Things that can cause the content stream to be read twice include redirects and authentication. We handle specific status codes such as 302 or 401 by issuing another request and sending the content again. Expect: 100-continue can be used to avoid reading the content twice in cases like this, because ideally we will get the 302 or 401 before proceeding to send content; but obviously this only works if the expect timeout doesn't expire first.

Are you using redirects or authentication here?

We do have logic to retry requests on connection shutdown -- that's what HttpConnectionPool.SendWithRetryAsync is about. But this shouldn't cause the content stream to be read twice. Once we start sending any content, the request is no longer retryable.

@davidni
Contributor Author

davidni commented Jun 9, 2021

Thanks Geoff.

1: Unfortunately, I do not have a minimal repro of this at the moment, and I am unable to verify on .NET 6 (we will not transition our production workloads before GA).

2: There are no redirects or authentication at play here. Connection shutdown is likely the trigger -- during the incident where we had this happen, the destination service was overwhelmed and couldn't handle the load. This meant many of our requests were hitting the 1-second timeout for Expect: 100-continue, and in at least a few cases this did cause the content stream to be read twice. We have unambiguous evidence of this based on a few occurrences of the error I mentioned in the issue above ("Stream was already consumed"). We have indirect evidence of this based on Kestrel logs with the following stack trace, indicating two concurrent calls to read the request content (look for StreamCopyHttpContent.SerializeToStreamAsync below):

System.NotSupportedException:  The ReadAsync method cannot be called when another read operation is pending.
   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.IO.Pipelines.StreamPipeReader.ReadAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1ContentLengthMessageBody.ReadAsyncInternal(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.ReadAsyncInternal(Memory`1 buffer, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.StreamCopier.CopyAsync(Stream source, Stream destination, CancellationToken cancellation)
   at IslandGateway.Core.Service.Proxy.StreamCopyHttpContent.SerializeToStreamAsync(Stream stream, TransportContext context)
   at System.Net.Http.HttpContent.CopyToAsyncCore(ValueTask copyTask)
   at System.Net.Http.Http2Connection.Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.HttpProxy.<>c__DisplayClass13_0.<<ExecuteRequestAsync>b__0>d.MoveNext()

@geoffkizer
Contributor

Ahhhh, this is HTTP2. I had assumed HTTP/1.1.

Yes, with HTTP2, there are additional scenarios. When we receive a GOAWAY (server-initiated connection shutdown), the server tells us which streams will be processed and which won't; the ones that won't be processed are retryable, even if we already started sending the content. So in this case you will definitely see us try to read the content twice.

That said, this is expected behavior. If you can't support reading the content again, then you should throw and the request should be failed.

@davidni
Contributor Author

davidni commented Jun 9, 2021

@Tratcher for your take re. YARP. If this is confirmed to be acceptable, we would at least want to make sure we are handling it correctly and emitting the right data to telemetry. E.g. perhaps this should count as a Transport error (?)

@geoffkizer
Contributor

You can also consider increasing the Expect100ContinueTimeout so that you'll hit this less frequently.
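
For example (a sketch; Expect100ContinueTimeout is a real SocketsHttpHandler property with a 1-second default, and the 5-second value below is only illustrative):

using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Give slow destinations more time to answer the Expect: 100-continue
    // before the client gives up waiting and starts sending the request body.
    Expect100ContinueTimeout = TimeSpan.FromSeconds(5),
};
var client = new HttpClient(handler);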

@Tratcher
Member

Tratcher commented Jun 9, 2021

We have indirect evidence of this based on Kestrel logs with the following stack trace, indicating two concurrent calls to read the request content

This part is concerning: it means SerializeToStreamAsync is being called concurrently on two threads and that the Started check in YARP isn't thread safe. The two concurrent threads make sense if the retry is happening at the same time the 100-continue timeout expires. @geoffkizer that sounds like a potential race in the client logic. If the 100-continue timeout expired and content is being sent, then the retry should wait for that send to abort.

This could be mitigated by making the Started check in YARP thread safe. I'll file an issue for that (microsoft/reverse-proxy#1068).
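
A possible shape for that fix (a sketch continuing the hypothetical OneShotStreamContent above, not the actual change made for microsoft/reverse-proxy#1068):

// Interlocked lives in System.Threading.
private int _started; // 0 = not started, 1 = started

protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
{
    // Interlocked.Exchange makes the "already consumed" check safe even if the
    // handler races and calls SerializeToStreamAsync from two threads at once.
    if (Interlocked.Exchange(ref _started, 1) == 1)
    {
        throw new InvalidOperationException("Stream was already consumed.");
    }

    await _source.CopyToAsync(stream).ConfigureAwait(false);
}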

As for the retry scenario in general, yes, that's expected to fail in YARP and it should be reported accordingly. What type of failure is it currently reported as? It might depend on whether it sent the whole request body before retrying, or whether it was aborted part way through.

@Tratcher
Member

Tratcher commented Jun 9, 2021

Raising the 100-continue timeout is a valid mitigation and shouldn't affect responsive servers.

@geoffkizer
Contributor

@geoffkizer that sounds like a potential race in the client logic. If the 100-continue timeout expired and content is being sent, then the retry should wait for that send to abort.

Looking at the code, we should be waiting for SerializeToStreamAsync to complete before we actually fail the request. So I don't see how we could end up with concurrent SerializeToStreamAsync calls here. (I'm looking at 6.0 code, but I don't think this has changed much since 3.1.)

Note that in cases like this, where we get a GOAWAY or similar and need to cancel sending the request body, we will do two things:

(1) Cancel the cancellation token passed to SerializeToStreamAsync
(2) Throw on any attempt to use the request body stream passed to SerializeToStreamAsync

@davidni
Contributor Author

davidni commented Jun 11, 2021

(1) Cancel the cancellation token passed to SerializeToStreamAsync

@geoffkizer there's no cancellation parameter here; can you clarify what you're referring to? This may be the issue at hand. We really are making no effort to stop the first SerializeToStreamAsync, because there doesn't seem to be any signal we should be reacting to.

public abstract class HttpContent : IDisposable
{
  //...
  protected abstract Task SerializeToStreamAsync(Stream stream, TransportContext context);
}

(2) Throw on any attempt to use the request body stream passed to SerializeToStreamAsync

The failure we are experiencing happens before we attempt to write to the request body stream. We were still waiting to read bytes from the other side before proxying to the destination.

@Tratcher
Member

A cancellable SerializeToStreamAsync overload was added in 5.0. YARP hasn't implemented it yet since we're still supporting 3.1.

protected virtual Task SerializeToStreamAsync(Stream stream, TransportContext? context, CancellationToken cancellationToken) =>
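
For illustration, honoring that overload could look like the sketch below (continuing the hypothetical content type above; see MihaZupan's note below on why YARP deliberately does not implement it):

protected override async Task SerializeToStreamAsync(
    Stream stream, TransportContext? context, CancellationToken cancellationToken)
{
    if (Interlocked.Exchange(ref _started, 1) == 1)
    {
        throw new InvalidOperationException("Stream was already consumed.");
    }

    // Flowing the handler-provided token means a GOAWAY / stream reset can stop
    // this copy promptly instead of leaving it racing with a retried request.
    await _source.CopyToAsync(stream, cancellationToken).ConfigureAwait(false);
}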

@MihaZupan
Member

Wouldn't we throw it away in YARP anyway? Right now we have a comment in the code about it:

Note we do not need to implement the SerializeToStreamAsync(Stream stream, TransportContext context, CancellationToken token) overload because the token it provides is the short request token, not the long body token passed from the constructor. Using the short token would break bidirectional streaming scenarios.

@Tratcher
Member

Note that in cases like this, where we get a GOAWAY or similar and need to cancel sending the request body, we will do two things:

(1) Cancel the cancellation token passed to SerializeToStreamAsync

@geoffkizer @MihaZupan is this a linked token? What are the possible triggers for it?

@MihaZupan MihaZupan removed the untriaged New issue has not been triaged by the area owner label Jun 29, 2021
@karelz
Member

karelz commented Aug 17, 2021

@MihaZupan did you get a chance to look at it? It is not clear why the untriaged label was removed.

@karelz karelz added the untriaged New issue has not been triaged by the area owner label Aug 17, 2021
@MihaZupan
Member

MihaZupan commented Aug 17, 2021

From reading the code, it looks like it is possible that the request content will be read multiple times concurrently on HTTP/1.1.

This could cause the exception seen in dotnet/aspnetcore#33409 (comment).

Judging from the stack trace seen by David, I suspect a similar race exists for HTTP/2.

@geoffkizer If we're saying that the content may be read multiple times, are we also saying that those reads may happen concurrently?
If so, we should fix cases like StreamContent that aren't using a thread-safe check.
If not, we should fix races in Http*Connection.

@MihaZupan MihaZupan added the bug label Aug 17, 2021
@geoffkizer
Contributor

@geoffkizer If we're saying that the content may be read multiple times, are we also saying that those reads may happen concurrently?

No, the read should never happen concurrently.

@geoffkizer
Contributor

Your analysis seems correct to me.

It seems like we need to do a couple of things here.

The primary one is that we should wait for the sendRequestContentTask to complete before we propagate the exception upward. That should ensure we don't read concurrently from the request content.

However, doing that may introduce some new issues. In order to ensure the sendRequestContentTask completes promptly (or indeed at all), we need to cancel it. Unlike HTTP/2, we don't seem to have logic to do this for HTTP/1.1; I think we should.

Judging from the stack trace seen by David, I suspect a similar race exists for HTTP/2.

It's certainly possible, but the HTTP/2 code is pretty different, so we should investigate and confirm.

One thing I noticed from the stack trace you linked is that the derived HttpContent class doesn't implement the SerializeToStreamAsync overload with a cancellation token. That means that even if we try to cancel sending the request body, it won't necessarily be cancelled promptly. I'm not sure whether this matters in terms of reproducing the issue -- I can't see how it would, but maybe I'm missing something.

As an aside, this all makes me wish we had common Expect: 100-Continue logic across different HTTP versions. We now have this implemented in three different places, unfortunately.

@karelz karelz added this to the 7.0.0 milestone Aug 17, 2021
@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Aug 17, 2021
@karelz
Member

karelz commented Aug 17, 2021

Triage: There is a workaround (a lock in the custom HttpContent), so this is not critical for 6.0. But we should fix it -- i.e. make sure we don't start the 2nd read before fully cancelling the 1st one.

@geoffkizer
Contributor

geoffkizer commented Aug 19, 2021

Actually I think there's an even more basic issue here:

  • Thread A reads an EOF from the server, sees _startedSendingRequestBody == false, sets _canRetry = true and throws an IO exception.
  • The timer wakes up thread B, signals allowExpect100ToContinue with true, and starts reading the request content.

We shouldn't be both setting _canRetry = true and actually starting to send the body. We either need to prevent the body from sending when _canRetry is set to true, or if we can't (i.e. we lose the race), then don't set _canRetry to true.

Edit: And if we did this, we would prevent the concurrent read because we'd never retry if we started sending the content.

@davidni
Contributor Author

davidni commented Aug 26, 2021

We recently started seeing this issue on HTTP/2 even without Expect: 100-continue. Interestingly, it happened for both GETs and POSTs. It seems likely that this could be related to timing issues between the HttpClient stack and the HTTP/2 server that it is talking to. See e.g. microsoft/reverse-proxy#711 (comment)

An example call stack from the recent occurrences is below (we had >400 hits of this in a few hours):

System.Net.Http.HttpRequestException: An error occurred while sending the request.
 ---> System.InvalidOperationException: Stream was already consumed.
   at IslandGateway.Core.Service.Proxy.StreamCopyHttpContent.SerializeToStreamAsync(Stream stream, TransportContext context)
   at System.Net.Http.HttpContent.CopyToAsyncCore(ValueTask copyTask)
   at System.Net.Http.Http2Connection.Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at Microsoft.PowerApps.CoreFramework.Communication.Http.HttpClientCorrelationHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.HttpProxy.<>c__DisplayClass13_0.<<ExecuteRequestAsync>b__0>d.MoveNext()

@geoffkizer
Contributor

As I mentioned above -- there are various scenarios where we will try to send the content twice, and if the content doesn't support this, it will fail.

It should never try to send the content concurrently; but as long as the sends aren't concurrent, trying to send the content twice is expected in the scenarios discussed above.

@geoffkizer
Contributor

BTW I assume this is still .NET Core 3.1? I.e. no code has changed on the client here?

@davidni
Contributor Author

davidni commented Aug 26, 2021

Correct, still .NET Core 3.1. Likely the only things that changed were the request timing patterns and the destination services we are routing to.

@shawncx

shawncx commented Apr 25, 2022

Hi @geoffkizer, @Tratcher, we are hitting the exact same issue. We are using .NET Core 3.1 and HTTP/2. Is there any recommended action to mitigate this issue? And what kind of error is it recommended to return to the user in this case?

@karelz
Member

karelz commented Jul 21, 2022

Triage: This problem already exists in 6.0, so it is not a regression.
Only one customer report, and it has a workaround in YARP.
We should look at it from a reliability perspective, but given the low impact on customers, moving to 8.0.

@karelz karelz modified the milestones: 7.0.0, 8.0.0 Jul 21, 2022
@karelz karelz added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Jul 21, 2022
@ManickaP
Member

Triage: still limited customer impact, moving to future.

@MihaZupan
Member

I think I figured out how concurrent calls to SerializeToStreamAsync can occur with HTTP/2 as well:

  1. Send a request with Expect 100-Continue set.
  2. The Http2Connection sends the request headers and calls Http2Stream.SendRequestBodyAsync, storing the requestBodyTask.
  3. SendRequestBodyAsync starts waiting on WaitFor100ContinueAsync and yields.
  4. The Http2Connection starts waiting on the request body or response headers tasks to complete.
  5. After some time, the Expect100ContinueTimeout timer in WaitFor100ContinueAsync fires. SendRequestBodyAsync sees that the request cancellation hasn't been requested yet and is about to call HttpContent.SerializeToStreamAsync in the background.
  6. At about the same time in a different thread, the server sends us a GOAWAY frame or a REFUSED_STREAM RST. Http2Connection decides that the request is retryable and calls Http2Stream.OnReset with canRetry: true.
  7. OnReset() sets Http2Stream._canRetry to true and calls Http2Stream.Cancel(), which signals the waiter.
  8. The waiter getting signaled wakes up the ReadResponseHeadersAsync task, which sees that the stream was reset and throws the retryable exception.
  9. The Http2Connection.SendAsync which was waiting on the request body or response headers to complete now wakes up and sees that the response headers task completed first.
  10. It doesn't wait for the requestBodyTask to complete, but sets up a continuation that will observe any exceptions.
  11. The retryable exception from responseHeadersTask is propagated out of SendAsync and into HttpConnectionPool.SendWithVersionDetectionAndRetryAsync which will now attempt to send the request again.
  12. The request is sent again, and HttpContent.SerializeToStreamAsync is called.
  13. In the background, the original SendRequestBodyAsync task (from step 3) may be calling into SerializeToStreamAsync at the same time.

The gist is that we aren't waiting for requestBodyTask to complete before returning a retryable exception from Http2Connection.SendAsync, allowing SerializeToStreamAsync to be called multiple times concurrently.
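
A simplified illustration of that gist (a hypothetical shape, not the actual Http2Connection code): the fix direction is to cancel and await the request-body task before the retryable exception is allowed to escape.

using System;
using System.Threading;
using System.Threading.Tasks;

static async Task SendCoreAsync(
    Func<CancellationToken, Task> sendRequestBodyAsync,
    Func<Task> readResponseHeadersAsync)
{
    using var bodyCts = new CancellationTokenSource();
    Task requestBodyTask = sendRequestBodyAsync(bodyCts.Token);
    Task responseHeadersTask = readResponseHeadersAsync();

    Task completed = await Task.WhenAny(requestBodyTask, responseHeadersTask).ConfigureAwait(false);

    if (completed == responseHeadersTask && responseHeadersTask.IsFaulted)
    {
        // The stream was reset and the request is considered retryable. Before the
        // exception escapes (and the pool retries, reading the content again),
        // cancel and drain the body task so SerializeToStreamAsync cannot run
        // twice concurrently.
        bodyCts.Cancel();
        try { await requestBodyTask.ConfigureAwait(false); } catch { /* observed */ }
    }

    await responseHeadersTask.ConfigureAwait(false); // rethrows the retryable exception
    await requestBodyTask.ConfigureAwait(false);
}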

@sberube

sberube commented Sep 16, 2024

Triage: This problem already exists in 6.0, so it is not a regression. Only one customer report, and it has a workaround in YARP. We should look at it from a reliability perspective, but given the low impact on customers, moving to 8.0.

@karelz we seem to observe the same behavior using YARP 2.2 and .NET 8. You mentioned that one customer has a workaround; what is this workaround, and how can we add it to our system?

Thanks!

@MihaZupan
Member

MihaZupan commented Sep 16, 2024

@sberube I take it you're seeing the "Stream was already consumed." error message?
See microsoft/reverse-proxy#1683

That may happen during normal proxy operation if HttpClient decides to retry a request, something that YARP doesn't support as we don't buffer the request body.

This issue is tracking a bug in SocketsHttpHandler where those reads may happen concurrently. The workaround mentioned is mitigating just that aspect, and it's being used by YARP internally.
