
HttpClient throws TimeoutException wrapped by TaskCanceledException when request times out #2281

Merged

Conversation

alnikola
Contributor

@alnikola alnikola commented Jan 28, 2020

Currently, HttpClient throws the same TaskCanceledException regardless of the reason the request was canceled, which can be the caller's cancellation, cancellation of all pending requests, or a timeout. This makes it impossible to handle a request timeout differently from the other cases (e.g. with special retry logic).

This PR adds timeout detection logic to HttpClient. It watches for all TaskCanceledExceptions and catches the ones triggered by the timeout timer. Then it creates two new exceptions and builds a hierarchy. The first is a TimeoutException whose InnerException is set to the original TaskCanceledException. The second is a new TaskCanceledException whose InnerException is set to that new TimeoutException, while preserving the original stack trace, message and cancellation token. Finally, this top-level TaskCanceledException gets thrown.
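For illustration, a caller could then distinguish a timeout from other cancellations roughly like this (a minimal sketch; the URL and timeout value are placeholders, not part of the PR):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class TimeoutHandlingExample
{
    static async Task Main()
    {
        using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
        try
        {
            using HttpResponseMessage response = await client.GetAsync("http://example.com/slow");
        }
        catch (TaskCanceledException tce) when (tce.InnerException is TimeoutException)
        {
            // With this change, an HttpClient.Timeout expiry carries a nested TimeoutException,
            // so it can get dedicated handling (e.g. a retry).
            Console.WriteLine("Request timed out.");
        }
        catch (TaskCanceledException)
        {
            // Cancellation requested via the caller's token or by canceling all pending requests.
            Console.WriteLine("Request was canceled.");
        }
    }
}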

Fixes #21965

alnikola and others added 7 commits January 28, 2020 16:54
if (hasTimeout || cancellationToken.CanBeCanceled)
{
    disposeCts = true;
    cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, _pendingRequestsCts.Token);
    if (hasTimeout)
    {
        timeoutTime = Environment.TickCount64 + (long)_timeout.TotalMilliseconds;
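For context, a hedged sketch of how this stored deadline might later be checked; the actual TimeoutFired helper in the PR may differ in details:

// If the caller's own token has not requested cancellation but the deadline has passed,
// the observed cancellation must have come from the timeout timer.
private static bool TimeoutFired(CancellationToken callerToken, long timeoutTime) =>
    !callerToken.IsCancellationRequested && Environment.TickCount64 >= timeoutTime;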
Member

Can you please run some perf tests to see if this Environment.TickCount64 call shows up at all?

Contributor Author

Will do.

Contributor Author

I compared the performance of a successful GetAsync path without cancellation in the following two cases and found that the TickCount64 access adds 1.84% to execution time.

  1. Normal code reading Environment.TickCount64:
     timeoutTime = Environment.TickCount64 + (long)_timeout.TotalMilliseconds;
  2. Hardcoded constant used instead of Environment.TickCount64:
     timeoutTime = 9223372000000000000L + (long)_timeout.TotalMilliseconds;

Test code
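            // Fields assumed by this snippet (not shown in the original comment):
            // private HttpClient client;
            // private Uri uri;
            // private CancellationTokenSource cts = new CancellationTokenSource();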

            [GlobalSetup]
            public void Setup()
            {
                var response = new HttpResponseMessage(System.Net.HttpStatusCode.OK);
                client = new HttpClient(new CustomResponseHandler((r, c) => 
                    Task.FromResult(response)));
                client.Timeout = TimeSpan.FromDays(1);
                uri = new Uri($"http://{Guid.NewGuid().ToString("N")}");
            }

            [Benchmark]
            public void Run()
            {
                for (int i = 0; i < 10000; i++)
                {
                    client.GetAsync(uri, HttpCompletionOption.ResponseHeadersRead, 
                        cts.Token).Wait();
                }
            }
| Method | Toolchain | Mean | Error | StdDev | Ratio |
| --- | --- | --- | --- | --- | --- |
| Run | Constant long value | 3.369 ms | 0.0189 ms | 0.0168 ms | 0.98 |
| Run | Environment.TickCount64 | 3.431 ms | 0.0380 ms | 0.0317 ms | 1.00 |

Contributor

@stephentoub I'm curious -- I believe GetTickCount64 is just mov rax, [addr] last I looked; is there something special in our implementation of TickCount64 that is worth watching out for?

Member

I'm not following... are you referring to the mov that's part of the tail call to that function? The single mov isn't the entirety of TickCount64.

Contributor

I may be remembering incorrectly here; the last time I looked was in the XP x64 days. Regardless, I don't think it's relevant to this PR conversation, so I'll go look independently 😊

Member

@stephentoub stephentoub left a comment

Overall this looks pretty good, but still some comments/questions.

Timeout cancellation is logged
@alnikola
Contributor Author

alnikola commented Feb 7, 2020

/azp run runtime-libraries outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@alnikola
Contributor Author

/azp run runtime-libraries outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@alnikola
Contributor Author

No performance degradation found in comparison with the latest master.

| Name | RPS | Avg. Latency |
| --- | --- | --- |
| Master (local build) | 23,349 | 10.86 ms |
| This PR (local build) | 23,404 | 10.94 ms |

@@ -573,6 +589,12 @@ public Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, HttpCompl
}
catch (Exception e)
{
    if (e is OperationCanceledException operationException && TimeoutFired(callerToken, timeoutTime))
Contributor

It looks like this could be an exception filter.

Member

@stephentoub stephentoub Feb 14, 2020

If it's an exception filter, then you need two catch blocks; doesn't really help anything, e.g.

catch (OperationCanceledException oce) when (TimeoutFired(callerToken, timeoutTime))
{
    HandleSendAsyncTimeout(oce);
    throw CreateTimeoutException(oce);
}
catch (Exception e)
{
    HandleFinishSendAsyncError(e, cts);
    throw;
}

But either way is fine.
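For context, the CreateTimeoutException helper referenced above presumably builds the exception hierarchy described in the PR summary. A hedged sketch, not the PR's actual code:

// Wrap the original cancellation in a TimeoutException, then wrap that in a new
// TaskCanceledException so existing catch (TaskCanceledException) handlers keep working.
// The real change also preserves the original cancellation token on the outer exception.
private static Exception CreateTimeoutException(OperationCanceledException originalException)
{
    var timeoutException = new TimeoutException(
        "The request was canceled due to the configured HttpClient.Timeout elapsing.",
        originalException);
    return new TaskCanceledException(originalException.Message, timeoutException);
}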

Contributor

@scalablecory scalablecory Feb 14, 2020

I find the two catch blocks cleaner; is there a reason to avoid multiple catch blocks when they have entirely exclusive bodies? perf concerns?

Member

Previously they weren't entirely discrete:
#2281 (comment)
As they are now, it's really in the eye of the beholder. The two blocks with an exception filter will add some expense when the exception occurs, but given that it's only in the exceptional case, it's not prohibitive. Whichever you both decide is fine.

Contributor Author

@scalablecory Do you want me to change it back to the exception filters? I'd rather not do this since it would require rerunning all perf tests for Linux and Windows, which takes a lot of time.

Contributor

My opinion is that this is fine as-is. If we think this is a problem later, you can change it.

@alnikola
Contributor Author

@stephentoub The above perf results were incorrect due to a misconfiguration. I fixed that and ran the tests again. The results below are for Windows; the Linux tests are in progress. No degradation so far.

| OS | Name | RPS | Avg. Latency |
| --- | --- | --- | --- |
| Windows | Master (public build) | 101,639 | 2.83 ms |
| Windows | This PR (local build) | 103,330 | 2.81 ms |

@alnikola
Contributor Author

/azp run runtime-libraries outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@alnikola
Contributor Author

The results below are for Linux. No degradation.

| OS | Name | RPS | Avg. Latency |
| --- | --- | --- | --- |
| Linux | Master (public build) | 103,680 | 2.51 ms |
| Linux | This PR (local build) | 103,197 | 2.53 ms |

@alnikola alnikola merged commit 7818335 into dotnet:master Feb 19, 2020
@alnikola alnikola deleted the alnikola/20296-httpclient-timeout-exception branch February 19, 2020 11:31
@stephentoub
Member

Thanks for checking. Just to confirm, the only difference between master and local here is this commit, right?

@alnikola
Contributor Author

I merged the latest master into the PR branch a couple of days ago and compared it to the latest publicly available version, so there could be some additional changes. But since no degradation is found, I don't see any issue. Is something wrong here?

@davidsh davidsh added this to the 5.0 milestone Feb 19, 2020
@stephentoub
Member

stephentoub commented Feb 19, 2020

> But since no degradation is found, I don't see any issue. Is something wrong here?

I was asking just to make sure we were comparing apples-to-apples. It sounds like it's probably fine. In the future it'd be good to rebase so you're comparing just the diff.

@sebastienros
Member

@alnikola the issue might be that your PR is based on an old commit, and you would then compare master's head to your PR, which could hide some improvements/regressions. So saying "But since no degradation is found" is not sufficient; you might actually have regressed a recent improvement.

To prevent that, you either:

  • rebase your PR on master before doing a benchmark, and explicitly mention --self-contained on the base
  • or compare locally built DLLs both with and without your changes

In your case you rebased on master, so the results are legitimate.

@alnikola
Contributor Author

alnikola commented Feb 20, 2020

It seems I formulated my message imprecisely. There are two points:

  1. I merged (not rebased, but it doesn't matter) master into my branch just before running the Windows tests and set -self-contained. So the Windows results compare the PR, including the latest master changes, against the publicly available version.
  2. The Linux version was tested a couple of days later because it took some time to build and because of some perf infra issues.

However, I agree that I missed the possibility of accidentally hiding recent improvements, so in the future it would be better to compare the PR against a locally built master so that only the diff is compared.

@coderb

coderb commented Oct 5, 2020

I'm still seeing this in netcoreapp3.1 even though it is marked closed. Which version of .NET Core is this scheduled for?

@wfurt
Member

wfurt commented Oct 5, 2020

This will be in 5.0, @coderb (see the milestone). I don't think this would get permission for a backport.

Development

Successfully merging this pull request may close these issues.

HttpClient throws TaskCanceledException on timeout