Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQS Timeouts not firing when using async ReceiveMessageAsync #1275

Closed
danielmarbach opened this issue May 3, 2019 · 10 comments
Closed

SQS Timeouts not firing when using async ReceiveMessageAsync #1275

danielmarbach opened this issue May 3, 2019 · 10 comments
Labels
feature-request A feature should be added or improved.

Comments

@danielmarbach
Copy link
Contributor

danielmarbach commented May 3, 2019

This is a reopening of #609

As part of the above issue, the timeout property has been augmented to indicate that it is the responsibility of the caller to implement a timeout by passing a cancellation token. I do think this behavior is hard to understand due to the following reasons.

When you compare the .NET Core Version with the full framework version then you can see that the timeout property is honored due to the fact that the .NET Core version uses the HttpClient which throws an OperationCanceledException after the default timeout of 100 seconds. For the full framework version, the behavior is different. If you don't pass in a cancellation token that implements the timeout as well as potentially the shutdown SLA for EVERY async call to the SDK calls can indefinitely hang.

So as a consumer of the API I'm essentially forced to do the following

CancellationToken shutdownToken;
using(var linkedTokenSource = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken)) {
    linkedTokenSource.CancelAfter(clientConfig.Timeout);

    await sdk.FooBarAsync(, linkedTokenSource.Token);
}

This is very cumbersome plus code that is framework specific and assumes someone that is using the SDK actually has to understand that the full framework implementation happens to use HttpWebRequest while the .NET Core version doesn't.

Expected Behavior

Full framework version should also honor the client config timeout and time out calls after the specified period and throw an OperationCanceledException

Current Behavior

Calls may hang indefinitely

Possible Solution

See PR

Steps to Reproduce (for bugs)

For example receive from SQS with sqsClient.ReceiveMessagesAsync() with a server timeout of 20 seconds. Pull the cable. Wait a few minutes. Reconnect the cable. Send a new message to the queue and observe how the message is never received because of the sqsClient.ReceiveMessagesAsync() hangs

Context

Particular/NServiceBus.AmazonSQS#296

Your Environment

  • AWSSDK.Core version used:
  • Service assembly and version used: AWSSDK.SQS.dll: 3.3.3.11, AWSSDK.Core.dll: 3.3.24.3
  • Operating System and version:
  • Visual Studio version:
  • Targeted .NET platform: .NET 4.5.2

.NET Core Info

Affects full framework

@klaytaybai
Copy link
Contributor

@danielmarbach, thanks for raising this again and submitting the PR. It seems reasonable to me to add this into the SDK, but I'll get the team's opinion.

@staylr
Copy link

staylr commented May 30, 2019

Hi @klaytaybai. We're the downstream user who reported this. Any update?

@danielmarbach
Copy link
Contributor Author

Any progress on this one?

@staylr
Copy link

staylr commented Jul 12, 2019

@klaytaybai ?

@danielmarbach
Copy link
Contributor Author

I don't want to sound impatient here but I think this issue requires a bit of attention after almost 3 months? // @klaytaybai @normj

@danielmarbach
Copy link
Contributor Author

Thanks!

@normj
Copy link
Member

normj commented Aug 16, 2019

No, thank you!

@slang25
Copy link
Contributor

slang25 commented Aug 16, 2019

A little late to the party, apologies. This once took down the whole Just Eat platform in a region when the NAT restarted and all SQS listeners were indefinitely awaiting, so this is a big deal, nice one @danielmarbach!

@indy-singh
Copy link
Contributor

Is there any reason why OperationCanceledException is thrown here and not a TimeoutException?

Background:

Yesterday we upgraded our AWS libs across the board and today incurred a platform-wide outage; the sqs client has a timeout of 5 seconds, but the queue is long polling at 20 seconds. I get that the scenario where the client timeout is less than queue poll time is invalid (and should always be). But I strongly believe this should have thrown a TimeoutException.

Thanks,
Indy

@danielmarbach
Copy link
Contributor Author

As far as I'm concerned when I fixed this I wanted the behaviour to be aligned across the TFMs so I aimed for the same design as the HttpClient even though some people find it suboptimal I felt it is at least consistent with the out of the box experience of the framework.

dotnet/runtime#21965

The team has decided to set the InnerException of the OperationCanceledException to a Timeoutexception to differentiate user cancellation vs timeouts. This seams to be a reasonable tradeoff that could be taken here as well. See

dotnet/runtime#2281

FYI I'm just providing context as I'm the original author of the fix that got approved and merged but I do not work for AWS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved.
Projects
None yet
Development

No branches or pull requests

6 participants