[core-amqp] OperationTimeoutError should be retryable error #2891

ramya-rao-a · 2019-05-14T19:20:03Z

The OperationTimeoutError that is listed in errors.ts in core-amqp gets thrown from rhea-promise with the code amqp:operation-timeout and is currently treated as a non retryable error.

With amqp/rhea-promise#42 and amqp/rhea-promise#43, amqp:operation-timeout is no longer a valid amqp code.

But since with #4171, we now need the concept of a OperationTimeoutError, below are the suggested next steps

In core-amqp:

Use version 0.2.0 of rhea-promise which has the fix to remove the invalid amqp code
Remove the mappings for amqp:operation-timeout as it is no longer a valid
translate() should treat OperationTimeoutError as a retryable error

In Event Hubs:

The error thrown when the per operation timeout is over should be OperationTimeoutError. We can re-use the one defined in rhea-promise

Before getting to code changes, a review of all existing ways an OperationTimeoutError could be thrown to the user should be done and listed. This way we know exactly what areas will be affected in which way by us making this error retryable.

The text was updated successfully, but these errors were encountered:

ramya0820 · 2019-07-06T00:07:42Z

About investigation on usage of OperationTimeoutError, following are some findings:

The error gets thrown when following operations exceed their configured timeouts
- Create AMQP session
- Close AMQP connection,
- Create AMQP sender
- Create receiver
- Close AMQP link
  - (All of above use default 60 sec, configured as 'connection timeout')
- Send message operation on AwaitableSender
  - This uses default of 20 sec, configured as sendTimeout. This seems to be different from the regular sender's that we use for which we will be managing the execution context from our SDK.

ramya-rao-a · 2019-07-06T01:40:40Z

Sweet, so it is confirmed that rhea-promise is the only source of this error at the moment?

ramya0820 · 2019-07-08T17:36:21Z

Sweet, so it is confirmed that rhea-promise is the only source of this error at the moment?

Yes as far as downstream dependencies to core-amqp is concerned, rhea-promise would be it.

From users perspective, I recall test frameworks throw a similarly named error and it's unclear if the older amqp code was replaced with anything new to denote Service indicating this condition.

But as far as our SDKs are concerned, we do handle operation timeouts on the different APIs locally - we'll be updating these as decided once clarifications are through about overall approach (see conversation at #4178 (comment))

ramya0820 · 2019-07-08T18:25:59Z

But as far as our SDKs are concerned, we do handle operation timeouts on the different APIs locally - we'll be updating these as decided once clarifications are through about overall approach (see conversation at #4178 (comment))

Keeping aside about discussions the approach; About the question in itself if OperationTimeoutError should be treated as retryable - based on the definition/nature of error where there could be temporary delays/unavailability, answer seems to be a yes.

From SDKs perspective, while the different APIs across EventHubs, Service Bus can be scrutinized to see if we can identify new timeout scenarios and throw this error to the user if appropriate, as far as currently identified operation timeout scenarios are concerned which rely on the retry utility in core-amqp

the send operations already treat OperationTimeoutError as retryable by catching and throwing SenderBusyError, so these can be updated to not handle this error locally, and let core-amqp retry
Other instances such as when invoking init / receive do not treat operation timeouts as retryable, and can be updated to catch this error and log, exit.

About all other APIs, operations that catch error and log, exit - they can remain as is as they anyway don't have provision to run using the retry utility.

ramya0820 · 2019-07-10T01:23:09Z

Discussed offline @ramya-rao-a
While #4228 is still in progress, we will be unblocking parts of this work to still come to closure/conclusion on the retryable-ness of operation timeout errors.

ramya0820 · 2019-07-11T22:47:52Z

Updated PR to get around CI issue with Rush.

About OperationTimeoutError being made retryable, since this will need changes to be made to EventHubs and, #4171, #2835 are in flight involving work on identifying targeted user experiences and expected implications of timeout on Event Hubs, further changes can be taken up once those are resolved. Because, updating implementation to add/change the way we do retries from Event Hubs would impact and result in consuming the OperationTimeoutError handling.
Specifically, we'll need closure on how the receivers will handle timeouts as these currently treat OperationTimeoutError as non-retryable.

Service Bus can also be updated, as it now depends on core-amqp as well.

cc @ramya-rao-a

ramya0820 · 2019-07-11T23:31:27Z

Discussed offline, since handling of OperationTimeoutError is ongoing and this pertains to Track 2, we are okay to scope this work to just making the error retryable.

ramya0820 · 2019-07-12T01:56:53Z

Closed with #4186

triage-new-issues bot added the triage label May 14, 2019

ramya-rao-a mentioned this issue May 14, 2019

[Service Bus] Timeout for message settlement should result in rejected promise #2886

Merged

ramya-rao-a added Client This issue points to a problem in the data-plane of the library. Event Hubs labels May 14, 2019

triage-new-issues bot removed triage labels May 14, 2019

ramya-rao-a added this to the Sprint 153 milestone May 14, 2019

ramya-rao-a modified the milestones: Sprint 153, Sprint 155 May 30, 2019

ramya-rao-a changed the title ~~[amqp-common] OperationTimeoutError should be retryable error~~ [core-amqp] OperationTimeoutError should be retryable error Jun 26, 2019

This was referenced Jun 26, 2019

[Event Hubs] Requests are not retried when link creation fails #2835

Closed

Release Event Hubs Track 2 Library Preview 2 #4081

Closed

ramya0820 self-assigned this Jul 5, 2019

ramya0820 mentioned this issue Jul 6, 2019

[core-amqp] Update to use and handle OperationTimeoutError #4186

Merged

ramya0820 mentioned this issue Jul 8, 2019

[core-amqp] [Event Hubs] Long-term solution for OperationTimeoutError handling #4206

Closed

ramya0820 closed this as completed Jul 12, 2019

ramya-rao-a mentioned this issue Jul 14, 2019

[Event Hubs] Improve the current implementation of Sender #3375

Closed

github-actions bot locked and limited conversation to collaborators Apr 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core-amqp] OperationTimeoutError should be retryable error #2891

[core-amqp] OperationTimeoutError should be retryable error #2891

ramya-rao-a commented May 14, 2019 •

edited

Loading

ramya0820 commented Jul 6, 2019

ramya-rao-a commented Jul 6, 2019

ramya0820 commented Jul 8, 2019

ramya0820 commented Jul 8, 2019

ramya0820 commented Jul 10, 2019

ramya0820 commented Jul 11, 2019

ramya0820 commented Jul 11, 2019

ramya0820 commented Jul 12, 2019

[core-amqp] OperationTimeoutError should be retryable error #2891

[core-amqp] OperationTimeoutError should be retryable error #2891

Comments

ramya-rao-a commented May 14, 2019 • edited Loading

ramya0820 commented Jul 6, 2019

ramya-rao-a commented Jul 6, 2019

ramya0820 commented Jul 8, 2019

ramya0820 commented Jul 8, 2019

ramya0820 commented Jul 10, 2019

ramya0820 commented Jul 11, 2019

ramya0820 commented Jul 11, 2019

ramya0820 commented Jul 12, 2019

ramya-rao-a commented May 14, 2019 •

edited

Loading