Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core-amqp] OperationTimeoutError should be retryable error #2891

Closed
ramya-rao-a opened this issue May 14, 2019 · 8 comments
Closed

[core-amqp] OperationTimeoutError should be retryable error #2891

ramya-rao-a opened this issue May 14, 2019 · 8 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs
Milestone

Comments

@ramya-rao-a
Copy link
Contributor

ramya-rao-a commented May 14, 2019

The OperationTimeoutError that is listed in errors.ts in core-amqp gets thrown from rhea-promise with the code amqp:operation-timeout and is currently treated as a non retryable error.

With amqp/rhea-promise#42 and amqp/rhea-promise#43, amqp:operation-timeout is no longer a valid amqp code.

But since with #4171, we now need the concept of a OperationTimeoutError, below are the suggested next steps

In core-amqp:

  • Use version 0.2.0 of rhea-promise which has the fix to remove the invalid amqp code
  • Remove the mappings for amqp:operation-timeout as it is no longer a valid
  • translate() should treat OperationTimeoutError as a retryable error

In Event Hubs:

  • The error thrown when the per operation timeout is over should be OperationTimeoutError. We can re-use the one defined in rhea-promise

Before getting to code changes, a review of all existing ways an OperationTimeoutError could be thrown to the user should be done and listed. This way we know exactly what areas will be affected in which way by us making this error retryable.

@ramya-rao-a ramya-rao-a added Client This issue points to a problem in the data-plane of the library. Event Hubs labels May 14, 2019
@ramya-rao-a ramya-rao-a added this to the Sprint 153 milestone May 14, 2019
@ramya-rao-a ramya-rao-a modified the milestones: Sprint 153, Sprint 155 May 30, 2019
@ramya-rao-a ramya-rao-a changed the title [amqp-common] OperationTimeoutError should be retryable error [core-amqp] OperationTimeoutError should be retryable error Jun 26, 2019
@ramya0820 ramya0820 self-assigned this Jul 5, 2019
@ramya0820
Copy link
Member

About investigation on usage of OperationTimeoutError, following are some findings:

  • The error gets thrown when following operations exceed their configured timeouts
    • Create AMQP session

    • Close AMQP connection,

    • Create AMQP sender

    • Create receiver

    • Close AMQP link

      • (All of above use default 60 sec, configured as 'connection timeout')
    • Send message operation on AwaitableSender

      • This uses default of 20 sec, configured as sendTimeout. This seems to be different from the regular sender's that we use for which we will be managing the execution context from our SDK.

@ramya-rao-a
Copy link
Contributor Author

Sweet, so it is confirmed that rhea-promise is the only source of this error at the moment?

@ramya0820
Copy link
Member

Sweet, so it is confirmed that rhea-promise is the only source of this error at the moment?

Yes as far as downstream dependencies to core-amqp is concerned, rhea-promise would be it.

From users perspective, I recall test frameworks throw a similarly named error and it's unclear if the older amqp code was replaced with anything new to denote Service indicating this condition.

But as far as our SDKs are concerned, we do handle operation timeouts on the different APIs locally - we'll be updating these as decided once clarifications are through about overall approach (see conversation at #4178 (comment))

@ramya0820
Copy link
Member

But as far as our SDKs are concerned, we do handle operation timeouts on the different APIs locally - we'll be updating these as decided once clarifications are through about overall approach (see conversation at #4178 (comment))

Keeping aside about discussions the approach; About the question in itself if OperationTimeoutError should be treated as retryable - based on the definition/nature of error where there could be temporary delays/unavailability, answer seems to be a yes.

From SDKs perspective, while the different APIs across EventHubs, Service Bus can be scrutinized to see if we can identify new timeout scenarios and throw this error to the user if appropriate, as far as currently identified operation timeout scenarios are concerned which rely on the retry utility in core-amqp

  • the send operations already treat OperationTimeoutError as retryable by catching and throwing SenderBusyError, so these can be updated to not handle this error locally, and let core-amqp retry
  • Other instances such as when invoking init / receive do not treat operation timeouts as retryable, and can be updated to catch this error and log, exit.

About all other APIs, operations that catch error and log, exit - they can remain as is as they anyway don't have provision to run using the retry utility.

@ramya0820
Copy link
Member

Discussed offline @ramya-rao-a
While #4228 is still in progress, we will be unblocking parts of this work to still come to closure/conclusion on the retryable-ness of operation timeout errors.

@ramya0820
Copy link
Member

Updated PR to get around CI issue with Rush.

About OperationTimeoutError being made retryable, since this will need changes to be made to EventHubs and, #4171, #2835 are in flight involving work on identifying targeted user experiences and expected implications of timeout on Event Hubs, further changes can be taken up once those are resolved. Because, updating implementation to add/change the way we do retries from Event Hubs would impact and result in consuming the OperationTimeoutError handling.
Specifically, we'll need closure on how the receivers will handle timeouts as these currently treat OperationTimeoutError as non-retryable.

Service Bus can also be updated, as it now depends on core-amqp as well.

cc @ramya-rao-a

@ramya0820
Copy link
Member

Discussed offline, since handling of OperationTimeoutError is ongoing and this pertains to Track 2, we are okay to scope this work to just making the error retryable.

@ramya0820
Copy link
Member

Closed with #4186

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs
Projects
None yet
Development

No branches or pull requests

2 participants