Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle internal errors like OOM in entity processing by aborting the … #2234

Merged
merged 3 commits into from
Aug 12, 2022

Conversation

sebastianburckhardt
Copy link
Collaborator

@sebastianburckhardt sebastianburckhardt commented Jul 22, 2022

Customer observed a case (#2166) where an entity goes into a permanently failed state after encountering an OOM error.

This PR addresses this issue by

  • catching OOM (and other) exceptions inside the TaskEntityShim.Execute method
  • throwing a SessionAbortedException at some later point to ensure the workitem gets aborted

Issue describing the changes in this PR

resolves #2166

Pull request checklist

  • My changes do not require documentation changes
  • I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question which may or may not require changes, but otherwise I think this change looks good.

// here so DTFx can abort the batch and back off the work item
entityContext.ThrowInternalExceptionIfAny();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that there were two places where we previously called ThrowInternalExceptionIfAny. Does the other location not require any changes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other location is alright: it throws after DTFx has done its thing (persist history etc). At that point, throwing an exception is simply a way to tell the Functions telemetry that something went wrong inside this function. We do this for user code errors also, so they show up in the portal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably test that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I tested it: it is fine. Those exceptions are being caught and wrapped, and then reported. My explanation above was not right though: it does not happen after DTFx finishes, but before.

@sebastianburckhardt sebastianburckhardt merged commit b47e9c8 into dev Aug 12, 2022
@davidmrdavid davidmrdavid deleted the sebastian/handle-cornercase-oom branch August 15, 2022 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure Durable Entity cannot be reached anymore , seems like signals are being queued and not read
2 participants