Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't checkpoint when Event Hubs Listener is closing #1752

Closed
jeffhollan opened this issue Jun 14, 2018 · 5 comments
Closed

Don't checkpoint when Event Hubs Listener is closing #1752

jeffhollan opened this issue Jun 14, 2018 · 5 comments
Assignees
Labels
Milestone

Comments

@jeffhollan
Copy link

Here we do a checkpoint if an EventHubListener is closing because of a server shutdown

if (reason == CloseReason.Shutdown)
{
await context.CheckpointAsync();
}

I'm not sure why we would want to checkpoint in this case. Since this appears to be invokable in parallel to an execution that may be in progress (and not yet completed), seems it could result in an invalid checkpoint for in flight data.

Seems we shouldn't have these lines.

@paulbatum
Copy link
Member

I agree this looks like an issue. Here's an example sequence:

  1. Functions are in the process of executing.
  2. For one of several reasons, we decide we need to start a graceful shutdown of the host. This calls StopAsync on the listener.
  3. Stopping the listener triggers a call to CloseAsync on each partition listener, which cancels the cancellation token passed into user code.
  4. User code aborts the execution (assuming they are handling the token cancellation). Their functions did not finish running
  5. The line that Jeff highlighted checkpoints anyway, and so the system "loses" the data from the aborted executions.

@mathewc Can you see any mistakes in the above reasoning?

@paulbatum paulbatum added the bug label Jun 20, 2018
@paulbatum paulbatum added this to the Triaged milestone Jun 20, 2018
@mathewc
Copy link
Member

mathewc commented Jun 20, 2018

Agreed - this seems like an issue. To verify, I recommend we create a simple repro that demonstrates the problem that we can use to verify the fix.

@jeffhollan
Copy link
Author

Is tricky to repro is some cases as it relies on a race condition, that said I'm not sure how to simulate a CloseReason.Shutdown to hit this code path anyway? Either way I think the fix is relatively safe enough to make without a repro as the worst case we may just "at-least-once" re-deliver something where in the past it was delivered (or not delivered)

@mathewc
Copy link
Member

mathewc commented Jun 21, 2018

Possibly related: ICM 73233348

Also, I chatted with @MikeStall on this code, and his recollection was that it comes from sample listener code that he got from EH samples, e.g. as in this issue. As you can see, the SimpleEventProcessor in that issue checkpoints on close. However that simple sample only Closes the listener when the app is shutting down and all events are processed. Also it does no checkpointing in the message processor. So for that simple sample it won't checkpoint any unprocessed events.

Our case is different - with our cancellation token and execution model, we CAN checkpoint for cancelled/unprocessed events in rare cases.

@jeffhollan
Copy link
Author

Thanks for looking into this. Yes that ICM is what first made me inspect the code and find a potential gap. I don't think necessarily this is what they were hitting but is a possibility as it is clear during the logs of their incident during data loss was a transition of partition listeners between instances (one instance took a partition lock away from another) - so possible is related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants