Don't checkpoint when Event Hubs Listener is closing #1752

jeffhollan · 2018-06-14T14:59:46Z

Here we checkpoint when an EventHubListener is closing because of a server shutdown:

azure-webjobs-sdk/src/Microsoft.Azure.WebJobs.Extensions.EventHubs/EventHubListener.cs

Lines 118 to 121 in 7009477

    
    if (reason == CloseReason.Shutdown)
    {
        await context.CheckpointAsync();
    }

I'm not sure why we would want to checkpoint in this case. Since this appears to be invokable in parallel with an execution that may still be in progress (and not yet completed), it seems it could result in an invalid checkpoint for in-flight data.

It seems we shouldn't have these lines.

paulbatum · 2018-06-20T17:56:57Z

I agree this looks like an issue. Here's an example sequence:

1. Functions are in the process of executing.
2. For one of several reasons, we decide we need to start a graceful shutdown of the host. This calls StopAsync on the listener.
3. Stopping the listener triggers a call to CloseAsync on each partition listener, which cancels the cancellation token passed into user code.
4. User code aborts the execution (assuming they are handling the token cancellation). Their functions did not finish running.
5. The line that Jeff highlighted checkpoints anyway, and so the system "loses" the data from the aborted executions.
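
For illustration, here is a minimal, self-contained sketch of that sequence. It does not use the real Event Hubs types: FakePartitionContext and FakePartitionListener are hypothetical stand-ins, and the sketch assumes that a parameterless checkpoint records the offset of the last event delivered to the processor, whether or not the function finished with it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the partition context (not the real SDK type).
// Assumption: a parameterless checkpoint persists the last *delivered* offset.
class FakePartitionContext
{
    public long LastDeliveredOffset = -1;
    public long CheckpointedOffset = -1;

    public Task CheckpointAsync()
    {
        CheckpointedOffset = LastDeliveredOffset;
        return Task.CompletedTask;
    }
}

// Hypothetical stand-in for a partition listener with checkpoint-on-shutdown.
class FakePartitionListener
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    public long LastCompletedOffset = -1;

    public async Task ProcessEventsAsync(FakePartitionContext context, long[] offsets)
    {
        context.LastDeliveredOffset = offsets[offsets.Length - 1];
        try
        {
            foreach (long offset in offsets)
            {
                // Simulated user function: the first event completes, later
                // events block until the token is cancelled.
                if (offset > offsets[0])
                {
                    await Task.Delay(Timeout.Infinite, _cts.Token);
                }
                LastCompletedOffset = offset;
            }
            await context.CheckpointAsync(); // normal path: batch fully processed
        }
        catch (OperationCanceledException)
        {
            // User code honored the token and aborted; the batch is incomplete.
        }
    }

    public async Task CloseAsync(FakePartitionContext context, bool shutdown)
    {
        _cts.Cancel();
        if (shutdown)
        {
            await context.CheckpointAsync(); // the behavior in question
        }
    }
}

static class Program
{
    static async Task Main()
    {
        var context = new FakePartitionContext();
        var listener = new FakePartitionListener();

        Task inFlight = listener.ProcessEventsAsync(context, new long[] { 0, 1, 2 });
        await listener.CloseAsync(context, shutdown: true);
        await inFlight;

        // prints "completed=0, checkpointed=2"
        Console.WriteLine($"completed={listener.LastCompletedOffset}, checkpointed={context.CheckpointedOffset}");
    }
}
```

In this model, events 1 and 2 were never processed, but the checkpoint has already moved past them, so a later reader would skip them.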

@mathewc Can you see any mistakes in the above reasoning?

mathewc · 2018-06-20T18:36:05Z

Agreed - this seems like an issue. To verify, I recommend we create a simple repro that demonstrates the problem that we can use to verify the fix.

jeffhollan · 2018-06-20T21:26:07Z

It's tricky to repro in some cases because it relies on a race condition. That said, I'm not sure how to simulate a CloseReason.Shutdown to hit this code path anyway. Either way, I think the fix is safe enough to make without a repro: in the worst case we may just re-deliver something ("at-least-once") where in the past it was delivered (or not delivered).
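
To make the trade-off concrete, here is a rough sketch of what a reader would see after a restart under each behavior. OffsetsAfterRestart is a hypothetical helper, not part of any SDK, and it assumes delivery resumes at the offset right after the stored checkpoint. The scenario is a batch with offsets 0 to 2 where only event 0 finished before shutdown.

```csharp
using System;
using System.Collections.Generic;

static class RedeliveryDemo
{
    // Hypothetical helper: offsets a new reader receives after a restart,
    // assuming delivery resumes right after the stored checkpoint.
    public static List<long> OffsetsAfterRestart(long checkpointedOffset, long lastOffsetInPartition)
    {
        var offsets = new List<long>();
        for (long o = checkpointedOffset + 1; o <= lastOffsetInPartition; o++)
        {
            offsets.Add(o);
        }
        return offsets;
    }

    static void Main()
    {
        // With checkpoint-on-close: checkpoint is 2, so nothing is redelivered
        // and the unprocessed events 1 and 2 are lost.
        Console.WriteLine("with close checkpoint: [" + string.Join(",", OffsetsAfterRestart(2, 2)) + "]");

        // Without it: checkpoint stays at -1 (no batch completed), so the whole
        // batch is redelivered, including event 0, which already ran once.
        Console.WriteLine("without close checkpoint: [" + string.Join(",", OffsetsAfterRestart(-1, 2)) + "]");
    }
}
```

The second case is the "at-least-once" worst case: event 0 runs twice, but nothing is skipped.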

mathewc · 2018-06-21T23:07:12Z

Possibly related: ICM 73233348

Also, I chatted with @MikeStall on this code, and his recollection was that it comes from sample listener code that he got from EH samples, e.g. as in this issue. As you can see, the SimpleEventProcessor in that issue checkpoints on close. However, that simple sample only closes the listener when the app is shutting down and all events have been processed. It also does no checkpointing in the message processor. So for that simple sample, it won't checkpoint any unprocessed events.

Our case is different - with our cancellation token and execution model, we CAN checkpoint for cancelled/unprocessed events in rare cases.

jeffhollan · 2018-06-22T03:08:39Z

Thanks for looking into this. Yes, that ICM is what first made me inspect the code and find a potential gap. I don't think this is necessarily what they were hitting, but it is a possibility: the logs from their incident make it clear that the data loss coincided with a transition of partition listeners between instances (one instance took a partition lock away from another), so it is possibly related.

paulbatum added the bug label Jun 20, 2018

paulbatum added this to the Triaged milestone Jun 20, 2018

mathewc added a commit that referenced this issue Jun 29, 2018

Fixes for EventHubTrigger checkpointing (#1752)

461259d

mathewc mentioned this issue Jun 29, 2018

Fixes for EventHubTrigger checkpointing (#1752) #1783

Merged

mathewc added a commit that referenced this issue Jun 29, 2018

Fixes for EventHubTrigger checkpointing (#1752)

db93b7f

mathewc self-assigned this Jun 29, 2018

mathewc added a commit that referenced this issue Jun 29, 2018

Fixes for EventHubTrigger checkpointing (#1752)

f4157c7

mathewc closed this as completed Jun 29, 2018

mathewc added a commit that referenced this issue Jul 4, 2018

Fixes for EventHubTrigger checkpointing (#1752)

72be184
