Scheduler & LeaseCoordinator run/shutdown race condition #427

danielcerutti · 2018-09-28T13:08:09Z

Scheduler::shutdown (or Worker for older revisions) can execute before the Scheduler::initialize phase has completed after Scheduler::run is called. This causes LeaseCoordinator::stop (where the LeaseTaker is canceled and shut down) to be called before LeaseCoordinator::start, where the LeaseTaker is actually initialized. As a result, the Scheduler and LeaseCoordinator end up in a shutdown state while the LeaseCoordinator's leaseCoordinatorThreadPool and takerFuture keep executing and taking leases from other Schedulers that are actually running.

Simply running the following test can reproduce the problem:

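    @Test
    public void runShutdownRaceCondition() throws Exception
    {
        // Start run() on another thread, then request shutdown right away,
        // before the initialize phase has had a chance to finish.
        new Thread(scheduler_).start();
        scheduler_.shutdown();

        // Block the test thread so the still-running LeaseCoordinator threads can be observed.
        while (true) {
            Thread.sleep(Long.MAX_VALUE);
        }
    }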
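
For illustration only, here is a minimal sketch of one way to make a stop-before-start ordering harmless. It is not the KCL code and not necessarily what the eventual fix does; `GuardedCoordinator`, its `stopped` flag, and the empty lease-taking task are hypothetical stand-ins. The idea is that stop() records that it ran, and start() refuses to create the thread pool afterwards, so nothing is left running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a lease coordinator; NOT the real LeaseCoordinator. */
class GuardedCoordinator {
    private final Object lock = new Object();
    private ScheduledExecutorService pool;   // created only in start()
    private ScheduledFuture<?> takerFuture;  // periodic "take leases" task
    private boolean stopped;                 // set once stop() has been called

    /** Returns true if background work is running, false if stop() already ran. */
    boolean start() {
        synchronized (lock) {
            if (stopped) {
                return false;                // stop() won the race: start nothing
            }
            if (pool != null) {
                return true;                 // already started
            }
            pool = Executors.newSingleThreadScheduledExecutor();
            takerFuture = pool.scheduleWithFixedDelay(
                    () -> { /* placeholder for lease taking */ },
                    0, 100, TimeUnit.MILLISECONDS);
            return true;
        }
    }

    /** Safe to call before, after, or concurrently with start(). */
    void stop() {
        synchronized (lock) {
            stopped = true;
            if (takerFuture != null) {
                takerFuture.cancel(false);
            }
            if (pool != null) {
                pool.shutdownNow();
            }
        }
    }

    boolean isRunning() {
        synchronized (lock) {
            return pool != null && !pool.isShutdown();
        }
    }
}
```

With this shape, calling stop() and then start() leaves no executor behind, which is the ordering the test above provokes. Both methods take the same lock, so a start() that is already in progress finishes before stop() cancels the task.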


sahilpalvia · 2018-10-18T22:03:06Z

Version 2.0.4 is now available. Closing this issue. Feel free to reopen if the problem persists.

matiaslb · 2018-10-18T22:05:24Z

Are there any plans to provide a fix for v1?

dharmeshspatel4u · 2019-02-09T21:46:05Z

@pfifer I'm seeing the issue below with client v1.8.1. This issue looks close to mine.


2019-02-09 21:10:01.875  INFO 26971 --- [      Thread-29] c.a.s.k.clientlibrary.lib.worker.Worker  : Worker shutdown requested.
2019-02-09 21:10:01.876  INFO 26971 --- [      Thread-29] c.a.s.k.leases.impl.LeaseCoordinator     : Worker ip-1234. has successfully stopped lease-tracking threads
2019-02-09 21:10:01.877  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Checkpointing shard shardId-000000000000
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] k.c.l.w.KinesisClientLibLeaseCoordinator : Worker ip-1234. could not update checkpoint for shard shardId-000000000000 because it does not hold the lease
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Caught shutdown exception, skipping checkpoint.

com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:174) ~[amazon-kinesis-client-1.8.1.jar!/:na]

Any clue whether this is the same issue? Sometimes the checkpoint gets updated, and sometimes it throws the error above and those messages are delivered to the consumer again. Is this fixed in 2.0.4? Moving to 2.x would be a very major upgrade for me, so I'm looking to see if 1.x has a fix for this issue.

I'd appreciate a quick response.

pfifer added the bug and v2.x (Issues related to the 2.x version) labels Oct 9, 2018

sahilpalvia mentioned this issue Oct 9, 2018

Quick fix for shutdown race condition #439

Merged

sahilpalvia added this to the v2.0.4 milestone Oct 9, 2018

sahilpalvia closed this as completed Oct 18, 2018
