Shutdown Broker gracefully, but forcefully after brokerShutdownTimeoutMs #10199

lhotari · 2021-04-12T14:54:29Z

Motivation

Broker shutdown isn't currently graceful, although it is intended to be.
Another problem, seen in tests, is that broker shutdown doesn't complete synchronously.
This PR addresses both problems.

The changes will also help improve CI stability, because the asynchronous tasks of broker shutdown can now be controlled in tests. This prevents problems caused by too many brokers being active at the same time, which can currently happen while previous brokers are still shutting down asynchronously.

Modifications

Goal of the changes: shut down the broker gracefully, but forcefully after brokerShutdownTimeoutMs
- wait for the event loops to close before continuing to close other services in broker shutdown
- configure the event loop shutdown parameters, since the default shutdown timeout is 15 seconds after a 2 second quiet period
- close executor services forcefully if the broker shutdown times out
- use a shutdown timeout of 0 in tests, which triggers an immediate forceful shutdown
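
As a rough illustration of the "graceful first, forceful after a timeout" idea in the list above, here is a minimal sketch that uses only the JDK `ExecutorService` API. The class name `TwoPhaseShutdown` and its `shutdown` method are hypothetical and are not taken from this PR; the actual change also covers event loops and other broker services.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not a class from the Pulsar code base.
final class TwoPhaseShutdown {
    private TwoPhaseShutdown() {
    }

    /**
     * Returns true if the executor terminated on its own within timeoutMs,
     * false if it had to be stopped with shutdownNow().
     */
    static boolean shutdown(ExecutorService executor, long timeoutMs) {
        // Phase 1: stop accepting new tasks; already submitted tasks keep running.
        executor.shutdown();
        boolean interrupted = false;
        try {
            if (executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            interrupted = true;
        }
        // Phase 2: interrupt running tasks and discard the ones still queued.
        executor.shutdownNow();
        if (interrupted) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
        }
        return false;
    }
}
```

With a timeout of 0, `awaitTermination` does not wait, so an executor that still has a running task goes straight to `shutdownNow()`. That matches the "immediate forceful shutdown" behavior described for tests.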

lhotari · 2021-04-12T18:36:21Z

This PR continues the work started in PR #9308.

michaeljmarshall

I am new to the broker's clean shutdown, but based on reading through your PR, these changes look good to me, other than a single variable that I think might benefit from the volatile keyword.

...common/src/main/java/org/apache/pulsar/common/util/CompletableFutureCancellationHandler.java

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java

pulsar-broker/src/main/java/org/apache/pulsar/broker/MessagingServiceShutdownHook.java

pulsar-broker/src/main/java/org/apache/pulsar/broker/PulsarService.java
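
The thread does not say which variable the `volatile` suggestion refers to. As a generic illustration of why it can matter, the sketch below shows a cancel action that is set by one thread and read by whichever thread completes or cancels the future. The class `CancellableOperation` and all of its members are hypothetical; this is not the code of `CompletableFutureCancellationHandler`.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical class for illustration; not the handler from this PR.
final class CancellableOperation<T> {
    // volatile: written by the thread that registers the action, read by the
    // thread that completes or cancels the future.
    private volatile Runnable cancelAction;
    private final CompletableFuture<T> future = new CompletableFuture<>();

    CancellableOperation() {
        future.whenComplete((value, error) -> {
            Runnable action = cancelAction;
            // Drop the reference so the action can be garbage collected
            // once the future is done.
            cancelAction = null;
            if (action != null && future.isCancelled()) {
                action.run();
            }
        });
    }

    // Simplification: an action registered after completion is ignored, and
    // the check is not atomic with respect to a concurrent completion.
    void setCancelAction(Runnable action) {
        if (!future.isDone()) {
            cancelAction = action;
        }
    }

    boolean hasCancelAction() {
        return cancelAction != null;
    }

    CompletableFuture<T> future() {
        return future;
    }
}
```

Without `volatile`, the Java memory model gives no guarantee that the completing thread sees the most recently registered action.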

lhotari · 2021-04-15T10:04:31Z

/pulsarbot run-failure-checks

lhotari · 2021-04-15T11:15:07Z

/pulsarbot run-failure-checks

sijie · 2021-04-15T15:39:55Z

@merlimat to review

lhotari · 2021-04-15T16:01:03Z

/pulsarbot run-failure-checks

merlimat

👍 The approach LGTM. Just a couple of questions on the await termination

pulsar-broker/src/main/java/org/apache/pulsar/broker/PulsarService.java

.../src/main/java/org/apache/pulsar/broker/service/GracefulExecutorServicesShutdownHandler.java

lhotari · 2021-04-16T09:30:08Z

> The approach LGTM. Just a couple of questions on the await termination

@merlimat Thank you for your review and the good points about await termination.
Please ignore some of the replies I made, since I hadn't looked at the full context of the code when replying.
I revisited the code after taking another look at your review comments.
I have rewritten the solution to use .awaitTermination, replacing the previous solution that used a ScheduledExecutorService. Please take another look.
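
To make the `awaitTermination` approach concrete, here is a minimal sketch of shutting down several executors against one shared deadline, then forcing the stragglers. It is an assumption-laden illustration using only JDK calls. The name `SharedDeadlineShutdown` and its `shutdownAll` method are hypothetical and do not reproduce the handler class added in this PR.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not the class added in this PR.
final class SharedDeadlineShutdown {
    private SharedDeadlineShutdown() {
    }

    /** Returns the number of executors that had to be stopped forcefully. */
    static int shutdownAll(List<ExecutorService> executors, long timeoutMs) {
        // Ask every executor to stop first, so they all drain in parallel.
        executors.forEach(ExecutorService::shutdown);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        int forced = 0;
        boolean interrupted = false;
        for (ExecutorService executor : executors) {
            boolean terminated;
            try {
                // Each wait only gets what is left of the shared budget.
                long remaining = Math.max(0L, deadline - System.nanoTime());
                terminated = interrupted
                        ? executor.isTerminated()
                        : executor.awaitTermination(remaining, TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                interrupted = true;
                terminated = executor.isTerminated();
            }
            if (!terminated) {
                executor.shutdownNow();
                forced++;
            }
        }
        if (interrupted) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
        }
        return forced;
    }
}
```

Because the deadline is computed once, the total wait stays bounded by `timeoutMs` no matter how many executors are passed in, and no separate scheduler is needed to enforce the timeout.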

lhotari · 2021-04-16T13:51:38Z

/pulsarbot run-failure-checks

### Motivation

The KoP CI tests take much more time than the CI tests of branch-2.7.2. The main reason is that the `cleanup()` phase takes a long time: each time a test is cleaned up, it takes over 10 seconds to complete. This behavior was introduced by apache/pulsar#10199, which made the broker shut down gracefully by default, at the cost of a longer shutdown.

The other reason is rebalance time. According to my observations, when a Kafka consumer subscribes to a topic in KoP, it takes at least 3 seconds. I eventually found that this is caused by the GroupInitialRebalanceDelayMs config, which has the same semantics as Kafka's [group.initial.rebalance.delay.ms](https://kafka.apache.org/documentation/#brokerconfigs_group.initial.rebalance.delay.ms). It makes the Kafka server wait longer on a `JOIN_GROUP` request so that more consumers can join and the number of rebalances is reduced. However, it should be set to zero in tests.

After fixing these problems, the following error may sometimes occur and cause flakiness.

```
TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies -
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
```

This is because `PulsarService#newCompactor` is not mocked properly; see apache/pulsar#7102 for details.

### Modifications

- Configure `GroupInitialRebalanceDelayMs` and `BrokerShutdownTimeoutMs` for each mocked `BrokerService`.
- Fix the flakiness caused by mocking the compactor.

After the changes, test time improves significantly. For example, `GroupCoordinatorTest` now takes only 3 minutes, where it could take 9 minutes before, because the `cleanup()` method is marked as `@AfterMethod` and is called each time a single test finishes. Another example: `BasicEndToEndKafkaTest` now takes only 37 seconds, where it could take 56 seconds before. Its `cleanup()` is marked as `@AfterClass` and only runs once, but many consumers are created over the whole test and each subscribe call can take 3 seconds.

* Speed up tests and fix flakiness
* Speed up tests that create their own configs
* Ignore golang-sarama tests

lhotari mentioned this pull request Apr 12, 2021

[CI] Move flaky tests that fail very often to "quarantine" test group #10148

Merged

michaeljmarshall requested changes Apr 13, 2021

...common/src/main/java/org/apache/pulsar/common/util/CompletableFutureCancellationHandler.java

lhotari force-pushed the lh-completely-shutdown-broker branch from dfc6747 to 32fa7d1 April 13, 2021 06:49

codelipenghui reviewed Apr 13, 2021

codelipenghui requested review from sijie, jiazhai and merlimat April 13, 2021 12:24

codelipenghui assigned lhotari Apr 13, 2021

codelipenghui added the area/test label Apr 13, 2021

codelipenghui added this to the 2.8.0 milestone Apr 13, 2021

lhotari force-pushed the lh-completely-shutdown-broker branch 2 times, most recently from 405c093 to e1febb1 April 15, 2021 09:23

lhotari requested a review from codelipenghui April 15, 2021 09:31

lhotari force-pushed the lh-completely-shutdown-broker branch from e1febb1 to 79e800a April 15, 2021 12:37

lhotari mentioned this pull request Apr 15, 2021

Wait for the async broker port listener close operations to complete at shutdown #9308

Merged

codelipenghui approved these changes Apr 15, 2021

lhotari force-pushed the lh-completely-shutdown-broker branch from 79e800a to 4e7488a April 15, 2021 15:05

merlimat reviewed Apr 15, 2021

lhotari added 6 commits April 16, 2021 15:10

Wait for shutdown of BrokerService event loops

669b548

Move CompletableFutureCancellationHandler to pulsar-common util

d77e266

Prevent misusage of CompletableFutureCancellationHandler

40f7bde

Clear cancelAction field after the future completes

321ccf8

- make cancelAction eligible for GC after the future completes (or gets cancelled)

Support cancel signalling when executing multiple futures

7203920

Shutdown BrokerService gracefully which using closeAsync

3ffe19a

- shutdown forcefully when closing times out

lhotari added 11 commits April 16, 2021 15:10

Set 100ms to brokerShutdownTimeoutMs used in tests

f83f318

Revert changes in MessagingServiceShutdownHook

f6aa6b3

Handle CancellationException since it's used in timeouts in BrokerService

5b4574a

Set shutdown timeout to 0 ms in tests

7a9cf0b

Ignore TimeoutException and CancellationException when broker shutdown timeout is 0

6cefd7f

Extract GracefulExecutorServicesShutdown and use it in PulsarService

833b866

- handle graceful / forcefully shutdown also for PulsarService executors

Fix some unclosed PulsarServices

fe5433e

Set shutdown timeout to 0 in some more tests

cc38f9e

Do some class and method renamings to clarify the code

9cc5212

Revisit the logic to use awaitTermination

92b3895

Use shutdownNow to shutdown the scheduler used for future timeout handling

e4ad678

lhotari force-pushed the lh-completely-shutdown-broker branch from 532e51c to e4ad678 April 16, 2021 12:15

merlimat approved these changes Apr 16, 2021

merlimat merged commit 152d1e6 into apache:master Apr 16, 2021

merlimat mentioned this pull request Apr 20, 2021

Graceful shutdown stopped working on master #10289

Closed

sijie mentioned this pull request Apr 20, 2021

ISSUE-10289: Graceful shutdown stopped working on master streamnative/pulsar-archived#2420

Closed

BewareMyPower mentioned this pull request Jul 27, 2021

Speed up tests and fix flakiness streamnative/kop#628

Merged
