Shutdown Broker gracefully, but forcefully after brokerShutdownTimeoutMs #10199
Conversation
This PR continues the work started in PR #9308.
I am new to the broker's clean shutdown, but based on reading through your PR, these changes look good to me, other than a single variable that I think might benefit from the `volatile` keyword.
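For context, here is a minimal illustration of why a flag read across threads during shutdown may need `volatile`. The class and field names are hypothetical, not taken from the PR:

```java
// Sketch: without `volatile`, a worker thread may keep seeing a stale
// cached value of `closed` after the shutdown thread has set it.
public class ShutdownAware {
    private volatile boolean closed = false;

    public void close() {
        closed = true; // written by the shutdown thread
    }

    public boolean isClosed() {
        return closed; // reads are guaranteed to see the latest write
    }
}
```

With `volatile`, the write in `close()` establishes a happens-before relationship with subsequent reads of `isClosed()` on other threads.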
/pulsarbot run-failure-checks
@merlimat to review

/pulsarbot run-failure-checks
👍 The approach LGTM. Just a couple of questions on the await-termination logic.
@merlimat Thank you for your review and the good points about await termination.
- make cancelAction eligible for GC after the future completes (or is cancelled)
- shut down forcefully when closing times out
- handle graceful / forceful shutdown also for PulsarService executors
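The graceful-then-forceful pattern described above can be sketched with a plain `ExecutorService`. The helper name and timeout handling here are illustrative, not the PR's actual implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: request an orderly shutdown, wait up to a timeout,
// then fall back to a forceful shutdown.
public class GracefulShutdownSketch {
    public static boolean shutdownGracefully(ExecutorService executor, long timeoutMs)
            throws InterruptedException {
        executor.shutdown(); // stop accepting new tasks, let queued tasks finish
        if (executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
            return true; // terminated within the deadline
        }
        executor.shutdownNow(); // forceful: interrupt running tasks, drop queued ones
        return executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

The second `awaitTermination` gives interrupted tasks a chance to finish cleaning up before the caller gives up entirely.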
/pulsarbot run-failure-checks
### Motivation

The KoP CI tests take much more time than the CI tests of branch-2.7.2. The main reason is that the `cleanup()` phase takes a long time: each time a test is cleaned up, it takes over 10 seconds to complete. This behavior was introduced by apache/pulsar#10199, which made broker shutdown graceful by default, at the cost of a longer shutdown.

The other reason is rebalance time. Based on my observations, when a Kafka consumer subscribes to a topic in KoP, it takes at least 3 seconds. I found this is caused by the `GroupInitialRebalanceDelayMs` config, which has the same semantics as Kafka's [group.initial.rebalance.delay.ms](https://kafka.apache.org/documentation/#brokerconfigs_group.initial.rebalance.delay.ms). It makes the Kafka server wait longer for `JOIN_GROUP` requests so that more consumers can join and fewer rebalances are needed. However, it should be set to zero in tests.

After fixing these problems, sometimes the following error may happen and cause flakiness:

```
TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies
   with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
```

This happens because `PulsarService#newCompactor` is not mocked well; see apache/pulsar#7102 for details.

### Modifications

- Configure `GroupInitialRebalanceDelayMs` and `BrokerShutdownTimeoutMs` for each mocked `BrokerService`.
- Fix the flakiness caused by mocking the compactor.

After these changes, test times improve significantly. For example, `GroupCoordinatorTest` takes only 3 minutes now, while it took 9 minutes before, because its `cleanup()` method is marked `@AfterMethod` and runs after every single test. Another example: `BasicEndToEndKafkaTest` takes only 37 seconds now instead of 56 seconds before. Its `cleanup()` is marked `@AfterClass` and runs only once, but many consumers are created over the course of the tests and each subscribe call could take 3 seconds.

* Speed up tests and fix flakiness
* Speed up tests that create their own configs
* Ignore golang-sarama tests
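The stubbing pitfall behind the quoted Mockito error can be shown with a minimal sketch. The class and method names here are illustrative; the real fix concerns stubbing `PulsarService#newCompactor`:

```java
import static org.mockito.Mockito.doReturn;
import static org.mockito.Mockito.spy;

// Illustrative service with a method we want to stub on a spy.
class Service {
    String getConfiguration() {
        return "real-config";
    }
}

public class SpyStubbingSketch {
    public static String stubSafely() {
        Service serviceSpy = spy(new Service());
        // when(serviceSpy.getConfiguration()).thenReturn(...) would invoke the
        // real method while stubbing, which on a spy can trigger the
        // "cannot be returned by getConfiguration()" error seen above.
        // doReturn(...).when(spy) stubs without calling the real method:
        doReturn("mocked-config").when(serviceSpy).getConfiguration();
        return serviceSpy.getConfiguration();
    }
}
```

This is the `doReturn|Throw()` family of methods that the Mockito error message itself recommends for spies.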
Motivation
Broker shutdown isn't currently graceful, although the intention is for it to be graceful.
Another problem is that in tests, broker shutdown doesn't complete synchronously.
This PR addresses both problems.
The changes will also help improve CI stability, since the asynchronous tasks of broker shutdown can be controlled in tests. This prevents problems caused by too many brokers being active at the same time, which can currently happen while previous brokers are still shutting down asynchronously.
Modifications