Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Speed up tests and fix flakiness #628

Merged

Conversation

BewareMyPower
Copy link
Collaborator

Motivation

The KoP CI tests take much more time than CI tests of branch-2.7.2.

The main reason is the cleanup() phase takes a long time, each time a test is cleaned up, it will take over 10 seconds to complete. This behavior was introduced from apache/pulsar#10199, which made broker shutdown gracefully by default but it would take longer to shutdown.

The other reason is caused by rebalance time. According to my observes, when a Kafka consumer subscribes a topic in KoP, it will take at least 3 seconds. Finally I found it's caused by the GroupInitialRebalanceDelayMs config, which has the same semantics with Kafka's group.initial.rebalance.delay.ms. It makes Kafka server wait longer for JOIN_GROUP request for more consumers to join so that the rebalance count can reduce. However, it should be set zero in tests.

After fixing these problems, sometimes the following error may happen and cause flakiness.

TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.

It's because PulsarService#newCompactor is not mocked well, see apache/pulsar#7102 for detail.

Modifications

  • Configure GroupInitialRebalanceDelayMs and BrokerShutdownTimeoutMs for each mocked BrokerService.
  • Fix the flakiness caused by mocking compactor.

After the changes, the tests time has a significant improvement.

For example, GroupCoordinatorTest takes only 3 minutes now but it could take 9 minutes before. Because the cleanup() method is marked as @AfterMethod and would be called each time a single test finished.

Another example is that BasicEndToEndKafkaTest takes only 37 seconds now but it could take 56 seconds before. The cleanup() is marked as @AfterClass and only happens once, but many consumers will be created during the whole tests and each time a subscribe call can take 3 seconds.

@BewareMyPower BewareMyPower self-assigned this Jul 27, 2021
@BewareMyPower BewareMyPower added the type/enhancement Indicates an improvement to an existing feature label Jul 27, 2021
@BewareMyPower
Copy link
Collaborator Author

In this PR:

kop tests / build (pull_request) Successful in 21m

However, it took 36 minutes in https://github.com/streamnative/kop/runs/3154656221

@BewareMyPower
Copy link
Collaborator Author

Now all tests passed, PTAL @jiazhai

@jiazhai jiazhai merged commit eee30f0 into streamnative:master Jul 29, 2021
@BewareMyPower BewareMyPower deleted the bewaremypower/speedup-tests branch July 29, 2021 04:29
BewareMyPower added a commit that referenced this pull request Aug 5, 2021
### Motivation

The KoP CI tests take much more time than CI tests of branch-2.7.2. 

The main reason is the `cleanup()` phase takes a long time, each time a test is cleaned up, it will take over 10 seconds to complete. This behavior was introduced from apache/pulsar#10199, which made broker shutdown gracefully by default but it would take longer to shutdown.

The other reason is caused by rebalance time. According to my observes, when a Kafka consumer subscribes a topic in KoP, it will take at least 3 seconds. Finally I found it's caused by the GroupInitialRebalanceDelayMs config, which has the same semantics with Kafka's [group.initial.rebalance.delay.ms](https://kafka.apache.org/documentation/#brokerconfigs_group.initial.rebalance.delay.ms). It makes Kafka server wait longer for `JOIN_GROUP` request for more consumers to join so that the rebalance count can reduce. However, it should be set zero  in tests.

After fixing these problems, sometimes the following error may happen and cause flakiness.

```
TwoPhaseCompactor$MockitoMock$1432102834 cannot be returned by getConfiguration()
getConfiguration() should return ServiceConfiguration
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
   - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
```

It's because `PulsarService#newCompactor` is not mocked well, see apache/pulsar#7102 for detail.

### Modifications

- Configure `GroupInitialRebalanceDelayMs` and `BrokerShutdownTimeoutMs` for each mocked `BrokerService`.
- Fix the flakiness caused by mocking compactor.

After the changes, the tests time has a significant improvement.

For example, `GroupCoordinatorTest` takes only 3 minutes now but it could take 9 minutes before. Because the `cleanup()` method is marked as `@AfterMethod` and would be called each time a single test finished.

Another example is that `BasicEndToEndKafkaTest` takes only 37 seconds now but it could take 56 seconds before. The `cleanup()` is marked as `@AfterClass` and only happens once, but many consumers will be created during the whole tests and each time a subscribe call can take 3 seconds.

* Speed up tests and fix flakiness

* Speed up tests that creat their own configs

* Ignore golang-sarama tests
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type/enhancement Indicates an improvement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants