This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

kop using pulsar consumer metrics #263

Merged
merged 4 commits into streamnative:master from the consumer-metric branch on Feb 6, 2021

Conversation

dockerzhang
Contributor

fix for #248

Contributor Author

@dockerzhang dockerzhang left a comment


This is the first version of KoP using Pulsar consumer metrics.

@dockerzhang
Contributor Author

@jiazhai @BewareMyPower PTAL

@BewareMyPower
Collaborator

Please fix the license check first. I'll review this PR soon.

For now I just have one question: both org.apache.pulsar.zookeeper and org.apache.bookkeeper.util already provide a ZKUtils class. Is it necessary to implement the ZK utils again? Also, any change to the utils should have tests to verify it, no matter how minor it is.

@BewareMyPower BewareMyPower linked an issue Dec 11, 2020 that may be closed by this pull request
@BewareMyPower BewareMyPower added the type/feature (Indicates new functionality) label on Dec 11, 2020
@dockerzhang
Contributor Author

@BewareMyPower I will check whether ZKUtils in org.apache.bookkeeper.util has the methods we need.

@dockerzhang
Contributor Author

@BewareMyPower the PR that exposes updateStats with a parameter is apache/pulsar#8951.

@BewareMyPower
Collaborator

Got it.

@dockerzhang
Contributor Author

@BewareMyPower apache/pulsar#8951 has been merged. How can this PR be updated to use the latest method?

@BewareMyPower
Collaborator

@zymap I think we may need to change the pulsar dependency to sn-pulsar. Could you help with @dockerzhang's question?

@BewareMyPower
Collaborator

@dockerzhang I'll release a self-maintained pulsar soon, and then we can use it as KoP's dependency. Before that, I may release KoP 2.7.0 first to make KoP work with Pulsar 2.7.0.

@BewareMyPower BewareMyPower added this to the 2.8.0-M1 milestone Dec 21, 2020
@BewareMyPower
Collaborator

Now the latest pulsar dependency (2.8.0-rc-202012200040) is ready; please rebase onto the latest master.

@dockerzhang dockerzhang force-pushed the consumer-metric branch 3 times, most recently from f912533 to 1860bf9 Compare December 23, 2020 12:42
@BewareMyPower
Collaborator

Please rebase onto the latest master; some changes are also needed. We now use EntryFormatter to do the encoding and decoding (see #280), so you need to split the entriesToRecords API in this PR into two methods: EntryFormatter#decode and an extra method like updateStats:

public void updateStats(List<Entry> entries, Consumer consumer)
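
For illustration, here is a minimal sketch of how the decoupling might look. Only the idea of an EntryFormatter#decode step plus a separate updateStats method with the signature above comes from this discussion; the helper class, its readAndRecord method, and the simplified EntryFormatter stand-in are assumptions.

```java
import java.util.List;

import org.apache.bookkeeper.mledger.Entry;
import org.apache.kafka.common.record.MemoryRecords;
import org.apache.pulsar.broker.service.Consumer;

// Hypothetical helper used only to illustrate splitting entriesToRecords into decode + updateStats.
public class ReadEntriesHelper {

    /** Minimal stand-in for KoP's EntryFormatter; the real interface may take extra arguments (e.g. a magic byte). */
    public interface EntryFormatter {
        MemoryRecords decode(List<Entry> entries);
    }

    private final EntryFormatter entryFormatter;

    public ReadEntriesHelper(EntryFormatter entryFormatter) {
        this.entryFormatter = entryFormatter;
    }

    public MemoryRecords readAndRecord(List<Entry> entries, Consumer consumer) {
        // Step 1: decoding stays in the formatter (EntryFormatter#decode, see #280).
        MemoryRecords records = entryFormatter.decode(entries);
        // Step 2: the stats update becomes a separate, optional step.
        if (consumer != null) {
            updateStats(entries, consumer);
        }
        return records;
    }

    // Signature taken from the review comment above; the concrete metric updates are omitted here.
    public void updateStats(List<Entry> entries, Consumer consumer) {
        // e.g. increment the Pulsar consumer's message/byte counters based on the entries just read.
    }
}
```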

@dockerzhang
Contributor Author

ok, I will add it tomorrow.

@dockerzhang dockerzhang force-pushed the consumer-metric branch 3 times, most recently from 9aea04a to b84e276 Compare December 29, 2020 05:07
@dockerzhang dockerzhang force-pushed the consumer-metric branch 3 times, most recently from 4cdd0d2 to d289c2f Compare January 28, 2021 12:12
@BewareMyPower
Collaborator

BewareMyPower commented Feb 4, 2021

GlobalKTableTest failed because KafkaStreams failed to start after this change, while KStreamAggregationTest was still able to start KafkaStreams.

It may take some time to figure out why it failed; I'll take a look first.

@BewareMyPower
Collaborator

I found the reason why GlobalKTableTest failed and left two comments, PTAL. A simple fix could be to ignore the empty group id, like:

    public CompletableFuture<Consumer> getGroupConsumers(String groupId, TopicPartition kafkaPartition) {
        // Skip the metrics update when no group id is provided.
        if (groupId == null || groupId.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }
        // ... the rest of the method (looking up the group's offset consumer) stays unchanged.
    }

But I don't think getting an empty group id is proper behavior in the first place.

After resolving the problem, you can rebase to latest to include PR #361 .

@BewareMyPower
Collaborator

In addition, should we avoid public fields?

    public final ConcurrentHashMap<String, String> currentConnectedGroup;
    public final String groupIdStoredPath;

And internal methods like getConsumers should also be private, or at least package-private.
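
For illustration, a minimal sketch of the suggested encapsulation; the holder class, constructor, and accessor names below are assumptions for this example, not KoP's actual code.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder class used only to illustrate restricting field and method visibility.
public class GroupIdState {

    // Keep the state private instead of exposing public fields.
    private final ConcurrentHashMap<String, String> currentConnectedGroup = new ConcurrentHashMap<>();
    private final String groupIdStoredPath;

    GroupIdState(String groupIdStoredPath) {
        this.groupIdStoredPath = groupIdStoredPath;
    }

    // Package-private accessors: visible only to collaborators in the same package.
    ConcurrentHashMap<String, String> currentConnectedGroup() {
        return currentConnectedGroup;
    }

    String groupIdStoredPath() {
        return groupIdStoredPath;
    }
}
```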

@dockerzhang
Contributor Author

@BewareMyPower thanks for your help with this PR. The latest commit fixes the following problems:
1. Rebased onto master.
2. Fixed the null group id issue so that GlobalKTableTest no longer fails.
3. Made currentConnectedGroup and groupIdStoredPath non-public. getConsumers stays public because it is used in KafkaTopicManager.

@BewareMyPower
Collaborator

Please rebase to master again since #368 fixes the broken kop integration tests.

@BewareMyPower BewareMyPower merged commit cda07d7 into streamnative:master Feb 6, 2021
BewareMyPower added a commit that referenced this pull request Apr 8, 2021
### Motivation
Sometimes multiple brokers could subscribe to the same topic partition and fail with `ConsumerBusyException`:

```
WARN  io.streamnative.pulsar.handlers.kop.coordinator.group.OffsetAcker - Error when get consumer for offset ack:
  java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$ConsumerBusyException: Exclusive consumer is already connected
```

This problem was mentioned in #263 before. However, switching the subscription type to shared or failover is not correct, because a partition should have only one consumer at a time. The real problem is that, before updating consumer metrics, it uses `OffsetAcker#getConsumer` to check whether the consumer has been created, and this check is wrong.

First, `OffsetAcker#getConsumer` may create an offset consumer itself, while the actual offset consumer could have been created by another broker.
Second, it only checks whether the consumer's future is done and returns immediately if not, which could cause the metrics update to be skipped.

### Modifications

- Modify the condition check in `KafkaTopicConsumer#getGroupConsumers`. Because the offset consumer could have been created earlier by `GroupCoordinator#handleSyncGroup`, we just check whether the consumer exists, then update the consumer metrics when the future completes.
- Remove failed futures of offset consumers.
- Add consumer stats tests to guarantee the modification works.
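
For illustration, the first two modifications could look roughly like the sketch below; the `offsetConsumers` map, the `entries` argument, and the `updateConsumerStats` helper are hypothetical names, not the actual KoP code.

```java
// Sketch: update metrics only for groups whose offset consumer already exists,
// and drop the cached future if the consumer creation failed.
private void updateGroupConsumerStats(String groupId, List<Entry> entries) {
    CompletableFuture<Consumer> consumerFuture = offsetConsumers.get(groupId);
    if (consumerFuture == null) {
        // No offset consumer was created by GroupCoordinator#handleSyncGroup, so there is nothing to update.
        return;
    }
    consumerFuture.whenComplete((consumer, throwable) -> {
        if (throwable != null) {
            // Remove the failed future so a later attempt can recreate the consumer.
            offsetConsumers.remove(groupId, consumerFuture);
            return;
        }
        updateConsumerStats(entries, consumer);
    });
}
```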
BewareMyPower added a commit to BewareMyPower/kop that referenced this pull request May 11, 2021
BewareMyPower added a commit to BewareMyPower/kop that referenced this pull request May 18, 2021
BewareMyPower added a commit that referenced this pull request May 21, 2021
Fixes #392

### Motivation

#105 introduced the `OffsetAcker` to reuse the backlog stat of the Pulsar subscription. However, after #296 introduced continuous offsets, when the `OffsetAcker` wants to find the `MessageId` or `Position`, which contains the ledger id and entry id of a batch, it needs to get the `PersistentTopic` via `BrokerService#getTopic`. If the current broker is not the owner broker of the partition, `getTopic` fails.

However, Kafka consumers send `OFFSET_COMMIT` requests to the coordinator broker, which is associated with the specified group name. A coordinator broker may no longer own the partition after the Pulsar topic ownership changes, so `getTopic` would fail and could not recover, which affects performance significantly and causes a lot of warning logs to be printed.

### Modifications

- Remove the `OffsetAcker` with the related metrics (`CoordinatorStats`) and tests (`CommitOffsetBacklogTest`)
- Remove the reused consumer stats from #263 with the related tests (`testKafkaConsumerMetrics`)
- Add `testMixedConsumersWithSameSubscription` to cover the case where a Kafka consumer and a Pulsar consumer subscribe to the same topic with the same subscription name, because before this PR the Pulsar consumer was aware of the Kafka consumer's consumed position (committed offset).
- Fix the current tests:
    - `testDeleteTopics`: After this PR, the topic can be deleted even if Kafka consumers are still connected, because no broker-side Consumer is attached to the topic.
    - `testPulsarProduceKafkaConsume2`: It's a test for a Kafka consumer with `enable.auto.commit=true`, but the validation was wrong because even if a consumer hasn't committed any offset, it can still consume from the latest offset.

Regarding the cases where a Kafka or Pulsar producer produces and a Kafka or Pulsar consumer consumes, they are covered by the existing `KafkaRequestTypeTest`. The offset commit case is covered by `DifferentNamespaceTestBase#testCommitOffset`.
Labels
triage/week-50, type/feature (Indicates new functionality)
Development

Successfully merging this pull request may close these issues.

[FEATURE] kop reuse pulsar consumer state and metric
3 participants