This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Fix KoP will cause Kafka Errors.REQUEST_TIMED_OUT when consume multi TopicPartition in one consumer request #654

Merged

Conversation

wenbingshen
Contributor

Fixes #604

Motivation

When consuming multiple TopicPartitions in one request, tryComplete() in MessageFetchContext.handleFetch() can hit a race condition:
a topicPartition already removed from responseData may be removed again, which causes Errors.REQUEST_TIMED_OUT

The readEntries and CompletableFuture.complete operations for each partition are executed by different threads in the BookKeeperClientWorker OrderedExecutor thread pool. When a partition has data to read, the read and decode steps take an uncertain amount of time, so the race is hard to hit; but when no data is being written to the partition and the consumer has already consumed everything, the consumer keeps issuing empty fetch requests, and in that state the bug reproduces stably.
Stable steps to reproduce:

A single broker holds two partition leaders for one topic;
The topic is not being written to, and the consumer has already consumed the old data;
The consumer client keeps sending Fetch requests to the broker;
Very soon the server returns error_code=7 and the client goes down.
1、One fetch request, two partitions, two threads. The data read is an empty set, so no protocol conversion happens.
2、The BookKeeperClientWorker-OrderedExecutor-25-0 thread adds test_kop_222-1 to responseData while the BookKeeperClientWorker-OrderedExecutor-23-0 thread adds test_kop_222-3 to responseData.
3、Now responseData.size() >= fetchRequest.fetchData().size(); because tryComplete has no synchronization, both threads pass the check and enter complete at the same time.
4、Both threads traverse fetchRequest.fetchData().keySet().forEach concurrently, so the same partition is removed from responseData more than once; the second responseData.remove(topicPartition) returns a null partitionData, which is mapped to the REQUEST_TIMED_OUT error.
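The double-remove in step 4 can be sketched with a minimal, hypothetical model (class and field names are simplified stand-ins, not the real KoP code): whichever thread calls remove second gets null, which is what surfaces as REQUEST_TIMED_OUT.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal model of the race: two threads both pass the size
// check in tryComplete() and both run the removal loop in complete().
public class FetchRaceDemo {
    static final Map<String, String> responseData = new ConcurrentHashMap<>();

    // Mirrors one iteration of the unsynchronized complete() loop:
    // the second remover of the same partition gets null.
    static String completeOnce(String topicPartition) {
        return responseData.remove(topicPartition);
    }

    public static void main(String[] args) {
        responseData.put("test_kop_222-1", "empty-records");
        // Thread A and thread B both entered complete():
        System.out.println(completeOnce("test_kop_222-1") != null); // true
        System.out.println(completeOnce("test_kop_222-1") == null); // true: null -> REQUEST_TIMED_OUT
    }
}
```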

![image](https://user-images.githubusercontent.com/35599757/129462871-37fbfc6f-1603-4da8-9815-95a278195936.png)

Modifications

Add a synchronization lock to MessageFetchContext.tryComplete
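The shape of the fix can be sketched as follows. This is a simplified, self-contained model (SafeFetchContext, addResponse, and the String payloads are illustrative names, not KoP's actual API): the size check and the completion now run under one lock, so only a single thread can observe the "all partitions collected" condition.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the fixed MessageFetchContext (hypothetical names).
public class SafeFetchContext {
    private final Map<String, String> responseData = new ConcurrentHashMap<>();
    private final int expectedPartitions;
    private volatile CompletableFuture<Map<String, String>> resultFuture = new CompletableFuture<>();

    public SafeFetchContext(int expectedPartitions) {
        this.expectedPartitions = expectedPartitions;
    }

    // Called once per partition from different OrderedExecutor threads.
    public void addResponse(String partition, String data) {
        responseData.put(partition, data);
        tryComplete();
    }

    // The guard and complete() now run under one lock, so exactly one
    // thread completes; a later caller sees the recycled state and returns.
    private synchronized void tryComplete() {
        if (resultFuture == null) {
            return; // already completed and recycled
        }
        if (responseData.size() >= expectedPartitions) {
            resultFuture.complete(Map.copyOf(responseData));
            resultFuture = null; // mimic recycling the context
        }
    }

    public boolean isCompleted() {
        return resultFuture == null;
    }

    public static void main(String[] args) throws InterruptedException {
        SafeFetchContext ctx = new SafeFetchContext(2);
        Thread t1 = new Thread(() -> ctx.addResponse("test_kop_222-1", "empty"));
        Thread t2 = new Thread(() -> ctx.addResponse("test_kop_222-3", "empty"));
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(ctx.isCompleted()); // prints true
    }
}
```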

…Context.handleFetch(), tryComplete() may enter race condition

When one topicPartition removed from responseData, maybe removed again, which will cause Errors.REQUEST_TIMED_OUT
@wenbingshen
Contributor Author

@BewareMyPower Hi, Yunze brother, what do you think of this issue? PTAL. :)

Collaborator

@BewareMyPower BewareMyPower left a comment


Great find. Just leaving some suggestions.

  1. We use TestNG in KoP, so please replace JUnit with TestNG.
  2. MessageFetchContextTest doesn't rely on KopProtocolHandlerTestBase, so it should be put under kafka-impl/tests instead of the tests/ directory.
  3. MessageFetchContextTest doesn't test MessageFetchContext directly; it looks like a simulation. I.e., if MessageFetchContext were modified in the future, this test would still pass, so it couldn't protect the code.

@wenbingshen
Contributor Author

@BewareMyPower Thanks very much for your review. I will switch it to TestNG and move it to the correct directory, and I will consider modifying the test cases to protect the code. Thanks. :)

@wenbingshen
Contributor Author

@BewareMyPower Hi, Yunze brother, I have addressed your comment. PTAL :)

@BewareMyPower
Collaborator

It looks like your test cannot pass, see kop mvn build check and kafka-impl test / build (pull_request).

[INFO] Running io.streamnative.pulsar.handlers.kop.MessageFetchContextTest
Error: The operation was canceled.

@BewareMyPower
Collaborator

If it's not convenient to add the test, it will be acceptable to not add the test.

The race condition happens when the complete() method is called by two threads. It's because the responseData.size() check and responseData.put() are not one atomic operation.

@wenbingshen
Contributor Author

@BewareMyPower
Take two partitions as an example. After tryComplete adds the synchronization lock:

The first situation:
After the first thread's responseData.put, the second thread immediately does its own responseData.put. The first thread then enters tryComplete, the condition (responseData.size() >= fetchRequest.fetchData().size()) is satisfied, and it enters complete, which recycles many object resources and sets responseData to null. When the second thread then enters tryComplete, it hits a null pointer exception.

The second case:
After the first thread's responseData.put, the first thread enters tryComplete first and the condition (responseData.size() >= fetchRequest.fetchData().size()) is not met, so it does not enter complete. Only then does the second thread do its responseData.put; now the condition is met, complete executes, and everything runs normally.

Therefore, we need a null check on responseData in tryComplete.
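The first situation can be sketched with a small, hypothetical model (RecycleGuardDemo and its members are illustrative names, not the real KoP classes): once the winning thread has recycled the context, a late caller must hit a guard instead of a NullPointerException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the recycle race: thread 1 completes and recycles
// the context (responseData becomes null) before thread 2 re-enters
// tryComplete, so the guard must check for the recycled state first.
public class RecycleGuardDemo {
    Map<String, String> responseData = new HashMap<>();

    // What complete() effectively does when it recycles the context.
    void recycle() {
        responseData = null;
    }

    // Returns false instead of throwing when the context was already recycled.
    synchronized boolean tryComplete(int expectedPartitions) {
        if (responseData == null) {
            return false; // already completed by another thread
        }
        return responseData.size() >= expectedPartitions;
    }

    public static void main(String[] args) {
        RecycleGuardDemo ctx = new RecycleGuardDemo();
        ctx.recycle();
        System.out.println(ctx.tryComplete(0)); // prints false: guard, not NPE
    }
}
```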


…micBoolean to avoid thread blocking & fix test failed
@BewareMyPower
Collaborator

You're right, I think

        if (resultFuture == null) {
            // the context has been recycled
            return;
        }

in the complete() method should be moved to tryComplete().

@wenbingshen
Contributor Author

@BewareMyPower Regarding the Codacy Static Code Analysis error, what should I do? Please teach me. :)

@wenbingshen
Contributor Author

@BewareMyPower All checks have passed, PTAL :)

@BewareMyPower
Collaborator

The test can be simplified. MessageFetchContextTest compares MessageFetchContext against MessageFetchContextTest itself: when isSafe is false, addErrorPartitionResponse only operates on the fields of MessageFetchContextTest, so that behavior is not related to any production code.

IMO, only testHandleFetchSafe is required, because if MessageFetchContext is not thread safe, this test will fail. We don't need to add a contrast implementation to show what the wrong implementation is.
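The "complete exactly once" property such a test should exercise can be sketched in plain Java (in KoP the real test is written with TestNG; this class, its names, and the thread count are illustrative assumptions): many threads race on the check-then-act, and a thread-safe guard must let exactly one of them complete.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stress check: N threads race to "complete" a shared context;
// the synchronized check-then-act must succeed exactly once.
public class CompleteOnceStress {
    public static int completions(int threads) {
        AtomicInteger count = new AtomicInteger();
        Object lock = new Object();
        boolean[] done = {false};
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch finish = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    start.await(); // line all threads up before racing
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                synchronized (lock) { // the fix: guard the check-then-act
                    if (!done[0]) {
                        done[0] = true;
                        count.incrementAndGet();
                    }
                }
                finish.countDown();
            }).start();
        }
        start.countDown();
        try {
            finish.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(completions(8)); // prints 1
    }
}
```

A TestNG version would carry the same body in an @Test method, optionally with invocationCount to repeat the race many times.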

@wenbingshen
Contributor Author

> The test can be simplified. MessageFetchContextTest compares MessageFetchContext and MessageFetchContextTest itself. When isSafe is false, addErrorPartitionResponse only operates on the fields of MessageFetchContextTest, this behavior is not related to any code.
>
> IMO, only testHandleFetchSafe is required because if MessageFetchContext is not thread safe, this test will fail. We don't need to add a contrast implementation to show what the wrong implementation is.

@BewareMyPower Gotcha! I have addressed your comment. PTAL

@BewareMyPower
Collaborator

I still have questions about the test. When I modified the tryComplete() back to the original implementation:

    private void tryComplete() {
        if (responseData.size() >= fetchRequest.fetchData().size()) {
            complete();
        }
    }

The test can still pass in my local env.

@wenbingshen
Contributor Author

@BewareMyPower I set invocationCount = 5000 locally and reproduced REQUEST_TIMED_OUT once and an empty this.responseData three times. I didn't set invocationCount = 5000 in the PR because I think that if the code has concurrency problems, it will definitely surface as a flaky test in future code submissions. What do you think should be done?


@BewareMyPower
Collaborator

Okay, it makes sense.

@BewareMyPower BewareMyPower merged commit df436e3 into streamnative:master Aug 16, 2021
BewareMyPower pushed a commit that referenced this pull request Aug 19, 2021
…TopicPartition in one consumer request (#654)
Successfully merging this pull request may close these issues.

[BUG] KoP will cause Kafka Errors.REQUEST_TIMED_OUT when consume multi TopicPartition in one comsume request