[C++] Fix the race condition of connect timeout task #14823
Conversation
Fixes apache#14665

### Motivation

In the C++ client, a connect timeout task is created before each asynchronous connect operation. If the connection cannot be established within the configured timeout, the task's callback closes the connection, and the `createProducer` or `subscribe` methods return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, a shared pointer, represents the timeout task. However, after `ClientConnection::close` is called, the shared pointer is reset and the underlying `PeriodicTask` object is released. When the `stop` method is then called on the released `PeriodicTask` object in the callback (`handleTcpConnected`), a segmentation fault happens. The root cause is that `connectTimeoutTask_` can be accessed from two threads while one of them could release the memory. See apache#14665 for more explanation.

This race condition leads to flaky Python tests as well, because there is a similar test on the Python side. See https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix apache#14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After that, `connectTimeoutTask_` always points to the same `PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command:

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

This runs `testConnectTimeout` 10 times. In my local environment it never failed, while before applying this patch it failed very easily.
Great work @BewareMyPower! Thanks for finding out the reason behind so many flaky C++ and Python tests.
This bug was introduced by #14587, which was not cherry-picked into branch-2.8 and branch-2.9. I'll cherry-pick it first. I found the flaky test on branch-2.9 is
### Motivation

apache#14823 fixes the flaky `testConnectTimeout`, but it is also a regression of apache#14587: when the fd limit is reached, `connectionTimeoutTask_` is not initialized with a non-null value, and calling the `stop` method on it directly causes a segmentation fault. See https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add a null check for `connectionTimeoutTask_` in `ClientConnection::close`.