Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Fix the race condition of connect timeout task #14823

Merged

Conversation

BewareMyPower
Copy link
Contributor

Fixes #14665

Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the createProducer or
subscribe methods will return ResultConnectError.

ClientConnection::connectTimeoutTask_, which is a shared pointer,
represents the timeout task. However, after ClientConnection::close is
called, the shared pointer will be reset, and the underlying PeriodicTask
object will be released. After that, when stop method is called on the
released PeriodicTask object in the callback (handleTcpConnected), a
segmentation fault will happen.

The root cause is that connectTimeoutTask_ can be accessed in two
threads while one of them could release the memory. See #14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See

def test_connect_timeout(self):
client = pulsar.Client(
service_url="pulsar://192.0.2.1:1234",
connection_timeout_ms=1000, # 1 second
)
t1 = time.time()
try:
producer = client.create_producer("test_connect_timeout")
self.fail("create_producer should not succeed")
except pulsar.ConnectError as expected:
print("expected error: {} when create producer".format(expected))
t2 = time.time()
self.assertGreater(t2 - t1, 1.0)
self.assertLess(t2 - t1, 1.5) # 1.5 seconds is long enough
client.close()

So this PR might also fix #14714.

Modifications

Remove connectTimeoutTask_.reset() in ClientConnection::close. After
that, the connectTimeoutTask_ will always points to the same
PeriodicTask object, whose methods are thread safe.

Verifying this change

Execute the following command

./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10

to runs the testConnectTimeout for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.

Fixes apache#14665

### Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the `createProducer` or
`subscribe` methods will return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer will be reset, and the underlying `PeriodicTask`
object will be released. After that, when `stop` method is called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault will happen.

The root cause is that `connectTimeoutTask_` can be accessed in two
threads while one of them could release the memory. See apache#14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix apache#14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, the `connectTimeoutTask_` will always points to the same
`PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to runs the `testConnectTimeout` for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @BewareMyPower ! Thanks for finding out the reason behind so many flaky C++ and Python tests.

@lhotari lhotari merged commit 0c3aad1 into apache:master Mar 23, 2022
@BewareMyPower BewareMyPower deleted the bewaremypower/cpp-flaky-connect-timeout branch March 24, 2022 02:38
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
Fixes #14665

### Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the `createProducer` or
`subscribe` methods will return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer will be reset, and the underlying `PeriodicTask`
object will be released. After that, when `stop` method is called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault will happen.

The root cause is that `connectTimeoutTask_` can be accessed in two
threads while one of them could release the memory. See #14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix #14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, the `connectTimeoutTask_` will always points to the same
`PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to runs the `testConnectTimeout` for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.

(cherry picked from commit 0c3aad1)
@BewareMyPower
Copy link
Contributor Author

This bug was introduced from #14587, which was not cherry-picked into branch-2.8 and branch-2.9. I'll cherry-pick it first.

I found the flaky test on branch-2.9 is testReferenceCount (#14719), I'll work on this issue soon.

FAILED TESTS (2/274):
      37 ms: ./main ClientTest.testReferenceCount (try #1)
      31 ms: ./main ClientTest.testReferenceCount (try #2)

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Mar 24, 2022
### Motivation

apache#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
apache#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.
@BewareMyPower
Copy link
Contributor Author

I just found this PR brings a regression of #14587 and I've opened another PR (#14834) to fix it.

BewareMyPower added a commit that referenced this pull request Mar 24, 2022
### Motivation

#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
### Motivation

#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.

(cherry picked from commit 54c368e)
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
Fixes #14665

### Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the `createProducer` or
`subscribe` methods will return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer will be reset, and the underlying `PeriodicTask`
object will be released. After that, when `stop` method is called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault will happen.

The root cause is that `connectTimeoutTask_` can be accessed in two
threads while one of them could release the memory. See #14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix #14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, the `connectTimeoutTask_` will always points to the same
`PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to runs the `testConnectTimeout` for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.

(cherry picked from commit 0c3aad1)
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
### Motivation

#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.

(cherry picked from commit 54c368e)
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Mar 24, 2022
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
Fixes #14665

### Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the `createProducer` or
`subscribe` methods will return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer will be reset, and the underlying `PeriodicTask`
object will be released. After that, when `stop` method is called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault will happen.

The root cause is that `connectTimeoutTask_` can be accessed in two
threads while one of them could release the memory. See #14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix #14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, the `connectTimeoutTask_` will always points to the same
`PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to runs the `testConnectTimeout` for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.

(cherry picked from commit 0c3aad1)
BewareMyPower added a commit that referenced this pull request Mar 24, 2022
### Motivation

#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.

(cherry picked from commit 54c368e)
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Mar 24, 2022
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
Fixes apache#14665

### Motivation

In C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed, if the connection cannot be
established in the configured timeout, the callback of the task will be
called to close the connection and then the `createProducer` or
`subscribe` methods will return `ResultConnectError`.

`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer will be reset, and the underlying `PeriodicTask`
object will be released. After that, when `stop` method is called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault will happen.

The root cause is that `connectTimeoutTask_` can be accessed in two
threads while one of them could release the memory. See apache#14665 for more
explanations. This race condition leads to flaky Python tests as well,
because we also have the similar test in Python tests. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221

So this PR might also fix apache#14714.

### Modifications

Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, the `connectTimeoutTask_` will always points to the same
`PeriodicTask` object, whose methods are thread safe.

### Verifying this change

Execute the following command

```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```

to runs the `testConnectTimeout` for 10 times. In my local env, it never
failed, while before applying this patch, it's very easy to fail.
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
### Motivation

apache#14823 fixes the flaky
`testConnectTimeout` but it's also a regression of
apache#14587. Because when the fd limit
is reached, the `connectionTimeoutTask_` won't be initialized with a
non-null value. Calling `stop` method on it directly will cause
segmentation fault.

See
https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

### Modifications

Add the null check for `connectionTimeoutTask_` in `ClientConnection::close`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.8.4 release/2.9.3 release/2.10.1 type/bug The PR fixed a bug or issue reported a bug type/flaky-tests
Projects
None yet
4 participants