[C++] Fix the race condition of connect timeout task #14823

BewareMyPower · 2022-03-23T17:50:34Z

Motivation

In the C++ client, a connect timeout task is created each time before an
asynchronous connect operation is performed. If the connection cannot be
established within the configured timeout, the task's callback is called
to close the connection, and the createProducer or subscribe methods
then return ResultConnectError.

ClientConnection::connectTimeoutTask_, which is a shared pointer,
represents the timeout task. However, after ClientConnection::close is
called, the shared pointer is reset and the underlying PeriodicTask
object is released. After that, when the stop method is called on the
released PeriodicTask object in the callback (handleTcpConnected), a
segmentation fault happens.

The root cause is that connectTimeoutTask_ can be accessed from two
threads while one of them could release the memory. See #14665 for more
explanation. This race condition leads to flaky Python tests as well,
because the Python tests include a similar test. See
pulsar/pulsar-client-cpp/python/pulsar_test.py

Lines 1207 to 1221 in f7cbc1e

    
    def test_connect_timeout(self):
        client = pulsar.Client(
            service_url="pulsar://192.0.2.1:1234",
            connection_timeout_ms=1000,  # 1 second
        )
        t1 = time.time()
        try:
            producer = client.create_producer("test_connect_timeout")
            self.fail("create_producer should not succeed")
        except pulsar.ConnectError as expected:
            print("expected error: {} when create producer".format(expected))
        t2 = time.time()
        self.assertGreater(t2 - t1, 1.0)
        self.assertLess(t2 - t1, 1.5)  # 1.5 seconds is long enough
        client.close()

So this PR might also fix #14714.

Modifications

Remove connectTimeoutTask_.reset() in ClientConnection::close. After
that, connectTimeoutTask_ will always point to the same PeriodicTask
object, whose methods are thread safe.
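
The idea behind this change can be sketched as follows. This is only an illustration under assumptions, not the actual Pulsar code: TimeoutTask and Connection are hypothetical stand-ins for PeriodicTask and ClientConnection, and the atomic flag merely models "stop is thread safe". The point it shows is that when the shared pointer is assigned once and never reset, close and handleTcpConnected may both call stop from different threads without either one touching a released object.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical stand-in for PeriodicTask: it models only a thread-safe,
// idempotent stop(), not the real class.
class TimeoutTask {
   public:
    void start() { running_.store(true); }
    // Safe to call from any thread, any number of times.
    void stop() { running_.store(false); }
    bool running() const { return running_.load(); }

   private:
    std::atomic<bool> running_{false};
};

// Hypothetical stand-in for ClientConnection.
class Connection {
   public:
    Connection() : timeoutTask_(std::make_shared<TimeoutTask>()) {}

    void startConnect() { timeoutTask_->start(); }

    // Timeout path: stops the task but does NOT reset the pointer, so the
    // connect callback can never observe a released object.
    void close() { timeoutTask_->stop(); }

    // I/O path: runs when the TCP connect completes.
    void handleTcpConnected() { timeoutTask_->stop(); }

    bool timeoutTaskRunning() const { return timeoutTask_->running(); }

   private:
    // Assigned once in the constructor and never reset afterwards.
    const std::shared_ptr<TimeoutTask> timeoutTask_;
};

// Calls close() and handleTcpConnected() concurrently and returns true if
// the task ends up stopped.
bool closeAndConnectConcurrently() {
    Connection conn;
    conn.startConnect();
    assert(conn.timeoutTaskRunning());
    std::thread closer([&conn] { conn.close(); });
    std::thread connector([&conn] { conn.handleTcpConnected(); });
    closer.join();
    connector.join();
    return !conn.timeoutTaskRunning();
}
```

Declaring the pointer const in this sketch makes the "never reset" rule something the compiler enforces; the real class may not be able to do that if the task is created later than the constructor.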

Verifying this change

Execute the following command

./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10

to run testConnectTimeout 10 times. In my local environment it never
failed, whereas before applying this patch it failed very easily.

lhotari

Great work @BewareMyPower ! Thanks for finding out the reason behind so many flaky C++ and Python tests.

BewareMyPower · 2022-03-24T02:49:54Z

This bug was introduced by #14587, which was not cherry-picked into branch-2.8 and branch-2.9. I'll cherry-pick it first.

I found that the flaky test on branch-2.9 is testReferenceCount (#14719). I'll work on this issue soon.

FAILED TESTS (2/274):
      37 ms: ./main ClientTest.testReferenceCount (try #1)
      31 ms: ./main ClientTest.testReferenceCount (try #2)

BewareMyPower · 2022-03-24T03:25:43Z

I just found that this PR introduces a regression of #14587, and I've opened another PR (#14834) to fix it.

Motivation

#14823 fixes the flaky testConnectTimeout, but it also regresses #14587: when the fd limit is reached, connectTimeoutTask_ is not initialized with a non-null value, so calling the stop method on it directly causes a segmentation fault. See https://github.com/apache/pulsar/blob/0fe921f32cefe7648ca428cd9861f9163c69767d/pulsar-client-cpp/lib/ClientConnection.cc#L178-L185

Modifications

Add the null check for connectTimeoutTask_ in ClientConnection::close.
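
The null check described for the follow-up PR might look roughly like the sketch below. It is written under assumptions and is not the actual patch: TimeoutTask and Connection are hypothetical stand-ins, and the socketCreated constructor flag is invented here to model the case where socket creation fails (for example when the fd limit is reached) and the task is never created.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for PeriodicTask.
struct TimeoutTask {
    bool stopped = false;
    void stop() { stopped = true; }
};

// Hypothetical stand-in for ClientConnection. `socketCreated` models whether
// socket creation succeeded; it is an assumption made for this sketch.
class Connection {
   public:
    explicit Connection(bool socketCreated) {
        if (socketCreated) {
            timeoutTask_ = std::make_shared<TimeoutTask>();
        }
        // Otherwise timeoutTask_ stays null.
    }

    // Returns true if a task existed and was stopped, false if there was
    // no task to stop.
    bool close() {
        if (timeoutTask_) {  // the null check
            timeoutTask_->stop();
            return true;
        }
        return false;
    }

   private:
    std::shared_ptr<TimeoutTask> timeoutTask_;
};

// Usage: close() is safe whether or not the task was ever created.
void demo() {
    Connection connected(true);
    assert(connected.close());
    Connection socketFailed(false);
    assert(!socketFailed.close());
}
```

Without the check, the second case would dereference a null shared pointer, which is the segmentation fault described above.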

BewareMyPower added component/client-c++ type/flaky-tests doc-not-needed Your PR changes do not impact docs release/2.9.3 release/2.8.4 release/2.10.1 labels Mar 23, 2022

BewareMyPower requested review from merlimat, lhotari, rdhabalia, jiazhai and aahmed-se March 23, 2022 17:50

BewareMyPower self-assigned this Mar 23, 2022

BewareMyPower mentioned this pull request Mar 23, 2022

cpp-tests job is flaky: python pulsar_test.py fails with Segmentation fault #14714

Closed

merlimat added this to the 2.11.0 milestone Mar 23, 2022

merlimat added the type/bug The PR fixed a bug or issue reported a bug label Mar 23, 2022

merlimat approved these changes Mar 23, 2022

lhotari approved these changes Mar 23, 2022

michaeljmarshall approved these changes Mar 23, 2022

lhotari merged commit 0c3aad1 into apache:master Mar 23, 2022

BewareMyPower deleted the bewaremypower/cpp-flaky-connect-timeout branch March 24, 2022 02:38

BewareMyPower added cherry-picked/branch-2.10 and removed release/2.9.3 release/2.8.4 labels Mar 24, 2022

BewareMyPower mentioned this pull request Mar 24, 2022

[C++] Fix segmentation fault when creating socket failed #14834

Merged

BewareMyPower added release/2.9.3 release/2.8.4 labels Mar 24, 2022

BewareMyPower added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Mar 24, 2022

BewareMyPower added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Mar 24, 2022

lhotari mentioned this pull request Mar 28, 2022

lh github actions workflow refactoring 2022 artifacts lhotari/pulsar#60

Closed
