-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][client] Orphan producer when concurrently calling producer closing and reconnection #23853
base: master
Are you sure you want to change the base?
[fix][client] Orphan producer when concurrently calling producer closing and reconnection #23853
Conversation
The PR title is currently very confusing when it's "Fix closed producers were reverted mistakenly". |
Sorry, it should be I corrected the title |
/pulsarbot rerun-failure-checks |
The title remains hard to understand. I used the technique described here to let an LLM (Claude) suggest a title. The suggestion based on the PR context is "Fix race condition between producer reconnection and closing that causes orphaned producers". Here's the full example of what LLM suggested: https://gist.github.com/lhotari/e63521b8a5694c5a928f740cd5d46331 . |
@poorbarcode Please also make updates in the description where it mentions the very misleading sentence "closed producers were reverted mistakenly". |
Modified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
Nice catch!
@Test | ||
public void testClearPendingMessageWhenCloseAsync() { | ||
PulsarClientImpl client = mock(PulsarClientImpl.class); | ||
Mockito.doReturn(1L).when(client).newProducerId(); | ||
ClientConfigurationData clientConf = new ClientConfigurationData(); | ||
clientConf.setStatsIntervalSeconds(-1); | ||
Mockito.doReturn(clientConf).when(client).getConfiguration(); | ||
Mockito.doReturn(new InstrumentProvider(null)).when(client).instrumentProvider(); | ||
ConnectionPool connectionPool = mock(ConnectionPool.class); | ||
Mockito.doReturn(1).when(connectionPool).genRandomKeyToSelectCon(); | ||
Mockito.doReturn(connectionPool).when(client).getCnxPool(); | ||
HashedWheelTimer timer = mock(HashedWheelTimer.class); | ||
Mockito.doReturn(null).when(timer).newTimeout(Mockito.any(), Mockito.anyLong(), Mockito.any()); | ||
Mockito.doReturn(timer).when(client).timer(); | ||
ProducerConfigurationData producerConf = new ProducerConfigurationData(); | ||
producerConf.setSendTimeoutMs(-1); | ||
ProducerImpl<?> producer = Mockito.spy(new ProducerImpl<>(client, "topicName", producerConf, null, 0, null, null, Optional.empty())); | ||
|
||
// make sure throw exception when send request to broker | ||
ClientCnx clientCnx = mock(ClientCnx.class); | ||
CompletableFuture<ProducerResponse> tCompletableFuture = new CompletableFuture<>(); | ||
tCompletableFuture.completeExceptionally(new PulsarClientException("error")); | ||
when(clientCnx.sendRequestWithId(Mockito.any(), Mockito.anyLong())).thenReturn(tCompletableFuture); | ||
Mockito.doReturn(clientCnx).when(producer).cnx(); | ||
|
||
// run closeAsync and verify | ||
CompletableFuture<Void> voidCompletableFuture = producer.closeAsync(); | ||
verify(producer).closeAndClearPendingMessages(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the test case completely removed? Would it be possible to ensure with a test that the problem described in PR #23761 is addressed?
@@ -192,13 +193,12 @@ public void connectionClosed(ClientCnx cnx, Optional<Long> initialConnectionDela | |||
duringConnect.set(false); | |||
state.client.getCnxPool().releaseConnection(cnx); | |||
if (CLIENT_CNX_UPDATER.compareAndSet(this, cnx, null)) { | |||
if (!isValidStateForReconnection()) { | |||
if (!state.changeToConnecting()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously this would allow the connection to be in Ready
state while switching to Connecting
. Is it intentional that it's no longer allowed?
Motivation
background 1: producer's reconnection[1]
null
closing | closed
connecting
background 2: steps of producer closing[2]
connection
is null:closed
connection
is present:closed
Issue 1: resending messages encountered a recycled pending message
reconnection
close producer
null
closing
null
nowconnecting
close
Issue 2: closed producers were set to
connecting
mistakenly: the steps to reproduce the issue are as followsreconnection
close producer
null
closing
null
nowconnecting
close
You can reproduce the issue by
testConcurrencyReconnectAndClose
logs that encountered the issue 2
Modifications
state
atomically when callingreconnection
, see https://github.com/apache/pulsar/compare/master...poorbarcode:fix/producer_race_condition?expand=1#diff-bcd53f63180847515f1fe1d5b00deb218d023cbfe9cbfade19b44c2babd734ffR194-L196producer. pendingMessages
fails all pending sends when producer.connection is null
to reconnect successfully, which was introduced by [Fix][Client] Fix pending message not complete when closeAsync #23761Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: x