Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition of OpSendMsgQueue when publishing messages #14231

Conversation

codelipenghui
Copy link
Contributor

@codelipenghui codelipenghui commented Feb 11, 2022

Background

The issue is a race condition related to #11884 and #12704

When call getPendingQueueSize() which will call pendingMessages.foreach() but without a lock protection. After the send receipt come back, peek from the pendingMessages but which might get null here during the foreachprocess. I have added some logs to confirm the issue, and to reproduce the issue better to use use.statsInterval(1, TimeUnit.SECONDS)` to let the stats update more frequently.

Here are the logs which can explain the bug:

2022-02-11T12:40:00,571+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:00,572+0800 [pulsar-timer-5-1] INFO  org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - [public/default/t_topic] [standalone-0-3] Pending messages: 1 --- Publish throughput: 0.99 msg/s --- 0.00 Mbit/s --- Latency: med: 0.000 ms - 95pct: 0.000 ms - 99pct: 0.000 ms - 99.9pct: 0.000 ms - max: -∞ ms --- Ack received rate: 0.00 ack/s --- Failed messages: 0
2022-02-11T12:40:01,564+0800 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - Add message to OpSendMsgQueue.postponedOpSendMgs, 1, 182801
2022-02-11T12:40:01,566+0800 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [public/default/s_topic] [standalone-0-6] Got ack for timed out msg 182801 - 182900
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - Put the opsend back to deque of sequenceID 182801

From the logs, you can see a message with sequence ID 182801 add the OpSendMsgQueue first, but after the producer received the receipt, the log shows Got ack for timed out msg which means got null when peeking messages, and after, the message add back to the internal queue, but the producer side is blocked at this time.

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

@codelipenghui codelipenghui self-assigned this Feb 11, 2022
@codelipenghui codelipenghui added this to the 2.10.0 milestone Feb 11, 2022
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Feb 11, 2022
@codelipenghui codelipenghui added the release/blocker Indicate the PR or issue that should block the release until it gets resolved label Feb 11, 2022
@codelipenghui codelipenghui force-pushed the penghui/fix-pending-message-race-condition branch from 10b88f5 to dca2efa Compare February 11, 2022 07:50
@codelipenghui codelipenghui changed the title Fix race condtion of OpSendMsgQueue which publishing messages Fix race condition of OpSendMsgQueue when publishing messages Feb 11, 2022
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queue implementation that backs pendingMessages isn't thread safe.


Isn't the correct solution to make getPendingQueueSize method synchronized? There's a similar thread safety issue in getDelayInMillis method which should be synchronized.

@codelipenghui
Copy link
Contributor Author

codelipenghui commented Feb 11, 2022

@lhotari Yes, I have tried to synchronize the getPendingQueueSize, but looks like we don't have to synchronize here because we can accept the incorrect pending messages at a moment? so we can avoid the print stats task will hold the lock if there are many pending messages.

And yes, getDelayInMillis also have the thread safety issue, but looks only affects the accuracy of replication delay.

@lhotari
Copy link
Member

lhotari commented Feb 11, 2022

looks like we don't have to synchronize here because we can accept the incorrect pending messages at a moment?

That is risky. It's undefined what the behavior of ArrayDeque would be if it's accessed without proper synchronization. Usually it would be "just" a data consistency issue. In the case of java.util.HashMap, concurrent modifications and access could lead to an infinite loop before Java 8.

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use the synchronized keyword to ensure correctness. If we don't synchronize our access and we continue to use the forEach method, I think we need to modify the implementation of OpSendMsgQueue#forEach. As it is currently implemented, concurrent calls to forEach could lead to different views of the forEachDepth variable, which might lead to a postponedOpSendMgs that is never removed.

If we really want to avoid adding the synchronized keyword, we could use a LongAdder in the OpSendMsgQueue. Then, anytime an object is added or removed, we can increment/decrement accordingly, and the getPendingQueueSize will return the sum of the LongAdder.

One advantage of using the synchronized keyword is that it only affects producers that are running with stats logging or applications that are calling the getPendingQueueSize method.

@merlimat
Copy link
Contributor

If we really want to avoid adding the synchronized keyword, we could use a LongAdder in the OpSendMsgQueue. Then, anytime an object is added or removed, we can increment/decrement accordingly, and the getPendingQueueSize will return the sum of the LongAdder.

I agree that it would be better to have a messageCount method in OpSendMsgQueue, so that we can encapsulate the logic in there (and we don't have to deal with batching enabled/disabled cases either).

@merlimat
Copy link
Contributor

@codelipenghui @lhotari @michaeljmarshall Changed the logic to use counter. PTAL.

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1743,7 +1748,7 @@ private void resendMessages(ClientCnx cnx, long expectedEpoch) {
cnx.channel().close();
return;
}
int messagesToResend = pendingMessages.size();
long messagesToResend = pendingMessages.messagesCount();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: technically, users might notice a change in the Re-Sending log line below when running with batch messaging enabled. I don't think this is a problem, but I wanted to mention the behavior change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was a left-over. fixed

Copy link
Member

@michaeljmarshall michaeljmarshall Feb 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I was too vague. I just meant that the log line's content will be different. Instead of logging X batch messages, it'll log the total number of messages (regardless of batching). This is fine, but it's a change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though the message is more accurate now compared to before :)

@codelipenghui
Copy link
Contributor Author

LGTM

@codelipenghui
Copy link
Contributor Author

@michaeljmarshall We don't need this commit to 2.8.3, #12704 is only in branch-2.9. 2.8.x not affected by this problem

@codelipenghui
Copy link
Contributor Author

@lhotari Please help review again, thanks.

@merlimat merlimat merged commit 2c9af46 into apache:master Feb 12, 2022
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elegant solution, very good!

merlimat added a commit that referenced this pull request Feb 12, 2022
* Add synchronized for getPendingQueueSize();

* Use iterator instead.

* Use counter to keep track of messages count

* Changed to int

Co-authored-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat added cherry-picked/branch-2.9 Archived: 2.9 is end of life and removed release/2.8.3 labels Feb 12, 2022
@codelipenghui codelipenghui deleted the penghui/fix-pending-message-race-condition branch February 13, 2022 10:22
@michaeljmarshall
Copy link
Member

@michaeljmarshall We don't need this commit to 2.8.3, #12704 is only in branch-2.9. 2.8.x not affected by this problem

@codelipenghui - I agree that the specific bug that led to this fix is not in 2.8. However, in 2.8, we access ArrayDequeue from multiple threads. The data structure is explicitly labeled as not being thread safe. The solution in this PR would remove the unsafe access in branch-2.8. As such, I think we should cherry-pick to 2.8.

@michaeljmarshall
Copy link
Member

@codelipenghui - I discussed this with @lhotari offline, and he pointed out that we've used the current paradigm for a while. I will not cherry pick this commit to branch-2.8. Please ignore my above comment.

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
…#14231)

* Add synchronized for getPendingQueueSize();

* Use iterator instead.

* Use counter to keep track of messages count

* Changed to int

Co-authored-by: Matteo Merli <mmerli@apache.org>
@codelipenghui codelipenghui restored the penghui/fix-pending-message-race-condition branch May 17, 2022 01:20
@codelipenghui codelipenghui deleted the penghui/fix-pending-message-race-condition branch May 17, 2022 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/blocker Indicate the PR or issue that should block the release until it gets resolved release/2.9.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants