[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647

codelipenghui · 2023-06-26T04:20:40Z

Motivation

This issue has occurred with a topic that has many producers, where the P99 publish latency of the broker will increase to hundreds of milliseconds when many producers connect or disconnect from the topic. This level of latency is unacceptable for a messaging system.

In this case, each producer add or remove operation goes to MessageDeduplication to update a map for inactive producers, regardless of whether message deduplication is enabled or disabled. My initial impression is that if message deduplication is disabled, it should not go to MessageDeduplication. However, users can enable message deduplication for an active topic, which may be the reason why this is happening. @merlimat, do you have any additional context on this matter? We may also need to find a solution to avoid any operations to MessageDeduplication if message deduplication is disabled in the future.

This PR provides a fix without introducing any changes in behavior, which will give us more confidence to cherry-pick it to release branches.

broker_lock_0621_1.html.txt

Modifications

Use ConcurrentHashMap instead of synchronized HashMap to reduce the contention between IO threads.

Here is the benchmark result from benchmark test
CLHM_CON (ConcurrentOpenHashMap)
CHM_CON(ConcurrentHashMap)
HM_CON(SynchronizedHashMap)

CLHM_CON::Thread-3::put: 422ms
CLHM_CON::Thread-5::put: 422ms
CLHM_CON::Thread-4::put: 422ms
CLHM_CON::Thread-0::put: 423ms
CLHM_CON::Thread-2::put: 422ms
CLHM_CON::Thread-1::put: 423ms
CLHM_CON::Thread-0::get: 786ms
CLHM_CON::Thread-1::get: 805ms
CLHM_CON::Thread-4::get: 839ms
CLHM_CON::Thread-5::get: 840ms
CLHM_CON::Thread-3::get: 854ms
CLHM_CON::Thread-2::get: 855ms
CLHM_CON::Thread-0::remove: 112ms
CLHM_CON::Thread-1::remove: 123ms
CLHM_CON::Thread-4::remove: 97ms
CLHM_CON::Thread-5::remove: 97ms
CLHM_CON::Thread-2::remove: 84ms
CLHM_CON::Thread-3::remove: 85ms
CLHM_CON: 1367 ms
CHM_CON::Thread-7::put: 33ms
CHM_CON::Thread-6::put: 35ms
CHM_CON::Thread-9::put: 37ms
CHM_CON::Thread-8::put: 38ms
CHM_CON::Thread-11::put: 38ms
CHM_CON::Thread-10::put: 39ms
CHM_CON::Thread-9::get: 630ms
CHM_CON::Thread-10::get: 629ms
CHM_CON::Thread-7::get: 635ms
CHM_CON::Thread-11::get: 630ms
CHM_CON::Thread-8::get: 631ms
CHM_CON::Thread-6::get: 635ms
CHM_CON::Thread-10::remove: 11ms
CHM_CON::Thread-6::remove: 9ms
CHM_CON::Thread-7::remove: 11ms
CHM_CON::Thread-8::remove: 10ms
CHM_CON::Thread-11::remove: 11ms
CHM_CON::Thread-9::remove: 12ms
CHM_CON: 680 ms
HM_CON::Thread-16::put: 643ms
HM_CON::Thread-15::put: 664ms
HM_CON::Thread-14::put: 670ms
HM_CON::Thread-13::put: 674ms
HM_CON::Thread-17::put: 687ms
HM_CON::Thread-12::put: 715ms
HM_CON::Thread-16::get: 61876ms
HM_CON::Thread-15::get: 62214ms
HM_CON::Thread-13::get: 62312ms
HM_CON::Thread-17::get: 62309ms
HM_CON::Thread-14::get: 62450ms
HM_CON::Thread-16::remove: 638ms
HM_CON::Thread-15::remove: 555ms
HM_CON::Thread-13::remove: 521ms
HM_CON::Thread-17::remove: 522ms
HM_CON::Thread-14::remove: 411ms
HM_CON::Thread-12::get: 62816ms
HM_CON::Thread-12::remove: 8ms
HM_CON: 63538 ms

Verifying this change

The existing tests can cover the new changes.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

codelipenghui · 2023-06-26T04:21:57Z

/pulsarbot run-failure-checks

lhotari

LGTM

mattisonchao · 2023-06-26T07:41:23Z

@codelipenghui

https://github.com/apache/pulsar/actions/runs/5373765736/jobs/9748615177?pr=20647 Please help check this failed job.

eolivelli · 2023-06-26T07:46:09Z

...r-broker/src/main/java/org/apache/pulsar/broker/service/persistent/MessageDeduplication.java

@@ -455,23 +455,17 @@ public synchronized void producerRemoved(String producerName) {
    public synchronized void purgeInactiveProducers() {


do we need to keep this 'synchronized' ?

I think we don't need it, and I also want to remove it in this PR initially. But considering it is not related to the publish latency spike issue, maybe we'd better just keep this PR only for the publish latency spike issue. And I can create the following PR to remove this synchronized, but no need to cherry-pick to release branches.

codelipenghui · 2023-06-26T11:17:53Z

After doing a benchmark test
We should use ConcurrentHashMap here, it performed super better than ConcurrentOpenHashMap and synchronized HashMap for put/remove operations.

CLHM_CON::Thread-3::put: 422ms
CLHM_CON::Thread-5::put: 422ms
CLHM_CON::Thread-4::put: 422ms
CLHM_CON::Thread-0::put: 423ms
CLHM_CON::Thread-2::put: 422ms
CLHM_CON::Thread-1::put: 423ms
CLHM_CON::Thread-0::get: 786ms
CLHM_CON::Thread-1::get: 805ms
CLHM_CON::Thread-4::get: 839ms
CLHM_CON::Thread-5::get: 840ms
CLHM_CON::Thread-3::get: 854ms
CLHM_CON::Thread-2::get: 855ms
CLHM_CON::Thread-0::remove: 112ms
CLHM_CON::Thread-1::remove: 123ms
CLHM_CON::Thread-4::remove: 97ms
CLHM_CON::Thread-5::remove: 97ms
CLHM_CON::Thread-2::remove: 84ms
CLHM_CON::Thread-3::remove: 85ms
CLHM_CON: 1367 ms
CHM_CON::Thread-7::put: 33ms
CHM_CON::Thread-6::put: 35ms
CHM_CON::Thread-9::put: 37ms
CHM_CON::Thread-8::put: 38ms
CHM_CON::Thread-11::put: 38ms
CHM_CON::Thread-10::put: 39ms
CHM_CON::Thread-9::get: 630ms
CHM_CON::Thread-10::get: 629ms
CHM_CON::Thread-7::get: 635ms
CHM_CON::Thread-11::get: 630ms
CHM_CON::Thread-8::get: 631ms
CHM_CON::Thread-6::get: 635ms
CHM_CON::Thread-10::remove: 11ms
CHM_CON::Thread-6::remove: 9ms
CHM_CON::Thread-7::remove: 11ms
CHM_CON::Thread-8::remove: 10ms
CHM_CON::Thread-11::remove: 11ms
CHM_CON::Thread-9::remove: 12ms
CHM_CON: 680 ms
HM_CON::Thread-16::put: 643ms
HM_CON::Thread-15::put: 664ms
HM_CON::Thread-14::put: 670ms
HM_CON::Thread-13::put: 674ms
HM_CON::Thread-17::put: 687ms
HM_CON::Thread-12::put: 715ms
HM_CON::Thread-16::get: 61876ms
HM_CON::Thread-15::get: 62214ms
HM_CON::Thread-13::get: 62312ms
HM_CON::Thread-17::get: 62309ms
HM_CON::Thread-14::get: 62450ms
HM_CON::Thread-16::remove: 638ms
HM_CON::Thread-15::remove: 555ms
HM_CON::Thread-13::remove: 521ms
HM_CON::Thread-17::remove: 522ms
HM_CON::Thread-14::remove: 411ms
HM_CON::Thread-12::get: 62816ms
HM_CON::Thread-12::remove: 8ms
HM_CON: 63538 ms

lhotari · 2023-06-26T11:39:57Z

@codelipenghui

https://github.com/apache/pulsar/actions/runs/5373765736/jobs/9748615177?pr=20647 Please help check this failed job.

@codelipenghui exception is

  Error:  Tests run: 4, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 0.665 s <<< FAILURE! - in org.apache.pulsar.broker.service.persistent.MessageDuplicationTest
  Error:  org.apache.pulsar.broker.service.persistent.MessageDuplicationTest.testInactiveProducerRemove  Time elapsed: 0.056 s  <<< FAILURE!
  java.lang.ClassCastException: class org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap cannot be cast to class java.util.Map (org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap is in unnamed module of loader 'app'; java.util.Map is in module java.base of loader 'bootstrap')
  	at org.apache.pulsar.broker.service.persistent.MessageDuplicationTest.testInactiveProducerRemove(MessageDuplicationTest.java:177)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:833)

mattisonchao · 2023-06-26T11:45:46Z

After doing a benchmark test
We should use ConcurrentHashMap here, it performed super better than ConcurrentOpenHashMap and synchronized HashMap for put/remove operations.

@codelipenghui
You have to avoid using remove in for-each If you chose ConcurrentHashMap . because there is a .ConcurrentModificationException

codelipenghui · 2023-06-27T02:03:39Z

You have to avoid using remove in for-each If you chose ConcurrentHashMap . because there is a .ConcurrentModificationException

@mattisonchao I need to use iterator if the ConcurrentHashMap is the final decision.

eolivelli

+1

hangc0276

Nice catch!

…ssageDeduplication

codecov-commenter · 2023-06-29T02:37:36Z

Codecov Report

Merging #20647 (b0cebfd) into master (2b01f83) will increase coverage by 39.54%.
The diff coverage is 100.00%.

@@              Coverage Diff              @@
##             master   #20647       +/-   ##
=============================================
+ Coverage     33.58%   73.12%   +39.54%     
- Complexity    12127    32016    +19889     
=============================================
  Files          1613     1867      +254     
  Lines        126241   138683    +12442     
  Branches      13770    15240     +1470     
=============================================
+ Hits          42396   101410    +59014     
+ Misses        78331    29249    -49082     
- Partials       5514     8024     +2510

Flag	Coverage Δ
inttests	`24.18% <50.00%> (-0.01%)`	⬇️
systests	`25.02% <50.00%> (?)`
unittests	`72.36% <100.00%> (+40.32%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...roker/service/persistent/MessageDeduplication.java	`82.53% <100.00%> (+42.35%)`	⬆️

... and 1520 files with indirect coverage changes

…ssageDeduplication (#20647)

…ssageDeduplication (#20647) (cherry picked from commit 31c5b4d)

…ssageDeduplication (apache#20647) (cherry picked from commit fa68bf3)

codelipenghui self-assigned this Jun 26, 2023

codelipenghui added this to the 3.1.0 milestone Jun 26, 2023

codelipenghui added release/2.10.5 release/2.11.2 release/3.0.2 labels Jun 26, 2023

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 26, 2023

codelipenghui added area/broker ready-to-test and removed doc-not-needed Your PR changes do not impact docs labels Jun 26, 2023

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 26, 2023

codelipenghui requested review from merlimat, Technoboy-, shibd, mattisonchao, hangc0276 and gaoran10 June 26, 2023 04:22

lhotari approved these changes Jun 26, 2023

View reviewed changes

mattisonchao approved these changes Jun 26, 2023

View reviewed changes

eolivelli requested changes Jun 26, 2023

View reviewed changes

eolivelli approved these changes Jun 28, 2023

View reviewed changes

shibd approved these changes Jun 28, 2023

View reviewed changes

coderzc approved these changes Jun 28, 2023

View reviewed changes

hangc0276 approved these changes Jun 28, 2023

View reviewed changes

[fix][broker] Fix the publish latency spike from the contention of Me…

4c6ecaa

…ssageDeduplication

codelipenghui added 5 commits June 29, 2023 08:55

Fix tests

b6f2132

Fix tests

ff4d55b

Use ConcurrentHashMap

0334af4

Use ConcurrentHashMap

8d98f76

Use ConcurrentHashMap

b0cebfd

codelipenghui force-pushed the penghui/contention-dedup branch from 618e055 to b0cebfd Compare June 29, 2023 00:56

Technoboy- merged commit 31c5b4d into apache:master Jun 29, 2023

Technoboy- pushed a commit that referenced this pull request Jun 29, 2023

[fix][broker] Fix the publish latency spike from the contention of Me…

fa68bf3

…ssageDeduplication (#20647)

Technoboy- added the cherry-picked/branch-2.10 label Jun 29, 2023

Technoboy- pushed a commit that referenced this pull request Jun 29, 2023

[fix][broker] Fix the publish latency spike from the contention of Me…

90d5820

…ssageDeduplication (#20647)

Technoboy- added the cherry-picked/branch-2.11 label Jun 29, 2023

codelipenghui deleted the penghui/contention-dedup branch June 30, 2023 10:34

codelipenghui added a commit that referenced this pull request Jun 30, 2023

[fix][broker] Fix the publish latency spike from the contention of Me…

1280225

…ssageDeduplication (#20647) (cherry picked from commit 31c5b4d)

codelipenghui added cherry-picked/branch-3.0 release/3.0.1 and removed release/3.0.2 labels Jun 30, 2023

nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 3, 2023

[fix][broker] Fix the publish latency spike from the contention of Me…

34401aa

…ssageDeduplication (apache#20647) (cherry picked from commit fa68bf3)

codelipenghui mentioned this pull request Sep 5, 2023

[improve] [doc] Add doc for customized util class #21110

Merged

15 tasks

BewareMyPower mentioned this pull request Aug 23, 2024

Replace the ConcurrentOpenHashMap as much as possible #23215

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647

[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647

codelipenghui commented Jun 26, 2023 •

edited

Loading

codelipenghui commented Jun 26, 2023

lhotari left a comment

mattisonchao commented Jun 26, 2023

eolivelli Jun 26, 2023

codelipenghui Jun 26, 2023

codelipenghui commented Jun 26, 2023

lhotari commented Jun 26, 2023

mattisonchao commented Jun 26, 2023 •

edited

Loading

codelipenghui commented Jun 27, 2023

eolivelli left a comment

hangc0276 left a comment

codecov-commenter commented Jun 29, 2023

		@@ -455,23 +455,17 @@ public synchronized void producerRemoved(String producerName) {
		public synchronized void purgeInactiveProducers() {

[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647

[fix][broker] Fix the publish latency spike from the contention of MessageDeduplication #20647

Conversation

codelipenghui commented Jun 26, 2023 • edited Loading

Motivation

Modifications

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

codelipenghui commented Jun 26, 2023

lhotari left a comment

Choose a reason for hiding this comment

mattisonchao commented Jun 26, 2023

eolivelli Jun 26, 2023

Choose a reason for hiding this comment

codelipenghui Jun 26, 2023

Choose a reason for hiding this comment

codelipenghui commented Jun 26, 2023

lhotari commented Jun 26, 2023

mattisonchao commented Jun 26, 2023 • edited Loading

codelipenghui commented Jun 27, 2023

eolivelli left a comment

Choose a reason for hiding this comment

hangc0276 left a comment

Choose a reason for hiding this comment

codecov-commenter commented Jun 29, 2023

Codecov Report

codelipenghui commented Jun 26, 2023 •

edited

Loading

mattisonchao commented Jun 26, 2023 •

edited

Loading