
[BUG] Questions about pulsar broker direct OOM #12169

Open
wenbingshen opened this issue Sep 24, 2021 · 6 comments
Labels: lifecycle/stale, type/bug

wenbingshen commented Sep 24, 2021

Describe the bug

Pulsar and BookKeeper version:
Pulsar 2.8.0 with its built-in BookKeeper
Cluster with 5 brokers and 5 bookies

To figure out why the Pulsar broker's direct memory runs into OOM, I tested different scenarios and got differing results.

After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.

As shown in the figure below, I suspect that a large number of entries written to the bookies have not received all of their write-quorum (WQ) responses, which prevents the PendingAddOp instances from being recycled or destroyed.

[Heap dump screenshot: retained PendingAddOp instances]

Therefore, following #7406 and #6178, I used maxMessagePublishBufferSizeInMB to limit the publish traffic the broker buffers.
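
A minimal sketch of the relevant broker.conf setting, using the value from test case 1 below (per #7406, the limit works by pausing reads from producer connections once the buffered bytes exceed it; the default and the -1 behaviour are described in the test list):

# broker.conf: cap on the bytes of publish requests the broker buffers
# before it pauses reading from producer connections.
# 512 MB as in test case 1; -1 disables the limit; the default is half of
# the broker's maximum direct memory.
maxMessagePublishBufferSizeInMB=512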

Here are my test results (E:W:A denotes ensemble size : write quorum : ack quorum; see the broker.conf sketch after this list):

  1. With maxMessagePublishBufferSizeInMB=512 and E:W:A=3:3:2, OOM still occurs after the stress test.
  2. With maxMessagePublishBufferSizeInMB=512 and E:W:A=3:3:3, 3:2:2, or 2:2:2, direct memory stays normal after the stress test.
  3. With maxMessagePublishBufferSizeInMB=2048 and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  4. With maxMessagePublishBufferSizeInMB left at its default, which is half of the maximum allocated direct memory (8 GB / 2 = 4 GB in this test), and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  5. With maxMessagePublishBufferSizeInMB=-1 (throttling disabled) and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  6. With maxMessagePublishBufferSizeInMB=-1 (throttling disabled) and E:W:A=3:3:2, OOM occurs after the stress test.
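
The E:W:A values above are the ensemble size, write quorum, and ack quorum used for new ledgers. Assuming they were set through the broker's managed-ledger defaults, the broker.conf keys look like this (values shown for the 3:3:2 case, the only combination that hits OOM):

# broker.conf: ensemble size / write quorum / ack quorum for new ledgers
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2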

The questions below are also related to #9562.

My question is: regardless of whether maxMessagePublishBufferSizeInMB is configured,
as long as AQ = WQ, direct memory stays normal,
and as long as AQ < WQ, direct memory runs into OOM.
This may be related to the bookies' processing logic, but how does maxMessagePublishBufferSizeInMB actually work?
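
For context, below is a minimal, hypothetical sketch (not Pulsar's actual code; the class and method names are made up) of how a publish-buffer limit of this kind typically works: the broker counts the bytes of publish requests in flight and pauses reading from producers once the counter exceeds the limit. If, as suspected above, the counter is decremented once the ack quorum (AQ) responds while the PendingAddOp still holds the entry until every write-quorum (WQ) response arrives, then with AQ < WQ the limiter stops counting memory that is in fact still retained, which would explain why the limit does not prevent the OOM.

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a simplified publish-buffer limiter for illustration.
public class PublishBufferLimiter {
    private final long maxBufferBytes;                 // e.g. 512 * 1024 * 1024
    private final AtomicLong pendingBytes = new AtomicLong();

    public PublishBufferLimiter(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Called when a publish request is read from a producer connection.
    // Returns false when the broker should pause reading from producers.
    public boolean onMessageReceived(long entrySize) {
        return pendingBytes.addAndGet(entrySize) <= maxBufferBytes;
    }

    // Called when the entry is considered persisted (ack quorum responded).
    // If the entry buffer is still held by a PendingAddOp waiting for the
    // remaining write-quorum responses, that memory is no longer counted
    // here; this is the gap the AQ < WQ tests above seem to expose.
    public boolean onMessagePersisted(long entrySize) {
        return pendingBytes.addAndGet(-entrySize) <= maxBufferBytes;
    }

    public static void main(String[] args) {
        PublishBufferLimiter limiter = new PublishBufferLimiter(512L * 1024 * 1024);
        System.out.println(limiter.onMessageReceived(1024));   // true: under the limit
        System.out.println(limiter.onMessagePersisted(1024));  // true: counter back to zero
    }
}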

Apart from the different E:W:A ratios, all tests use the same configuration and involve only writes, with no consumption.

Workload YAML:

topics: 1
partitionsPerTopic: 2
messageSize: 1024
payloadFile: "payload/payload-1Kb.data"
subscriptionsPerTopic: 0
consumerPerSubscription: 0
producersPerTopic: 2
producerRate: 880000000
consumerBacklogSizeGB: 0
testDurationMinutes: 60
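
The field names above match an OpenMessaging Benchmark workload file; assuming that tool was used, a run would look roughly like this (the workload file name here is a placeholder):

bin/benchmark \
  --drivers driver-pulsar/pulsar.yaml \
  workloads/publish-only-1kb.yaml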

@wenbingshen (Member Author)
ping @merlimat @codelipenghui @lhotari PTAL

lhotari commented Sep 24, 2021

@wenbingshen Do you have a chance to test with 2.8.1? It contains quite a few fixes; it would be good to see if there's a difference.

> producerRate: 880000000

I assume you are intentionally testing an overload situation?

> After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.

That is probably expected if there's such a high load on the system.

One possibility to protect from overload is to configure rate limiters on the system.
However, it would be good if Pulsar had backpressure (even without rate limiting configured) to prevent the system from getting into a state where it breaks because of OOM. One such improvement suggestion is documented in #10439.
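
For reference, a sketch of the kind of rate limiting meant here, assuming the broker-level publish-throttling keys and the pulsar-admin namespace command available in recent Pulsar versions (exact names and values should be checked against the version in use):

# broker.conf: broker-wide publish throttling
brokerPublisherThrottlingMaxMessageRate=50000
brokerPublisherThrottlingMaxByteRate=52428800

# or per namespace via the admin CLI
bin/pulsar-admin namespaces set-publish-rate \
  --msg-publish-rate 50000 --byte-publish-rate 52428800 \
  public/default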

wenbingshen commented Sep 24, 2021

> @wenbingshen Do you have a chance to test with 2.8.1? It contains quite a few fixes; it would be good to see if there's a difference.
>
> producerRate: 880000000
>
> I assume you are intentionally testing an overload situation?
>
> After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.
>
> That is probably expected if there's such a high load on the system.
>
> One possibility to protect from overload is to configure rate limiters on the system.
> However, it would be good if Pulsar had backpressure (even without rate limiting configured) to prevent the system from getting into a state where it breaks because of OOM. One such improvement suggestion is documented in #10439.

@lhotari Thank you very much for your reply. I don't know much about BookKeeper's backpressure mechanism and related parameters; I will study that later. The question I actually want to understand is this:
Keeping producerRate at 880000000, I ran the following four tests with the same traffic:

  1. With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  2. With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.
  3. With maxMessagePublishBufferSizeInMB = -1, i.e. with the throttling turned off, OOM occurs when E:W:A = 3:3:2.
  4. With maxMessagePublishBufferSizeInMB = -1, i.e. with the throttling turned off, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.

Comparing 1 with 3 and 2 with 4, it makes no difference whether maxMessagePublishBufferSizeInMB turns the throttling on or off; the results are the same. So what does maxMessagePublishBufferSizeInMB actually accomplish?

github-actions bot commented Mar 1, 2022

The issue had no activity for 30 days, mark with Stale label.

lhotari commented Sep 27, 2022

  > • With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  > • With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.

@wenbingshen this seems to match the problem description of #14861

@wenbingshen (Member Author)
  > • With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  > • With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.
  >
  > @wenbingshen this seems to match the problem description of #14861

@lhotari You are right, same problem.
