
[BUG] Questions about pulsar broker direct OOM #12169

Open
wenbingshen opened this issue Sep 24, 2021 · 6 comments
Labels: lifecycle/stale, type/bug

wenbingshen commented Sep 24, 2021

Describe the bug

Pulsar and BookKeeper version:
Pulsar 2.8.0 with its built-in BookKeeper
Cluster with 5 brokers and 5 bookies

To figure out why the Pulsar broker's direct memory runs into OOM, I tested different scenarios and got differing results.

After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.

As shown in the figure below, I suspect that a large number of entries written to the bookies have not received all of their write-quorum (WQ) responses, which prevents the PendingAddOp instances from being recycled or destroyed.

[Heap dump screenshot: retained PendingAddOp instances]

Therefore, following #7406 and #6178, I used maxMessagePublishBufferSizeInMB to limit the publish traffic the broker buffers.
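
A minimal sketch of the relevant broker.conf setting, using the value from test case 1 below (per #7406, the limit works by pausing reads from producer connections once the buffered bytes exceed it; the default and the -1 behaviour are described in the test list):

# broker.conf: cap on the bytes of publish requests the broker buffers
# before it pauses reading from producer connections.
# 512 MB as in test case 1; -1 disables the limit; the default is half of
# the broker's maximum direct memory.
maxMessagePublishBufferSizeInMB=512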

Here are my test results (E:W:A denotes ensemble size : write quorum : ack quorum; see the broker.conf sketch after this list):

  1. With maxMessagePublishBufferSizeInMB=512 and E:W:A=3:3:2, OOM still occurs after the stress test.
  2. With maxMessagePublishBufferSizeInMB=512 and E:W:A=3:3:3, 3:2:2, or 2:2:2, direct memory stays normal after the stress test.
  3. With maxMessagePublishBufferSizeInMB=2048 and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  4. With maxMessagePublishBufferSizeInMB left at its default, which is half of the maximum allocated direct memory (8 GB / 2 = 4 GB in this test), and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  5. With maxMessagePublishBufferSizeInMB=-1 (throttling disabled) and E:W:A=3:3:3 or 3:2:2, direct memory stays normal after the stress test.
  6. With maxMessagePublishBufferSizeInMB=-1 (throttling disabled) and E:W:A=3:3:2, OOM occurs after the stress test.
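
The E:W:A values above are the ensemble size, write quorum, and ack quorum used for new ledgers. Assuming they were set through the broker's managed-ledger defaults, the broker.conf keys look like this (values shown for the 3:3:2 case, the only combination that hits OOM):

# broker.conf: ensemble size / write quorum / ack quorum for new ledgers
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2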

The questions below are also related to #9562.

My question is: regardless of whether maxMessagePublishBufferSizeInMB is configured,
as long as AQ = WQ, direct memory stays normal,
and as long as AQ < WQ, direct memory runs into OOM.
This may be related to the bookies' processing logic, but how does maxMessagePublishBufferSizeInMB actually work?
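
For context, below is a minimal, hypothetical sketch (not Pulsar's actual code; the class and method names are made up) of how a publish-buffer limit of this kind typically works: the broker counts the bytes of publish requests in flight and pauses reading from producers once the counter exceeds the limit. If, as suspected above, the counter is decremented once the ack quorum (AQ) responds while the PendingAddOp still holds the entry until every write-quorum (WQ) response arrives, then with AQ < WQ the limiter stops counting memory that is in fact still retained, which would explain why the limit does not prevent the OOM.

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a simplified publish-buffer limiter for illustration.
public class PublishBufferLimiter {
    private final long maxBufferBytes;                 // e.g. 512 * 1024 * 1024
    private final AtomicLong pendingBytes = new AtomicLong();

    public PublishBufferLimiter(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Called when a publish request is read from a producer connection.
    // Returns false when the broker should pause reading from producers.
    public boolean onMessageReceived(long entrySize) {
        return pendingBytes.addAndGet(entrySize) <= maxBufferBytes;
    }

    // Called when the entry is considered persisted (ack quorum responded).
    // If the entry buffer is still held by a PendingAddOp waiting for the
    // remaining write-quorum responses, that memory is no longer counted
    // here; this is the gap the AQ < WQ tests above seem to expose.
    public boolean onMessagePersisted(long entrySize) {
        return pendingBytes.addAndGet(-entrySize) <= maxBufferBytes;
    }

    public static void main(String[] args) {
        PublishBufferLimiter limiter = new PublishBufferLimiter(512L * 1024 * 1024);
        System.out.println(limiter.onMessageReceived(1024));   // true: under the limit
        System.out.println(limiter.onMessagePersisted(1024));  // true: counter back to zero
    }
}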

Apart from the different E:W:A ratios, all tests use the same configuration and involve only writes, with no consumption.

Workload YAML:

topics: 1
partitionsPerTopic: 2
messageSize: 1024
payloadFile: "payload/payload-1Kb.data"
subscriptionsPerTopic: 0
consumerPerSubscription: 0
producersPerTopic: 2
producerRate: 880000000
consumerBacklogSizeGB: 0
testDurationMinutes: 60
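
The field names above match an OpenMessaging Benchmark workload file; assuming that tool was used, a run would look roughly like this (the workload file name here is a placeholder):

bin/benchmark \
  --drivers driver-pulsar/pulsar.yaml \
  workloads/publish-only-1kb.yaml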

@wenbingshen (Member Author)
ping @merlimat @codelipenghui @lhotari PTAL

lhotari commented Sep 24, 2021

@wenbingshen Do you have a chance to test with 2.8.1? It contains quite a few fixes; it would be good to see if there's a difference.

> producerRate: 880000000

I assume you are intentionally testing an overload situation?

> After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.

That is probably expected if there's such a high load on the system.

One possibility to protect from overload is to configure rate limiters on the system.
However, it would be good if Pulsar had backpressure (even without rate limiting configured) to prevent the system from getting into a state where it breaks because of OOM. One such improvement suggestion is documented in #10439.
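
For reference, a sketch of the kind of rate limiting meant here, assuming the broker-level publish-throttling keys and the pulsar-admin namespace command available in recent Pulsar versions (exact names and values should be checked against the version in use):

# broker.conf: broker-wide publish throttling
brokerPublisherThrottlingMaxMessageRate=50000
brokerPublisherThrottlingMaxByteRate=52428800

# or per namespace via the admin CLI
bin/pulsar-admin namespaces set-publish-rate \
  --msg-publish-rate 50000 --byte-publish-rate 52428800 \
  public/default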

wenbingshen commented Sep 24, 2021

> @wenbingshen Do you have a chance to test with 2.8.1? It contains quite a few fixes; it would be good to see if there's a difference.
>
> producerRate: 880000000
>
> I assume you are intentionally testing an overload situation?
>
> After analyzing a Pulsar broker heap dump, I found that a large number of PendingAddOp instances had not been recycled or destroyed.
>
> That is probably expected if there's such a high load on the system.
>
> One possibility to protect from overload is to configure rate limiters on the system.
> However, it would be good if Pulsar had backpressure (even without rate limiting configured) to prevent the system from getting into a state where it breaks because of OOM. One such improvement suggestion is documented in #10439.

@lhotari Thank you very much for your reply. I don't know much about BookKeeper's backpressure mechanism and related parameters; I will study that later. The question I actually want to understand is this:
Keeping producerRate at 880000000, I ran the following four tests with the same traffic:

  1. With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  2. With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.
  3. With maxMessagePublishBufferSizeInMB = -1, i.e. with the throttling turned off, OOM occurs when E:W:A = 3:3:2.
  4. With maxMessagePublishBufferSizeInMB = -1, i.e. with the throttling turned off, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.

Comparing 1 with 3 and 2 with 4, it makes no difference whether maxMessagePublishBufferSizeInMB turns the throttling on or off; the results are the same. So what does maxMessagePublishBufferSizeInMB actually accomplish?

github-actions bot commented Mar 1, 2022

The issue had no activity for 30 days, mark with Stale label.

lhotari commented Sep 27, 2022

  > • With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  > • With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.

@wenbingshen this seems to match the problem description of #14861

@wenbingshen (Member Author)
  > • With maxMessagePublishBufferSizeInMB > 0, OOM occurs when E:W:A = 3:3:2.
  > • With maxMessagePublishBufferSizeInMB > 0, OOM does not occur when E:W:A = 3:3:3, 3:2:2, or 2:2:2.
  >
  > @wenbingshen this seems to match the problem description of #14861

@lhotari You are right, same problem.
