-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Optimize /metrics, fix unbounded request queue issue and fix race conditions in metricsBufferResponse mode #22494
[fix][broker] Optimize /metrics, fix unbounded request queue issue and fix race conditions in metricsBufferResponse mode #22494
Conversation
Performance test results with the changes. Reproducing the results: gh pr checkout 22494
mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true
rm -rf data
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log git clone https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# create the topics
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics
|
One interesting detail is that in the load test, the system couldn't keep up with the load without making the optimization to direct buffer allocation in commit eb53342. |
pulsar-common/src/main/java/org/apache/pulsar/common/util/SimpleTextOutputStream.java
Show resolved
Hide resolved
Most recent test run without turning on the caching:
|
I did allocation profiling with Async Profiler. Used similar commands as in #22494 (comment) , but with async profiler activated with OPTS="-agentpath:$HOME/tools/async-profiler/lib/libasyncProfiler.so=start,event=cpu,alloc=2m,lock=10ms,file=$PWD/profile.jfr" PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log With After profiling, I then use this shell script function to generate multiple flamegraph htmls out of the JFR file: |
I guess it is related to CompositeByteBuf#consolidateIfNeeded, once the number of components exceeds MAX_COMPONENT, it will combine all the components to one component, memory copy happens. |
Memory copying isn't the biggest problem. A bigger problem is direct memory OOM that seems to happen when the memory space is so fragmented that there isn't free space for the allocation requests. The Netty pool has a maximum chunk size (8MB default) and all larger allocations are huge allocations that aren't pooled. That's why I think the added logic helps since it pre-allocates up to 8MB chunks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071)
…d fix race conditions in metricsBufferResponse mode (#22494) (cherry picked from commit 7009071) # Conflicts: # pulsar-broker/src/test/java/org/apache/pulsar/broker/service/persistent/BucketDelayedDeliveryTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/service/persistent/PersistentTopicTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/service/schema/SchemaServiceTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/stats/PrometheusMetricsTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/transaction/buffer/TransactionBufferClientTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/transaction/pendingack/PendingAckPersistentTest.java # pulsar-broker/src/test/java/org/apache/pulsar/broker/web/WebServiceTest.java
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071)
With |
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
+ "the connection due to a timeout ({} ms elapsed): {}", time, e + ""); | ||
} else { | ||
log.error("Failed to generate prometheus stats, {} ms elapsed", time, e); | ||
// set hard timeout to 2 * timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this makes me happy we're moving to a consolidated maintained exporters in OTel :)
…d fix race conditions in metricsBufferResponse mode (apache#22494) (cherry picked from commit 7009071) (cherry picked from commit 5f9d7c5)
Fixes #22477
Motivation
There are multiple problems in the /metrics endpoint:
Modifications
metricsBufferResponse
mode isn't enabledmetricsBufferResponse
solution since the TimeWindow and WindowWrap classes arean't needed at all. The concurrent request combining solution can also be used formetricsBufferResponse
solution.Documentation
doc
doc-required
doc-not-needed
doc-complete