Pulsar upgrade to 3.0.5 causes prometheus metrics timeouts on brokers #22897

justin-lathrop · 2024-06-12T14:49:41Z

After upgrading from Pulsar 2.9.4 to Pulsar 3.0.5 in a Kubernetes cluster using the Pulsar Helm 3.0 chart, Prometheus metrics stopped working on the pulsar-brokers.

The pulsar-broker logs show a constant stream of 500 responses with timeouts, and 302 redirects.

... INFO org.eclipse.jetty.server.RequestLog - ... "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.42.0" 0
... WARN org.apache.pulsar.broker.stats.prometheus.PulsarPrometheusMetricsServlet - Prometheus metrics request timed out
... INFO org.eclipse.jetty.server.RequestLog - ... "GET /metrics/ HTTP/1.1" 500 - "http:/1.2.3.4:8080/metrics" "Prometheus/2.42.0" 60001

Running pulsar-admin broker-stats monitoring-metrics does return metrics, with values showing data moving through. But after exec'ing into the pulsar-broker-0 pod, `curl http://localhost:8080/metrics/` only times out.

The pulsar-proxy instance was also upgraded as part of this, and it reports metrics as expected with no timeouts when running `curl http://localhost:8080/metrics/` from within the pulsar-proxy-0 pod.

The reported error seems to come from this line in the code, but at present I do not see any of the other possible logs; it just seems to time out every time in the async process. https://github.com/apache/pulsar/blob/branch-3.0/pulsar-broker/src/main/java/org/apache/pulsar/broker/stats/prometheus/PulsarPrometheusMetricsServlet.java#L83

Any help would be greatly appreciated!

justin-lathrop · 2024-06-12T20:45:58Z

Author

I deployed the 3.0.4 release instead and the broker metrics started working again. I wonder if it's related to some of the metrics-related changes in this commit, which went into the 3.0.5 release?

7009071

15 replies

justin-lathrop Jun 18, 2024
Author

It is ancient, I agree, but it's my dev cluster and it hasn't been rebuilt in quite a while. This cluster is mostly unused; it has 12 topics in total, none of which have any data going into them.

On a different cluster where this was found, PULSAR_PREFIX_exposeTopicLevelMetricsInPrometheus: "false" was tried before the initial post, and it did not fix the problem. I checked that cluster's info out of curiosity and it's Red Hat 8, still with cgroups v1.

justin-lathrop Jun 18, 2024
Author

I tried increasing PULSAR_MEM to -Xmx512m -XX:MaxDirectMemorySize=512m, together with PULSAR_PREFIX_exposeTopicLevelMetricsInPrometheus: "false", and then did a rollout restart of the pulsar-broker sts. The metrics still hang, though.

lhotari Jun 18, 2024
Collaborator

Since it's a dev cluster, perhaps you could enable debug logging for Pulsar brokers to see if it would reveal some useful details about the problem with metrics. You can enable debug logging by setting PULSAR_LOG_LEVEL: debug in the broker's environment (in broker.configData).

lhotari Jun 18, 2024
Collaborator

> I checked that cluster's info out of curiosity and it's Red Hat 8, still with cgroups v1.

Yes, RHEL 8 defaults to cgroups v1, but it's possible to switch to cgroups v2. I don't think there's much advantage in switching. What seems to matter more with Kubernetes and Java is having a fairly recent kernel version.
Btw, it's recommended to configure Kubernetes nodes with the THP (transparent huge pages) setting set to madvise. This is also what you get on cloud provider managed Kubernetes nodes. Azul has a good guide on how to configure THP this way for RHEL and others. It's recommended to configure THP this way even when huge pages aren't used. The default /sys/kernel/mm/transparent_hugepage/enabled setting of always is bad for running Java unless -XX:+AlwaysPreTouch and MALLOC_ARENA_MAX=2 are used.
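
To see which THP mode a node is actually running with, the sysfs file mentioned above can be read directly; the kernel marks the active mode with square brackets, for example `always [madvise] never`. Below is a minimal Java sketch of that check, assuming a Linux host. The class and method names (`ThpCheck`, `activeMode`) are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pulsar: reports the active THP mode.
final class ThpCheck {
    // The active mode is the single word wrapped in square brackets.
    private static final Pattern ACTIVE = Pattern.compile("\\[(\\w+)\\]");

    // Extracts the bracketed mode from the file contents, or "unknown"
    // when no bracketed value is present.
    static String activeMode(String contents) {
        Matcher m = ACTIVE.matcher(contents);
        return m.find() ? m.group(1) : "unknown";
    }

    public static void main(String[] args) throws IOException {
        Path setting = Path.of("/sys/kernel/mm/transparent_hugepage/enabled");
        if (!Files.isReadable(setting)) {
            System.out.println("THP setting not readable at " + setting);
            return;
        }
        String mode = activeMode(Files.readString(setting));
        System.out.println("THP enabled mode: " + mode);
        if (!"madvise".equals(mode)) {
            System.out.println("Consider madvise, as recommended above.");
        }
    }
}
```

Inside a container the file reflects the host kernel's setting, so running this in a broker pod should report the mode of the underlying node.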

justin-lathrop Jul 8, 2024
Author

Hey @lhotari, sorry it's been a bit. I was able to circle back to this and realized I am also using kafka-on-pulsar 3.0.0.4 on this install. I only noticed when I added some logs to some of the pulsar-broker metrics classes and found the following exception popping up:

2024-07-05T16:35:12,221Z [jdk.internal.loader.ClassLoaders$AppClassLoader@5bc2b487] error Uncaught exception in thread pool-15-thread-76: 'org.apache.pulsar.common.util.SimpleTextOutputStream org.apache.pulsar.common.util.SimpleTextOutputStream.write(java.lang.String)'
java.lang.NoSuchMethodError: 'org.apache.pulsar.common.util.SimpleTextOutputStream org.apache.pulsar.common.util.SimpleTextOutputStream.write(java.lang.String)'
		at io.streamnative.pulsar.handlers.kop.stats.PrometheusTextFormatUtil.writeGauge(PrometheusTextFormatUtil.java:32)
		at io.streamnative.pulsar.handlers.kop.stats.PrometheusMetricsProvider.lambda$generate$1(PrometheusMetricsProvider.java:87)
		at java.base/java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1603)
		at io.streamnative.pulsar.handlers.kop.stats.PrometheusMetricsProvider.generate(PrometheusMetricsProvider.java:87)
		at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsGenerator.generateMetrics(PrometheusMetricsGenerator.java:345)
		at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsGenerator.lambda$renderToBuffer$0(PrometheusMetricsGenerator.java:528)
		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
		at java.base/java.lang.Thread.run(Thread.java:840)

I went ahead and removed the kop .nar from the installed image, and the pulsar-broker metrics began to work again. Then I pulled down the kop repo, rebuilt it against pulsar 3.0.5.5, and used that new .nar; the pulsar-broker metrics kept working.

I know the kop repo is archived but it looks like that was my issue.
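
For context on the exception above: the JVM links a method call by name, parameter types, and return type, so a NoSuchMethodError like this one means the kop .nar was compiled against a `SimpleTextOutputStream.write(String)` signature that the broker's classes no longer provide in exactly that form. The thread does not say how the signature changed in 3.0.5. The sketch below is one rough way to test for an exact signature by reflection before deploying a plugin; `SignatureCheck` and `hasExactMethod` are hypothetical names, and the Pulsar lookup in `main` only does something useful when the broker jars are on the classpath.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical helper, not part of Pulsar or KoP.
final class SignatureCheck {
    // Returns true only if the class has a public method with this exact
    // name, parameter types and return type. A missing class or a class
    // that fails to link is reported as false.
    static boolean hasExactMethod(String className, String methodName,
                                  Class<?> returnType, Class<?>... paramTypes) {
        try {
            // initialize=false: load the class without running static initializers.
            Class<?> owner = Class.forName(className, false,
                    SignatureCheck.class.getClassLoader());
            for (Method m : owner.getMethods()) {
                if (m.getName().equals(methodName)
                        && m.getReturnType().equals(returnType)
                        && Arrays.equals(m.getParameterTypes(), paramTypes)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // JDK demonstration: same name and parameters, different return type.
        System.out.println(hasExactMethod("java.lang.String", "concat",
                String.class, String.class));
        System.out.println(hasExactMethod("java.lang.String", "concat",
                Object.class, String.class));

        // The signature named in the stack trace: write(String) returning
        // SimpleTextOutputStream itself.
        String owner = "org.apache.pulsar.common.util.SimpleTextOutputStream";
        try {
            Class<?> self = Class.forName(owner, false,
                    SignatureCheck.class.getClassLoader());
            System.out.println(hasExactMethod(owner, "write", self, String.class));
        } catch (ClassNotFoundException e) {
            System.out.println(owner + " is not on the classpath");
        }
    }
}
```

A check like this only confirms presence of one signature; rebuilding the plugin against the target broker version, as described above, is what actually resolves the mismatch.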

merlimat · 2024-06-13T01:59:23Z

Collaborator

@lhotari PTAL when you have the chance

0 replies
