
[fix][broker] Remove timestamp from Prometheus metrics #17419

Conversation

Member

@michaeljmarshall michaeljmarshall commented Sep 2, 2022

Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

Analysis

Since we introduced Prometheus metrics in #294, we have exported a timestamp along with most metrics. This is an optional, valid part of the exposition format spec: https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information. However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that appeared in one scrape but not the next (which often happens for topics that are unloaded) would essentially inherit its most recent value for up to 5 minutes. Only if there was no value in the past 5 minutes did the metric become "stale" and stop being reported. Starting in 2.0, new logic considers a series stale the very first time it is not reported in a scrape. Importantly, this new behavior only applies if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want the new behavior because it gives better insight into topic metrics, which are subject to move between brokers at any time.
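
To make this concrete, here is what a broker-exported sample looks like in the Prometheus text exposition format with and without the optional trailing timestamp (the values and labels here are hypothetical); the improved staleness handling only applies to the second form:

```
# TYPE pulsar_in_messages_total counter
# With an exporter-supplied timestamp (milliseconds since epoch), staleness handling is bypassed:
pulsar_in_messages_total{cluster="standalone",topic="persistent://public/default/my-topic"} 42 1662076800000
# Without a timestamp, Prometheus assigns the scrape time and can mark the series stale immediately:
pulsar_in_messages_total{cluster="standalone",topic="persistent://public/default/my-topic"} 42
```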

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one: https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use
> case. Most of the users who think they need it should actually not do
> it.
>
> The main usecases within that tiny niche are federation and mirroring
> the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
```yaml
  - job_name: broker
    honor_timestamps: true
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets: ["localhost:8080"]
```
2. Start Pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe the `pulsar_in_messages_total` metric for the topic in the Prometheus UI (localhost:9090).
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported (see the example below).
8. Watch the metrics continue to be reported in Prometheus for 5 additional minutes.
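
For step 7, a quick spot check could look like this (the topic name here is a placeholder):

```shell
curl -s localhost:8080/metrics | grep 'pulsar_in_messages_total{.*my-topic'
```

If the topic has been unloaded, the command prints nothing, yet Prometheus keeps answering queries for the series for up to 5 more minutes.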

When you set `honor_timestamps: false`, the metric stops being reported right after the topic is unloaded, which is the desired behavior.
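
Until a broker with this fix is deployed, the workaround is that one-line change to the scrape config above; a minimal sketch:

```yaml
  - job_name: broker
    # Ignore exporter-supplied timestamps so Prometheus 2.x staleness handling applies:
    honor_timestamps: false
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]
```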

Modifications

  • Remove all timestamps from metrics (see the sketch below for the shape of the change)
  • Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)
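
The broker builds its Prometheus output as plain exposition text, so conceptually the change is just to stop appending the collection time when each sample line is written. A minimal Java sketch of the idea, with illustrative names that do not necessarily match Pulsar's actual metrics generator:

```java
// Sketch only: class and method names are illustrative, not Pulsar's real ones.
final class SampleWriter {

    // Writes one exposition-format sample line, e.g.
    //   pulsar_in_messages_total{topic="t"} 42.0
    static void writeSample(StringBuilder out, String name, String labels, double value) {
        out.append(name).append(labels).append(' ').append(value);
        // Before this fix, the broker appended the collection time here:
        //   out.append(' ').append(System.currentTimeMillis());
        // Omitting it lets Prometheus assign the scrape time and apply
        // per-scrape staleness markers (Prometheus >= 2.0).
        out.append('\n');
    }
}
```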

Verifying this change

This change is accompanied by updated tests.

Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between scrapes, I think this is an important fix that should be cherry-picked to older release branches. Technically, we could avoid cherry-picking this change if we advised users to set `honor_timestamps: false`. However, I think it is better to just remove the timestamps.

Documentation

  • doc-not-needed

Contributor

@eolivelli eolivelli left a comment

Great catch!
I thought there was a problem in topic unloading, but this is a great explanation.

Eager to see this released

Member

@lhotari lhotari left a comment

Great work @michaeljmarshall

@lhotari
Member

lhotari commented Sep 2, 2022

This change also makes the broker consistent with ZK and BK, since the Prometheus metrics for ZK and BK don't include timestamps.

I verified this on a local microk8s cluster by opening a shell to a ZK and a BK pod:

no timestamps in ZK metrics

```
I have no name!@pulsar-testenv-pulsar-zookeeper-0:/pulsar$ curl -s http://localhost:8000/metrics|tail -n 10
process_open_fds 314.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65536.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 5.394948096E9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.96804608E8
```

no timestamps in BK metrics

```
I have no name!@pulsar-testenv-pulsar-bookkeeper-0:/pulsar$ curl -s localhost:8000/metrics | tail -n 10
bookie_bookie_zk_create_sum{success="false"} 0.0
bookie_bookie_zk_create{success="true",quantile="0.5"} NaN
bookie_bookie_zk_create{success="true",quantile="0.75"} NaN
bookie_bookie_zk_create{success="true",quantile="0.95"} NaN
bookie_bookie_zk_create{success="true",quantile="0.99"} NaN
bookie_bookie_zk_create{success="true",quantile="0.999"} NaN
bookie_bookie_zk_create{success="true",quantile="0.9999"} NaN
bookie_bookie_zk_create{success="true",quantile="1.0"} NaN
bookie_bookie_zk_create_count{success="true"} 3
bookie_bookie_zk_create_sum{success="true"} 12.0
```

Contributor

@nicoloboschi nicoloboschi left a comment

Great work @michaeljmarshall

@eolivelli
Contributor

@michaeljmarshall can you please resolve the conflicts?

@michaeljmarshall michaeljmarshall force-pushed the remove-timestamp-from-metrics branch from b5cb02d to 269d398 on September 2, 2022 16:50
@michaeljmarshall
Member Author

Done. #15558 reduced the number of places we add the timestamp, so the diff is slightly smaller now.

@michaeljmarshall michaeljmarshall merged commit 0bbc4e1 into apache:master Sep 7, 2022
@michaeljmarshall michaeljmarshall deleted the remove-timestamp-from-metrics branch September 7, 2022 03:02
@mattisonchao
Member

Hello @michaeljmarshall,
It looks like we got many conflicts when cherry-picking it to branch-2.9.
Would you mind helping cherry-pick it (to avoid introducing bugs)?

@michaeljmarshall
Member Author

Hi @mattisonchao, it's because this PR relies on #15558. I have been trying to figure out whether we can/should cherry-pick that PR. If we do not, we should cherry-pick commit b5cb02d instead, which was my original work and should have fewer conflicts. Do you have an opinion on #15558? (I am happy to help cherry-pick the commit; I just need to figure out what to cherry-pick first.)

@mattisonchao
Member

mattisonchao commented Sep 13, 2022

@michaeljmarshall
I left a comment at #15558; once it gets cherry-picked, we can do the next step.
Thanks very much for your help.

@congbobo184
Contributor

congbobo184 commented Nov 15, 2022

Hi @michaeljmarshall, could you please cherry-pick this PR to branch-2.9? Thanks.

@congbobo184
Contributor

Hi @michaeljmarshall, I moved this PR to release/2.9.5. If you have any questions, please ping me. Thanks.

nicoloboschi added a commit to datastax/pulsar that referenced this pull request Jan 13, 2023
Technoboy- pushed a commit that referenced this pull request Feb 8, 2023
liangyepianzhou pushed a commit that referenced this pull request Feb 10, 2023
coderzc pushed a commit that referenced this pull request Feb 28, 2023
@coderzc coderzc added the cherry-picked/branch-2.9 label Feb 28, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Feb 28, 2023
Annavar-satish pushed a commit to pandio-com/pulsar that referenced this pull request Mar 6, 2023
@momo-jun momo-jun changed the title [fix][broker] Remove timestamp from Promtheus metrics [fix][broker] Remove timestamp from Prometheus metrics Apr 20, 2023