[fix][broker] Remove timestamp from broker metrics (#17419)

### Motivation When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior. In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker. ### Analysis Since we introduced Prometheus metrics in #294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) will essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there was new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time. This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/. Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says: > Note, however, that adding timestamps is an extremely niche use case. Most of the users who think the need it should actually not do it. > > The main usecases within that tiny niche are federation and mirroring the data from another monitoring system. The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp. The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps. As such, I think we are not a niche use case, and we should not add timestamps to our metrics. ### Reproducing the problem 1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config: ```yaml - job_name: broker honor_timestamps: true scrape_interval: 30s scrape_timeout: 10s metrics_path: /metrics scheme: http follow_redirects: true static_configs: - targets: ["localhost:8080"] ``` 2. Start pulsar standalone on the same machine. I used a recently compiled version of master. 3. Publish messages to a topic. 4. Observe `pulsar_in_messages_total` metric for the topic in the prometheus UI (localhost:9090) 5. Stop the producer. 6. Unload the topic from the broker. 7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported. 8. Watch the metrics get reported in prometheus for 5 additional minutes. When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior. ### Modifications * Remove all timestamps from metrics * Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules) ### Verifying this change This change is accompanied by updated tests. ### Does this pull request potentially affect one of the following parts: This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility. Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them. ### Documentation - [x] `doc-not-needed`
apache · Feb 8, 2023 · 62006ba · 62006ba
1 parent 9cab1ad
commit 62006ba
Show file tree

Hide file tree

Showing 6 changed files with 380 additions and 381 deletions.
diff --git a/...oker/src/main/java/org/apache/pulsar/broker/stats/prometheus/PrometheusMetricStreams.java b/...oker/src/main/java/org/apache/pulsar/broker/stats/prometheus/PrometheusMetricStreams.java
@@ -46,7 +46,7 @@ void writeSample(String metricName, Number value, String... labelsAndValuesArray
                 stream.write(',');
             }
         }
-        stream.write("} ").write(value).write(' ').write(System.currentTimeMillis()).write('\n');
+        stream.write("} ").write(value).write('\n');
     }
 
     /**

diff --git a/...r/src/main/java/org/apache/pulsar/broker/stats/prometheus/PrometheusMetricsGenerator.java b/...r/src/main/java/org/apache/pulsar/broker/stats/prometheus/PrometheusMetricsGenerator.java
@@ -304,8 +304,7 @@ private static void parseMetricsToPrometheusMetrics(Collection<Metrics> metrics,
                         appendedQuantile = true;
                     }
                 }
-                stream.write("} ").write(String.valueOf(entry.getValue()))
-                        .write(' ').write(System.currentTimeMillis()).write("\n");
+                stream.write("} ").write(String.valueOf(entry.getValue())).write("\n");
             }
         }
     }

diff --git a/pulsar-broker/src/test/java/org/apache/pulsar/broker/stats/PrometheusMetricsTest.java b/pulsar-broker/src/test/java/org/apache/pulsar/broker/stats/PrometheusMetricsTest.java
@@ -848,10 +848,10 @@ public void testPerConsumerStats() throws Exception {
      This can happen when including topic metrics, since the same metric is reported multiple times with different labels. For example:
 
      # TYPE pulsar_subscriptions_count gauge
-     pulsar_subscriptions_count{cluster="standalone"} 0 1556372982118
-     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata"} 1.0 1556372982118
-     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate"} 1.0 1556372982118
-     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments"} 1.0 1556372982118
+     pulsar_subscriptions_count{cluster="standalone"} 0
+     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata"} 1.0
+     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate"} 1.0
+     pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments"} 1.0
 
      **/
     // Running the test twice to make sure types are present when generated multiple times
@@ -1680,8 +1680,8 @@ public static Multimap<String, Metric> parseMetrics(String metrics) {
         // jvm_threads_current{cluster="standalone",} 203.0
         // or
         // pulsar_subscriptions_count{cluster="standalone", namespace="public/default",
-        // topic="persistent://public/default/test-2"} 0.0 1517945780897
-        Pattern pattern = Pattern.compile("^(\\w+)\\{([^\\}]+)\\}\\s([+-]?[\\d\\w\\.-]+)(\\s(\\d+))?$");
+        // topic="persistent://public/default/test-2"} 0.0
+        Pattern pattern = Pattern.compile("^(\\w+)\\{([^\\}]+)\\}\\s([+-]?[\\d\\w\\.-]+)$");
         Pattern tagsPattern = Pattern.compile("(\\w+)=\"([^\"]+)\"(,\\s?)?");
 
         Splitter.on("\n").split(metrics).forEach(line -> {

diff --git a/pulsar-broker/src/test/java/org/apache/pulsar/functions/worker/PulsarFunctionTestUtils.java b/pulsar-broker/src/test/java/org/apache/pulsar/functions/worker/PulsarFunctionTestUtils.java
@@ -74,8 +74,8 @@ public static Map<String, Metric> parseMetrics(String metrics) {
         // jvm_threads_current{cluster="standalone",} 203.0
         // or
         // pulsar_subscriptions_count{cluster="standalone", namespace="public/default",
-        // topic="persistent://public/default/test-2"} 0.0 1517945780897
-        Pattern pattern = Pattern.compile("^(\\w+)(\\{[^\\}]+\\})?\\s([+-]?[\\d\\w\\.-]+)(\\s(\\d+))?$");
+        // topic="persistent://public/default/test-2"} 0.0
+        Pattern pattern = Pattern.compile("^(\\w+)(\\{[^\\}]+\\})?\\s([+-]?[\\d\\w\\.-]+)$");
         Pattern tagsPattern = Pattern.compile("(\\w+)=\"([^\"]+)\"(,\\s?)?");
         Arrays.asList(metrics.split("\n")).forEach(line -> {
             if (line.isEmpty() || line.startsWith("#")) {
-Original file line number
+Diff line change
@@ Expand Up @@
                     stream.write(',');
                 }
             }
-            stream.write("} ").write(value).write(' ').write(System.currentTimeMillis()).write('\n');
+            stream.write("} ").write(value).write('\n');
         }
         /**
@@ Expand Down @@