"wake up" internal prometheus scrapper metrics (up / scrape_xxxx) #3116

Merged: 22 commits merged into open-telemetry:main on Jun 21, 2021

Conversation

@gillg gillg commented May 6, 2021

Description:
This duplicates the goal of #2918 but with a completely different approach.
It solves at least #3089, but also other related bugs about the metrics auto-generated by the Prometheus scraper: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series

Link to tracking Issue:
#3089

Resolves:
open-telemetry/wg-prometheus#8
Maybe open-telemetry/wg-prometheus#41?

Testing:
No new tests, because these metrics follow the same path as the other metrics.
I just "wake up" dormant metrics that previously had no metadata.

Documentation:
Nothing more; the new metrics up and scrape_xxxx will be available internally, and therefore on the exporter side.

@gillg gillg requested a review from a team May 6, 2021 07:26

linux-foundation-easycla bot commented May 6, 2021

CLA Signed

The committers are authorized under a signed CLA.


gillg commented May 6, 2021

@odeke-em I was not able to contribute to your PR, and because the approach is very different I preferred to create a new one to discuss it with the maintainers.


gillg commented May 6, 2021

Metrics on the exporter side after my second commit:

# HELP scrape_duration_seconds Duration of the scrape
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{otel_job="grafana"} 0.022050369 1620294333695
scrape_duration_seconds{otel_job="otel-collector"} 0.001391824 1620294335369
scrape_duration_seconds{otel_job="thanos-compactor"} 0.000312828 1620294327957
# HELP scrape_samples_post_metric_relabeling The number of samples remaining after metric relabeling was applied
# TYPE scrape_samples_post_metric_relabeling gauge
scrape_samples_post_metric_relabeling{otel_job="grafana"} 414 1620294333695
scrape_samples_post_metric_relabeling{otel_job="otel-collector"} 40 1620294335369
scrape_samples_post_metric_relabeling{otel_job="thanos-compactor"} 0 1620294327957
# HELP scrape_samples_scraped The number of samples the target exposed
# TYPE scrape_samples_scraped gauge
scrape_samples_scraped{otel_job="grafana"} 414 1620294333695
scrape_samples_scraped{otel_job="otel-collector"} 40 1620294335369
scrape_samples_scraped{otel_job="thanos-compactor"} 0 1620294327957
# HELP scrape_series_added The approximate number of new series in this scrape
# TYPE scrape_series_added gauge
scrape_series_added{otel_job="grafana"} 414 1620294333695
scrape_series_added{otel_job="otel-collector"} 40 1620294335369
scrape_series_added{otel_job="thanos-compactor"} 0 1620294327957
# HELP up The scraping was sucessful
# TYPE up gauge
up{otel_job="grafana"} 1 1620294333695
up{otel_job="otel-collector"} 1 1620294335369
up{otel_job="thanos-compactor"} 0 1620294327957

Scrape config used:

      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['otel-collector:8888']
          relabel_configs:
          # Trick because the otel collector does not expose the job, and to avoid "honor_labels" on the Prometheus side
          - action: replace
            replacement: otel-collector
            target_label: otel_job
        - job_name: thanos-compactor
          static_configs:
            - targets: ['172.17.0.1:10942']
          relabel_configs:
          # Trick because the otel collector does not expose the job, and to avoid "honor_labels" on the Prometheus side
          - action: replace
            replacement: thanos-compactor
            target_label: otel_job
        - job_name: grafana
          static_configs:
            - targets: ['172.17.0.1:3000']
          relabel_configs:
          # Trick because the otel collector does not expose the job, and to avoid "honor_labels" on the Prometheus side
          - action: replace
            replacement: grafana
            target_label: otel_job


@dashpole dashpole left a comment


This is the approach I think we should take. Just clean up some of the extra debug code you added.


gillg commented May 6, 2021

OK, in fact there is no "useless" code, but I would like to uncomment my comments to enable a logger at those points.
These logs could be very useful to understand what happens. Maybe "Debug" could become "Trace".

I'll take any help to instantiate a logger.


gillg commented May 6, 2021

Fixing the tests is becoming a nightmare... I need help with receiver/prometheusreceiver/metrics_receiver_test.go! 🆘 🙏 😅


rakyll commented May 6, 2021

There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go.


@gillg gillg left a comment


There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go.

Thanks @rakyll!
It definitely works for now; I fixed a Prometheus exporter test and some other internals, but I'm a little confused by the current test logic. It's complicated because the newly introduced metrics should not be visible in the fake metrics documents, yet they do count toward the internal metrics. So I have the feeling we should change the doCompare method a little, but I'm not sure.

I also need a little help to instantiate a logger and uncomment my log lines.

receiver/prometheusreceiver/internal/metricsbuilder.go (outdated review thread, resolved)
@@ -133,6 +132,42 @@ func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error
return b.currentMf.Add(metricName, ls, t, v)
}

func (b *metricBuilder) defineInternalMetric(metricName string) {
metadata, ok := b.mc.Metadata(metricName)
@gillg (Contributor Author) commented:

In fact the internal metrics have empty metadata, but they are recorded correctly.
A simple solution is to provide manual metadata. Even if we uncomment return nil above, they will be dropped later because they match the "unspecified" type here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/receiver/prometheusreceiver/internal/metricsbuilder.go#L244

I made this approach work perfectly when I rewrote the metadata in the newMetricFamiliy constructor, but here I try to fix all internal metrics before sending them to the metric family.
Here... probably because of a change made without keeping a reference to the metadata object, the metadata is always empty by the time it reaches metricfamily.
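
For readers piecing this together from the scattered diff fragments in this thread, here is a minimal consolidated sketch of the idea: hand the scraper's auto-generated series the metadata they normally lack, so they survive the "unspecified type" filter. The helper names and the gauge type follow the fragments quoted in this thread, the help texts come from the exporter output above, and the units are omitted per the review discussion below; this is an illustration, not the exact merged code.

package internal

import (
	"github.com/prometheus/prometheus/pkg/textparse"
	"github.com/prometheus/prometheus/scrape"
)

// Sketch only: recognise the Prometheus-internal series (up and the scrape_*
// metrics) that arrive from the scrape loop without HELP/TYPE.
func isInternalMetric(metricName string) bool {
	switch metricName {
	case "up", "scrape_duration_seconds", "scrape_samples_scraped",
		"scrape_samples_post_metric_relabeling", "scrape_series_added":
		return true
	}
	return false
}

// Sketch only: fill in manual metadata for an internal metric so it is not
// dropped later as an "unspecified" metric.
func defineInternalMetric(metricName string, metadata scrape.MetricMetadata) scrape.MetricMetadata {
	metadata.Metric = metricName
	metadata.Type = textparse.MetricTypeGauge
	switch metricName {
	case "up":
		metadata.Help = "The scraping was successful"
	case "scrape_duration_seconds":
		metadata.Help = "Duration of the scrape"
	case "scrape_samples_scraped":
		metadata.Help = "The number of samples the target exposed"
	case "scrape_samples_post_metric_relabeling":
		metadata.Help = "The number of samples remaining after metric relabeling was applied"
	case "scrape_series_added":
		metadata.Help = "The approximate number of new series in this scrape"
	}
	return metadata
}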

} else if !ok && isInternalMetric(metricName) {
metadata = defineInternalMetric(metricName, metadata)
}
//TODO convert it to OtelMetrics ?
@gillg (Contributor Author) commented:

What do we do here?

receiver/prometheusreceiver/internal/metricfamily.go (outdated review thread, resolved)
receiver/prometheusreceiver/internal/metricfamily.go (outdated review thread, resolved)
receiver/prometheusreceiver/metrics_receiver_test.go (outdated review thread, resolved)
`test_scrape_series_added 13`,
`. HELP test_up The scraping was sucessful`,
`. TYPE test_up gauge`,
`test_up 1`,
@gillg (Contributor Author) commented:

Implicitly, the Prometheus exporter exposes the new metrics by default (and it makes sense; otherwise any Prometheus server on top of it would be unable to know whether a "sub-job" fails or not).


switch metricName {
case scrapeUpMetricName:
metadata.Unit = "bool"
@gillg (Contributor Author) commented:

Is there a convention about that?

A reviewer (Contributor) replied:

Units appear to be specific to OpenMetrics, rather than prometheus text format. See prometheus/prometheus/pkg/textparse/promparse.go#L195. For OpenMetrics, here is the documentation for units: OpenObservability/OpenMetrics/specification/OpenMetrics.md#units-and-base-units. The units we should stick to include (from the OpenMetrics link): seconds, bytes, joules, grams, meters, ratios, volts, amperes, and celsius. The "up" metric probably shouldn't have units in that case.

metadata.Type = textparse.MetricTypeGauge
metadata.Help = "The approximate number of new series in this scrape"
case "scrape_samples_post_metric_relabeling":
metadata.Unit = "count"
@gillg (Contributor Author) commented:

Is there a convention about that?

A reviewer (Contributor) replied:

no unit here. See the above comment.

metadata.Type = textparse.MetricTypeGauge
metadata.Help = "The scraping was sucessful"
case "scrape_duration_seconds":
metadata.Unit = "seconds"
@gillg (Contributor Author) commented:

Is there a convention about that? I saw "s" in some test scenarios instead of the full word "seconds".

A reviewer (Contributor) replied:

See the above. We should use "seconds" here.


rakyll commented May 11, 2021

cc @odeke-em

@gillg gillg changed the title from 'Try to make internal prometheus scrapper metrics working' to '"wake up" internal prometheus scrapper metrics (up / scrape_xxxx)' on May 17, 2021
@dashpole (Contributor) commented:

Discussed at the prometheus wg meeting today. One request is to keep this PR narrowly tailored to the issue at hand. Can we only add the "up" metric for now, and defer on the other ones? We weren't able to reach consensus as to whether the other metrics should be added during the meeting.

Until open-telemetry/wg-prometheus#52 is resolved, we should only add the 'up' metric.


gillg commented May 19, 2021

Discussed at the prometheus wg meeting today. One request is to keep this PR narrowly tailored to the issue at hand. Can we only add the "up" metric for now, and defer on the other ones? We weren't able to reach consensus as to whether the other metrics should be added during the meeting.

Until open-telemetry/wg-prometheus#52 is resolved, we should only add the 'up' metric.

Thank you for this information! I don't understand why we would not implement all the native metrics, but yes, it's really simple to only fix "up". We just need to remove the unwanted cases here https://github.com/open-telemetry/opentelemetry-collector/pull/3116/files#diff-a71211e5426c3d12d9c3c0b5991e4b284568b76310438f884a37ce20655327f4R88 and it will work as previously: the other metrics will have no type and will be rejected later in the chain.

Today the main problem is to fix the test cases, and I need help to instantiate a logger where it's useful (I committed commented-out lines with a fake logger).
Last question: is metadata.Unit = "bool" OK as the OTel unit for "up"?

@bogdandrutu (Member) commented:

@rakyll @Aneurysm9 @dashpole as Prometheus experts please review and help this person.


gillg commented May 19, 2021

@rakyll @Aneurysm9 @dashpole as Prometheus experts please review and help this person.

Thank you @bogdandrutu (and all the team 😅)! Tell me if you need access to my fork, or just guide me in the comments. I can do it by myself, but I don't know the right approach to inject the logger cleanly.

About the metrics: should I remove the cases other than "up" and prepare a new PR in parallel for the "scrape_xxx" metrics?

@alolita alolita added the "duplicate" label (This issue or pull request already exists) on May 19, 2021
@Aneurysm9 (Member) commented:

I also think this is the correct approach to take. It fully resolves the compliance test issues related to the missing up metric and appears to be a simple solution, even if the existing tests make it somewhat painful.

I managed to successfully wrangle these tests when fixing the receiver's start time adjustment logic and expect it will require similar changes (hence the current conflict with the main branch since that PR was merged). I've blocked off some time tomorrow to try to work on making them less awful.

As for whether it makes sense to only handle up in this PR, I think the work is already done to handle all of these internal metrics so we might as well carry on with that. No point adding more work down the road when we'd undoubtedly still have to do something to deal with testing those new metrics.

@dashpole (Contributor) commented:

As for whether it makes sense to only handle up in this PR, I think the work is already done to handle all of these internal metrics so we might as well carry on with that. No point adding more work down the road when we'd undoubtedly still have to do something to deal with testing those new metrics.

I'm also OK with that. I figured it would be easier to have to only wrangle the tests for one metric, but if it isn't much more effort, then we can do them all and just be done.

@Aneurysm9 (Member) commented:

I've hammered the receiver's e2e tests into a much more manageable shape. I was going to make a PR to the source branch of this PR, but that repo doesn't seem to be accepting PRs. For now the changes can be viewed at https://github.com/Aneurysm9/opentelemetry-collector/tree/feat/add-internal-prom-metrics. @gillg can you pull in those changes to this branch so we can get this PR unstuck?

@gillg gillg requested a review from alolita as a code owner May 27, 2021 21:58

gillg commented May 27, 2021

Thank you a lot @Aneurysm9! Good job on the tests, I was losing my hair just reading them 😅

So, the last things remaining:

  • Solve the units question for the up metric ("bool"?) and the scrape_xxx metrics ("seconds" vs "s", and "count")
  • Add a logger usable in metricfamily to add some debugging logs for the future (see the sketch at the end of this comment)
  • Answer the //TODO convert it to OtelMetrics ? question (line 68)

After that we are probably OK to merge.
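
On the logger point above, here is a minimal sketch of one way it could be wired, assuming a zap.Logger handed down from the receiver; the type, constructor, and method names below are purely illustrative assumptions, not the receiver's actual API.

package internal

import "go.uber.org/zap"

// Sketch only: a small wrapper holding the logger that would be passed to the
// metric builder / metric family at construction time, so the commented-out
// debug lines in this PR can become real log calls.
type internalMetricsLogger struct {
	logger *zap.Logger
}

func newInternalMetricsLogger(logger *zap.Logger) *internalMetricsLogger {
	if logger == nil {
		// Fall back to a no-op logger so callers never have to nil-check.
		logger = zap.NewNop()
	}
	return &internalMetricsLogger{logger: logger}
}

func (l *internalMetricsLogger) debugInternalMetric(metricName string) {
	// Example of what the currently commented-out log lines could become.
	l.logger.Debug("handling internal scrape metric", zap.String("metric", metricName))
}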


@dashpole (Contributor) commented:

I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part...

@gillg gillg force-pushed the feat/add-internal-prom-metrics branch from 8481a50 to e2c4d6d Compare June 15, 2021 16:33

@gillg gillg left a comment


I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part...

Good job on the rebase! I updated my branch. I will take a look to see whether anything else needs to be changed.


gillg commented Jun 15, 2021

OK, without more changes it seems to work as before.
What introduces all the new things about the Prometheus / OTel conversions? I thought this would have an impact.

@dashpole (Contributor) commented:

They are introducing the change slowly over a few PRs. The ones they added aren't used yet. They may need to make some changes when they rebase on this change (assuming it merges).

@Aneurysm9 (Member) commented:

It appears that the test issues have been resolved and the current failure is a (potentially flaky) load test. Can @open-telemetry/collector-maintainers confirm this and land this PR?


gillg commented Jun 15, 2021

Is data dropped due to high memory usage common with load tests, or not?
I experienced this kind of thing pretty often with my custom OTel build. See https://github.com/open-telemetry/opentelemetry-collector/issues/3250
I think it's not related, but just in case.

@alolita alolita added the "ready-to-merge" label (Code review completed; ready to merge by maintainers) on Jun 17, 2021

alolita commented Jun 17, 2021

Thanks @gillg! All tests are finally passing. 🎉

@bogdandrutu can you please merge?

@alolita alolita added the "release:required-for-ga" label (Must be resolved before GA release) and removed the "duplicate" and "waiting-for-author" labels on Jun 17, 2021
@bogdandrutu bogdandrutu merged commit 329285d into open-telemetry:main Jun 21, 2021
odeke-em added a commit to orijtech/opentelemetry-collector-contrib that referenced this pull request Sep 13, 2021
This change updates internal code and is meant to alleviate the
massive PR #5184 which is our eventual end-goal. No need for tests
because the code path isn't used so we have a license to update
it towards the end goal.

Particularly it:
* adds open-telemetry/opentelemetry-collector#3116 to the otlp version.
* copies the sortPoints method over verbatim.
* sets DataType and Name on pdata metrics.

Updates PR #5184
Labels: ready-to-merge (Code review completed; ready to merge by maintainers), release:required-for-ga (Must be resolved before GA release)
6 participants