-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"wake up" internal prometheus scrapper metrics (up / scrape_xxxx) #3116
"wake up" internal prometheus scrapper metrics (up / scrape_xxxx) #3116
Conversation
@odeke-em I was no able to contribute on your PR, and because the approach is very different I prefered create a new one to discuss with maintainers about it. |
Metrics at exporter side after my second commit :
used scrape config : scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 5s
static_configs:
- targets: ['otel-collector:8888']
relabel_configs:
# Trick because otel collector not expose the job and to avoid "honor_labels" at prometheus side
- action: replace
replacement: otel-collector
target_label: otel_job
- job_name: thanos-compactor
static_configs:
- targets: ['172.17.0.1:10942']
relabel_configs:
# Trick because otel collector not expose the job and to avoid "honor_labels" at prometheus side
- action: replace
replacement: thanos-compactor
target_label: otel_job
- job_name: grafana
static_configs:
- targets: ['172.17.0.1:3000']
relabel_configs:
# Trick because otel collector not expose the job and to avoid "honor_labels" at prometheus side
- action: replace
replacement: grafana
target_label: otel_job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the approach I think we should take. Just clean up some of the extra debug code you added.
Ok, in fact there is no "useless" code. But I would uncomment my comments to enable a logger at these points. I take any help to instanciate a logger |
Fixing tests become a nightmare... I need help on receiver/prometheusreceiver/metrics_receiver_test.go ! 🆘 🙏 😅 |
There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are too many tests to fix there, @gillg. Been there, done that. Let's help you if this is the way to go.
Thanks @rakyll !
It definitely works perfectly for now, I fixed a Prometheus exporter test, and some other internals, but I'm a little bit confused on the current logic tests. It's complicated because new introduced metrics should be not visible in a metrics fake documents, but they count in internal metrics. So I have the feeling that we should change a little the method doCompare but I'm not sure.
I also need a little help to instance a logger and uncomment my log lines.
@@ -133,6 +132,42 @@ func (b *metricBuilder) AddDataPoint(ls labels.Labels, t int64, v float64) error | |||
return b.currentMf.Add(metricName, ls, t, v) | |||
} | |||
|
|||
func (b *metricBuilder) defineInternalMetric(metricName string) { | |||
metadata, ok := b.mc.Metadata(metricName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact Internal metrics have empty metadata, but they are recorded correctly.
A simple solution is to provide manual metadata. Even if we uncomment return nil above, the will be dropped later because they match with "unspecified" type here : https://github.com/open-telemetry/opentelemetry-collector/blob/main/receiver/prometheusreceiver/internal/metricsbuilder.go#L244
I made this approach working perfectly when I rewrited metadata into newMetricFamiliy constructor, but I try here to fix all internal metrics before send them to metric family.
Here... Probably due to a change without reference to the metadata object, metadata is always empty when it comes to metricfamily.
} else if !ok && isInternalMetric(metricName) { | ||
metadata = defineInternalMetric(metricName, metadata) | ||
} | ||
//TODO convert it to OtelMetrics ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we do here ?
`test_scrape_series_added 13`, | ||
`. HELP test_up The scraping was sucessful`, | ||
`. TYPE test_up gauge`, | ||
`test_up 1`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicitely, prometheus exporter expose new metrics by default (and it makes sense, else any prometheus server on top of it is unable to know if a "sub-job" fails or not).
|
||
switch metricName { | ||
case scrapeUpMetricName: | ||
metadata.Unit = "bool" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is There a convention about that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Units appear to be specific to OpenMetrics, rather than prometheus text format. See prometheus/prometheus/pkg/textparse/promparse.go#L195. For OpenMetrics, here is the documentation for units: OpenObservability/OpenMetrics/specification/OpenMetrics.md#units-and-base-units. The units we should stick to include (from the OpenMetrics link): seconds, bytes, joules, grams, meters, ratios, volts, amperes, and celsius
. The "up" metric probably shouldn't have units in that case.
metadata.Type = textparse.MetricTypeGauge | ||
metadata.Help = "The approximate number of new series in this scrape" | ||
case "scrape_samples_post_metric_relabeling": | ||
metadata.Unit = "count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is There a convention about that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no unit here. See the above comment.
metadata.Type = textparse.MetricTypeGauge | ||
metadata.Help = "The scraping was sucessful" | ||
case "scrape_duration_seconds": | ||
metadata.Unit = "seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is There a convention about that ? I saw in test scenarios some s
instead of full word seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the above. We should use "seconds" here.
cc @odeke-em |
Discussed at the prometheus wg meeting today. One request is to keep this PR narrowly tailored to the issue at hand. Can we only add the "up" metric for now, and defer on the other ones? We weren't able to reach consensus as to whether the other metrics should be added during the meeting. Until open-telemetry/wg-prometheus#52 is resolved, we should only add the 'up' metric. |
Thank you for these informations ! I don't understand why not implement all native metrics but yes it's really simple to only fix "up". We just need to remove unwanted Today the main problem is to fix test cases, and help me to instanciate a logger where it's usefull (I commited commented lines with a fake logger). |
@rakyll @Aneurysm9 @dashpole as Prometheus experts please review and help this person. |
Thank you @bogdandrutu (and all the team 😅 ) say me if you need access to my fork, or just guide me on comments. I can do it by myself but I don't know the good approach to inject the logger cleanly. About the metrics, I remove other cases than "up" and I prepare a new PR in parallel for "scrape_xxx" metrics ? |
I also think this is the correct approach to take. It fully resolves the compliance test issues related to the missing I managed to successfully wrangle these tests when fixing the receiver's start time adjustment logic and expect it will require similar changes (hence the current conflict with the main branch since that PR was merged). I've blocked off some time tomorrow to try to work on making them less awful. As for whether it makes sense to only handle |
I'm also OK with that. I figured it would be easier to have to only wrangle the tests for one metric, but if it isn't much more effort, then we can do them all and just be done. |
I've hammered the receiver's e2e tests into a much more manageable shape. I was going to make a PR to the source branch of this PR, but that repo doesn't seem to be accepting PRs. For now the changes can be viewed at https://github.com/Aneurysm9/opentelemetry-collector/tree/feat/add-internal-prom-metrics. @gillg can you pull in those changes to this branch so we can get this PR unstuck? |
Thank you a lot @Aneurysm9 ! Good job around the tests, I lose my hairs just reading them 😅 So last things remaining :
After we are probably ok to merge |
|
||
switch metricName { | ||
case scrapeUpMetricName: | ||
metadata.Unit = "bool" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Units appear to be specific to OpenMetrics, rather than prometheus text format. See prometheus/prometheus/pkg/textparse/promparse.go#L195. For OpenMetrics, here is the documentation for units: OpenObservability/OpenMetrics/specification/OpenMetrics.md#units-and-base-units. The units we should stick to include (from the OpenMetrics link): seconds, bytes, joules, grams, meters, ratios, volts, amperes, and celsius
. The "up" metric probably shouldn't have units in that case.
metadata.Type = textparse.MetricTypeGauge | ||
metadata.Help = "The scraping was sucessful" | ||
case "scrape_duration_seconds": | ||
metadata.Unit = "seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the above. We should use "seconds" here.
metadata.Type = textparse.MetricTypeGauge | ||
metadata.Help = "The approximate number of new series in this scrape" | ||
case "scrape_samples_post_metric_relabeling": | ||
metadata.Unit = "count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no unit here. See the above comment.
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part... |
8481a50
to
e2c4d6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to rebase this, and get tests working: main...dashpole:internal_metrics_second. The additional changes I had to make are in the last commit, but the rebase was the harder part...
Good job on rebase.... ! I updated my branch. I will take a look if something else needs to be changed.
Ok, without more changes it seems work as before. |
They are introducing the change slowly over a few PRs. The ones they added aren't used yet. They may need to make some changes when they rebase on this change (assuming it merges). |
It appears that the test issues have been resolved and the current failure is a (potentially flaky) load test. Can @open-telemetry/collector-maintainers confirm this and land this PR? |
Data dropped due to high memory usage is common with load tests or not ? |
Thanks @gillg All tests are passing finally. 🎉 @bogdandrutu can you please merge. |
This change updates internal code and is meant to alleviate the massive PR #5184 which is our eventual end-goal. No need for tests because the code path isn't used so we have a license to update it towards the end goal. Particularly it: * adds open-telemetry/opentelemetry-collector#3116 to the otlp version. * copies the sortPoints method over verbatim. * sets DataType and Name on pdata metrics. Updates PR #5184
Description:
This duplicates the target of #2918 but in a completly different approach.
It solves #3089 at least but also other related bugs about prometheus autogenerated metrics by its scrapper : https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
Link to tracking Issue:
#3089
Resolves :
open-telemetry/wg-prometheus#8
Maybe open-telemetry/wg-prometheus#41 ?
Testing:
No new tests because we stay standard with other metrics.
I just "wake up" dormant metrics without metadata.
Documentation:
Nothing more, new metrics up and scrape_xxxx will be available internaly, and so at the exporter side.