OTel googlemanagedprometheus can't connect to Google Cloud Monitoring from Cloud Run #31374
Comments
Are you able to reproduce the issue with the sample application?
v0.87.0 had an issue with the exporter's logging: GoogleCloudPlatform/opentelemetry-operations-go#761, but that should be fixed in v0.89.0. Updating to a newer version might help. I don't think the exporter uses zap, so those logs are probably from a different component. I also see you are using the logging exporter. Do you see the metrics from your application being logged? If not, that might suggest problems getting the metrics to the collector, rather than problems exporting.
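(For illustration, a minimal sketch of a pipeline that fans metrics out to both the logging exporter and googlemanagedprometheus, so application metrics can be checked in the collector's own output; the receiver and verbosity settings are assumptions, not the reporter's actual config.)

```yaml
# Hypothetical sketch: send metrics to both the logging exporter (for
# debugging) and googlemanagedprometheus.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  logging:
    verbosity: detailed
  googlemanagedprometheus:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [logging, googlemanagedprometheus]
```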
@dashpole Thanks for the response! We see metrics being logged from the logging exporter, so I don't think it's an inherent issue with metrics getting to the collector. I actually noticed that when we do not have a minimum number of instances set up on Cloud Run, it collects metrics fine. But if we have a minimum number of instances set up in Cloud Run (autoscaling.knative.dev/minScale: "1"), then metrics are not exported to Google Cloud Monitoring properly and many are missed. Would you happen to know why, or if this has been seen before? Here's a screenshot from our Grafana dashboard querying GCM, where we group by the metric's resource.attributes.service.instance.id from Cloud Run. We were not using a minimum number of instances until 02/09, and I retried having no minimum instances again in the evening of 02/26, and it began collecting metrics fine again. I haven't tried reproducing this with the sample app yet.
cc @braydonk @ridwanmsharif in case you have seen any issues before with minScale. @matthewcyy, you mention you've seen the issue on v0.87 and v0.89. Did the issue appear when you upgraded from a previous version of the collector? Do you know what version that was? Or were you always on v0.87, and the minScale change triggered the issue? For scaling-related issues, I've sometimes found that removing batching and queueing can solve it (by making the pipeline synchronous). Are you able to try removing the batch processor and disabling the exporter's sending_queue?
@dashpole Yes, this issue was seen on both v0.87 and v0.89. The issue only appeared once we changed the minimum number of instances, but we weren't aware of that at the time, so we tried upgrading to v0.89, which didn't resolve the issue. Will try removing the batch processor as well and let you know; the config file should look like this then, correct?
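(For illustration, a minimal sketch of a metrics pipeline with the batch processor removed and the exporter's sending queue disabled; the otlp receiver and the overall layout are assumptions, not the actual config used here.)

```yaml
# Hypothetical sketch: no batch processor, and the exporter's queue disabled,
# so exports happen synchronously.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  googlemanagedprometheus:
    sending_queue:
      enabled: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []                  # batch processor removed
      exporters: [googlemanagedprometheus]
```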
We're receiving this error now when we have sending_queue disabled:
That's odd... and probably unrelated to queueing being enabled or disabled. Usually that error occurs when sending the same timeseries twice within a short period of time. Is that error transient, or are you still not seeing any metrics in Cloud Monitoring?
We made the config changes in our staging env, and that's also where the log is coming from. We're not seeing most metrics, and not seeing any histogram metrics in Cloud Monitoring on staging after these changes. We've seen a "One or more TimeSeries could not be written" error before in Cloud Monitoring, and we've been seeing the "Try enabling sending_queue" message several times recently after the changes.
"Try enabling sending_queue" is just appended to all error messages when you have it disabled. And to confirm, removing the batch processor didn't change anything? The only other thing that comes to mind is that the GMP exporter doesn't support exponential histograms, in case the application recently switched to sending those. IIRC, warnings would appear in your logs if that were the case, though.
Otherwise, it's hard to try and find the issue without a reproduction. If you can find a simple app that reproduces it, that would be helpful. The other approach we can take to debug is to use the file exporter to get the OTLP you are trying to send in JSON format. I can replay that in my own project to try and reproduce the issue.
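(For illustration, a sketch of how the file exporter might be wired in alongside googlemanagedprometheus to capture the OTLP data as JSON; the path and the otlp receiver are placeholders/assumptions.)

```yaml
# Hypothetical sketch: tee metrics to a JSON file so the payload can be
# inspected or replayed, while still exporting to Google Managed Prometheus.
exporters:
  file:
    path: /tmp/metrics.json           # placeholder path
  googlemanagedprometheus:

service:
  pipelines:
    metrics:
      receivers: [otlp]               # assumes the existing otlp receiver
      exporters: [file, googlemanagedprometheus]
```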
Ok, I'll first try reproducing the issue with the sample app and collector from this tutorial, since this is where we started with OTel: https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar. I plan to use a Python server similar to the Go server used in the tutorial, sending OTLP with the same config we're currently using, to see if the issue is reproducible.
@dashpole I am also seeing this. If Container A and Container B are sending the same histogram metric (e.g. duration with the same labels) to the collector, and from the collector it goes to GCP, does GCP discard one of those histogram metric values?
@AkselAllas if the resource attributes are also the same between container A and container B, they will collide and you will get that error (or other errors). To fix it, make sure you are using the GCP resource detector, and, if on k8s, use the k8sattributes processor, and set
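(One collector-side way to get distinguishing resource attributes, shown as a sketch rather than the exact advice given here, is the resourcedetection processor with the gcp detector; the detector list and pipeline layout are assumptions.)

```yaml
# Hypothetical sketch: detect GCP resource attributes so that different
# containers/instances produce distinct timeseries.
processors:
  resourcedetection:
    detectors: [env, gcp]
    timeout: 2s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resourcedetection]
      exporters: [googlemanagedprometheus]
```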
I'm on Cloud Run and I don't see a specific container id (instance id, not revision id) here: But if I add a UUID to the service.instance.id resource attribute, will I be fine?
Looks like that one doesn't have support for Cloud Run. Try using the detector from GoogleCloudPlatform, similar to this: https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/blob/main/samples/metrics/index.js#L38
Working with @matthewcyy on this issue. We updated the code to use the opentelemetry-resourcedetector-gcp Python library, and it seems to support Cloud Run, but we are still having the issue.
This error never goes away (until the Cloud Run instance is terminated); I see this error for the same metric every minute. But I see that the metric is created in Google Managed Prometheus and there is data at some intervals. Error:
When I query the metric with this instance ID, I can see my data in the metric explorer.
I downloaded the metric data as CSV from the metric explorer and see that it has data every 10 seconds. Is this expected? In our case, we got a question from a user and created a metric for it (just an example metric).
For me:
Didn't work / Worked
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity.
@iydoowii sorry this fell through the cracks. I added some documentation that should help explain these sorts of errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/googlemanagedprometheusexporter/README.md#troubleshooting. I suspect you have "colliding" timeseries, which is why you can see metrics in Cloud Monitoring but also get errors when exporting. Since it is Python, one possibility is that the conflicts are from different processes, which is a common issue. It might be similar to this example: https://cloud.google.com/trace/docs/setup/python-ot#config-otel
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
OTel Collector, googlemanagedprometheus exporter
What happened?
Description
With OTel images otel/opentelemetry-collector-contrib:0.89.0 and otel/opentelemetry-collector-contrib:0.87.0, I cannot see metrics being exported to Google Cloud Monitoring via googlemanagedprometheus. It was working fine for a few months until a couple weeks ago, but there haven't been any changes to the config or permissions.
Steps to Reproduce
Not very clear. There's some kind of regression, since the collector was exporting to Google Cloud Monitoring just fine before, but there weren't any changes to the collector config. It was originally on 0.87, and I tried updating to 0.89 since the k8s OTel collector was using 0.89 and not having these issues.
Expected Result
Metrics visible in Google Cloud Monitoring
Actual Result
Metrics are not visible in Google Cloud Monitoring. The collector is exporting through the logging exporter, though, and is starting and stopping healthily.
Collector version
Docker images 0.87 & 0.89
Environment information
Environment
Following these steps https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar#ship-code
OpenTelemetry Collector configuration
Log output
Previously, with image 0.87 and without logs: level: debug set, there were these logs:
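(For reference, the logs: level: debug setting mentioned above is the collector's own telemetry verbosity; a minimal sketch of how it is typically set, assuming the standard collector configuration layout.)

```yaml
# Hypothetical sketch: turn on debug-level logging for the collector's own
# telemetry output.
service:
  telemetry:
    logs:
      level: debug
```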