Metric gaps #47

Open
AM-Dani opened this issue Mar 27, 2023 · 5 comments

Comments

AM-Dani commented Mar 27, 2023

Hello Team,

We are trying to extract metrics from various resources (Service Bus, storage accounts, and VMs) and we are seeing gaps several times a day in all of them. For some of these gaps we can clearly see that the problem is on Azure's side (QueryThrottledException), but for others we don't see anything in the log entries or in the exporter's metrics. Here is an example from today for the Service Bus:

[screenshot: gap in the Service Bus metrics]

azurerm_api_request_count (rate):
[screenshot]

azurerm_api_request_bucket (30s):
[screenshot]

With the following configuration:

endpoints:
  - interval: 1m
    path: /probe/metrics/resourcegraph
    port: metrics
    scrapeTimeout: 55s
    params:
      name:
        - 'azure-metric'
      template:
        - '{name}_{metric}_{aggregation}_{unit}'
      subscription:
        - '***************'
      resourceType:
        - 'Microsoft.ServiceBus/Namespaces'
      metric:
        - 'ActiveMessages'
        - 'DeadletteredMessages'
        - 'ScheduledMessages'
        - 'IncomingMessages'
        - 'OutgoingMessages'
      interval:
        - 'PT1M'
      timespan:
        - 'PT1M'
      aggregation:
        - 'average'
        - 'total'
      metricFilter:
        - EntityName eq '*'
      metricTop:
        - '500'

No failures were found in the metrics exporter logs, and we see the same problem when using '/probe/metrics/list' for other resources.
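
As a rough starting point for correlating the gaps, here is a minimal sketch of Prometheus alerting rules built only on the metric names visible in this issue; the job label value and the rendered ActiveMessages metric name are assumptions that would need adjusting to the actual setup:

groups:
  - name: azure-metrics-gaps
    rules:
      # Fires when the exporter is still being scraped but reports no Azure
      # Monitor API calls over 5 minutes (azurerm_api_request is the request
      # histogram shown in the screenshots above). The job label is assumed.
      - alert: AzureExporterNoApiRequests
        expr: sum(rate(azurerm_api_request_count{job="azure-metric"}[5m])) == 0
        for: 5m
      # Fires when the Service Bus metric disappears entirely. The metric name
      # is rendered from the template '{name}_{metric}_{aggregation}_{unit}'
      # configured above, so check the exact name the exporter actually exposes.
      - alert: ServiceBusMetricGap
        expr: absent_over_time(azure_metric_ActiveMessages_average_count[5m])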

Can you please help me with this?

mblaschke (Member)

Was the exporter restarted during that time?

AM-Dani (Author) commented Mar 29, 2023

No, it is very stable. The last restart was to test version 22.12.0-beta0; we wanted to check whether that version would solve the gaps. It didn't, but we kept using it.

jangaraj commented May 2, 2023

Do you also see the metric gaps in the Azure console?

cdavid commented Jun 24, 2023

I hit something similar in my usage of the library: sometimes the metrics are missing. I believe our timeout for Prometheus scraping (20 seconds) might be too short in cases where service discovery is needed.

@mblaschke - I was considering contributing some extra logging and/or some other way of understanding what happens under the hood (is service discovery slow, is the metrics fetching slow, etc. - maybe restricted to when --log.debug is set?). Before I do anything, do you have any thoughts, guidelines, or ideas regarding this area?

Thanks!

mblaschke (Member)

@cdavid
If the scrape exceeds the timeout duration, you can look up the metric scrape_duration_seconds in Prometheus. If it's at your limit, the scrape took too long.
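
For reference, a minimal sketch of that check as a Prometheus alerting rule, assuming a hypothetical job label and a threshold of roughly 90% of the configured scrapeTimeout (55s in the configuration above):

groups:
  - name: azure-exporter-scrape
    rules:
      # Fires when the scrape duration approaches the configured scrapeTimeout;
      # the job label value and the 50s threshold are assumptions to adjust.
      - alert: AzureExporterScrapeNearTimeout
        expr: scrape_duration_seconds{job="azure-metric"} > 50
        for: 10m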

With the latest version you can now switch to subscription-scoped metrics (path /probe/metrics), which requests all metrics for the subscription instead of for each resource.
This doesn't cover all use cases, but it reduces the API calls and is much faster.

So I suggest trying the subscription-scoped metrics first.
If that's not enough, you can still increase concurrency so more requests are triggered at the same time.
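
A sketch of what that switch could look like in the ServiceMonitor from this issue; only the path changes to the subscription-scoped /probe/metrics endpoint mentioned above, and the remaining query parameters are assumed to carry over (the subscription-scoped probe may not support all of them, so check the exporter's README for the exact set):

endpoints:
  - interval: 1m
    path: /probe/metrics   # subscription scope instead of /probe/metrics/resourcegraph
    port: metrics
    scrapeTimeout: 55s
    params:
      name: ['azure-metric']
      template: ['{name}_{metric}_{aggregation}_{unit}']
      subscription: ['***************']
      resourceType: ['Microsoft.ServiceBus/Namespaces']
      metric: ['ActiveMessages', 'DeadletteredMessages', 'ScheduledMessages']
      interval: ['PT1M']
      timespan: ['PT1M']
      aggregation: ['average', 'total']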
