Metric gaps #47

Open
AM-Dani opened this issue Mar 27, 2023 · 5 comments

Comments

AM-Dani commented Mar 27, 2023

Hello Team,

We are trying to extract metrics from various resources (Service Bus, storage accounts, and VMs) and we are seeing gaps several times a day in all of them. For some of these gaps we can clearly see that the problem is on Azure's side (QueryThrottledException), but for others we don't see anything in the log entries or in the exporter's metrics. Here is an example from today for the Service Bus:

[screenshot: gap in the Service Bus metrics]

azurerm_api_request_count (rate):
[screenshot]

azurerm_api_request_bucket (30s):
[screenshot]

With the following configuration:

endpoints:
  - interval: 1m
    path: /probe/metrics/resourcegraph
    port: metrics
    scrapeTimeout: 55s
    params:
      name:
        - 'azure-metric'
      template:
        - '{name}_{metric}_{aggregation}_{unit}'
      subscription:
        - '***************'
      resourceType:
        - 'Microsoft.ServiceBus/Namespaces'
      metric:
        - 'ActiveMessages'
        - 'DeadletteredMessages'
        - 'ScheduledMessages'
        - 'IncomingMessages'
        - 'OutgoingMessages'
      interval:
        - 'PT1M'
      timespan:
        - 'PT1M'
      aggregation:
        - 'average'
        - 'total'
      metricFilter:
        - EntityName eq '*'
      metricTop:
        - '500'

No failures were found in the metrics exporter logs, and we see the same problem when using '/probe/metrics/list' for other resources.
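
As a rough starting point for correlating the gaps, here is a minimal sketch of Prometheus alerting rules built only on the metric names visible in this issue; the job label value and the rendered ActiveMessages metric name are assumptions that would need adjusting to the actual setup:

groups:
  - name: azure-metrics-gaps
    rules:
      # Fires when the exporter is still being scraped but reports no Azure
      # Monitor API calls over 5 minutes (azurerm_api_request is the request
      # histogram shown in the screenshots above). The job label is assumed.
      - alert: AzureExporterNoApiRequests
        expr: sum(rate(azurerm_api_request_count{job="azure-metric"}[5m])) == 0
        for: 5m
      # Fires when the Service Bus metric disappears entirely. The metric name
      # is rendered from the template '{name}_{metric}_{aggregation}_{unit}'
      # configured above, so check the exact name the exporter actually exposes.
      - alert: ServiceBusMetricGap
        expr: absent_over_time(azure_metric_ActiveMessages_average_count[5m])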

Can you please help me with this?

mblaschke (Member)

Was the exporter restarted during that time?

AM-Dani (Author) commented Mar 29, 2023

No, it is very stable. The last restart was to test version 22.12.0-beta0; we wanted to check whether that version would solve the gaps. It didn't, but we kept using it.

jangaraj commented May 2, 2023

Do you also see the metric gaps in the Azure console?

cdavid commented Jun 24, 2023

I hit something similar in my usage of the library: sometimes the metrics are missing. I believe our timeout for Prometheus scraping (20 seconds) might be too short in cases where service discovery is needed.

@mblaschke - I was considering contributing some extra logging and/or some other way of understanding what happens under the hood (is service discovery slow, is the metrics fetching slow, etc. - maybe restricted to when --log.debug is set?). Before I do anything, do you have any thoughts, guidelines, or ideas regarding this area?

Thanks!

mblaschke (Member)

@cdavid
If the scrape exceeds the timeout duration, you can look up the metric scrape_duration_seconds in Prometheus. If it's at your limit, the scrape took too long.
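
For reference, a minimal sketch of that check as a Prometheus alerting rule, assuming a hypothetical job label and a threshold of roughly 90% of the configured scrapeTimeout (55s in the configuration above):

groups:
  - name: azure-exporter-scrape
    rules:
      # Fires when the scrape duration approaches the configured scrapeTimeout;
      # the job label value and the 50s threshold are assumptions to adjust.
      - alert: AzureExporterScrapeNearTimeout
        expr: scrape_duration_seconds{job="azure-metric"} > 50
        for: 10m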

With the latest version you can now switch to subscription-scoped metrics (path /probe/metrics), which requests all metrics for the subscription instead of for each resource.
This doesn't cover all use cases, but it reduces the API calls and is much faster.

So I suggest trying the subscription-scoped metrics first.
If that's not enough, you can still increase concurrency so more requests are triggered at the same time.
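
A sketch of what that switch could look like in the ServiceMonitor from this issue; only the path changes to the subscription-scoped /probe/metrics endpoint mentioned above, and the remaining query parameters are assumed to carry over (the subscription-scoped probe may not support all of them, so check the exporter's README for the exact set):

endpoints:
  - interval: 1m
    path: /probe/metrics   # subscription scope instead of /probe/metrics/resourcegraph
    port: metrics
    scrapeTimeout: 55s
    params:
      name: ['azure-metric']
      template: ['{name}_{metric}_{aggregation}_{unit}']
      subscription: ['***************']
      resourceType: ['Microsoft.ServiceBus/Namespaces']
      metric: ['ActiveMessages', 'DeadletteredMessages', 'ScheduledMessages']
      interval: ['PT1M']
      timespan: ['PT1M']
      aggregation: ['average', 'total']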
