
Metrics only produced for few initial scrapes #163

Closed · jakob-reesalu opened this issue Nov 11, 2022 · 7 comments

Labels
bug Something isn't working

Comments

@jakob-reesalu commented Nov 11, 2022

Describe the bug
One of my collectors produces metrics for a few initial scrapes, maybe 4-5, then no more metrics are shown on the /metrics endpoint. The collector uses the query mentioned in this issue: #154 (comment). Since the query has now proved to work, I suspect some other issue is at play here.

To Reproduce

  1. Run the collector with the query from the issue above.

Expected behavior
Metrics produced every 6 hours according to min_interval (see config below). Even if the query fails to produce new metrics, surely the cached metrics should still show up on the /metrics endpoint, no?

Configuration
Prometheus config:

global:
  scrape_interval:     60s 
  evaluation_interval: 1m
  scrape_timeout: 15s

Sql-exporter config:
Note the min_interval of 6h, so collector queries are only run every 6 hours.

global:
  scrape_timeout: 1h
  scrape_timeout_offset: 500ms
  min_interval: 6h 
  max_connections: 3
  max_idle_connections: 3
  max_connection_lifetime: 5m
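
For context, each collector follows the usual sql_exporter collector file layout; a trimmed-down sketch is shown below (collector name, metric, and query are placeholders for illustration, not the real collector files used here):

# Illustrative collector file; names and query are placeholders
collector_name: table_counts
min_interval: 6h                  # per-collector override; otherwise the global min_interval applies
metrics:
  - metric_name: table_count
    type: gauge
    help: 'Number of rows in a monitored table.'
    key_labels: [database]
    values: [count]
    query: |
      SELECT DB_NAME() AS [database], COUNT(*) AS [count]
      FROM dbo.example_table;     -- placeholder table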

Not sure, but might the issue be that the Prometheus scrape_timeout: 15s cuts off the Sql-exporter scrape_timeout: 1h, so that any sql-exporter query running for longer than 15s fails? Do I perhaps need to manually set scrape_timeout: 1h on the sql-exporter job in Prometheus?
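
For reference, a minimal sketch of what such a job-level override might look like in prometheus.yml (job name, target port, and values are illustrative, not taken from the actual setup). Prometheus requires scrape_timeout to be no larger than scrape_interval, so both would need to be raised for this job:

scrape_configs:
  - job_name: 'sql_exporter'          # hypothetical job name
    scrape_interval: 1h               # raised so the 1h timeout below is allowed
    scrape_timeout: 1h                # matches the exporter-side scrape_timeout
    static_configs:
      - targets: ['localhost:9399']   # placeholder host:port for the exporter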

Additional information:
One of my sql-exporters would, after a day or so, not provide any metrics at all, only errors on the /metrics endpoint. It would say something like Error gathering metrics: [from Gatherer #1] for each collector, with "context deadline exceeded". Not sure if this is related or not.

I have been running sql-exporter as a Windows service, so I don't have any logs of these issues. I've now restarted the exporters from the command line with debug logging to monitor this further.

jakob-reesalu added the bug label Nov 11, 2022
@burningalchemist (Owner)

Hey @jakob-reesalu, I was pretty busy with my day job for the past two weeks. I'll take a look at this issue and will also follow up on the other one. 👍

@jakob-reesalu (Author)

@burningalchemist Ya no worries!

@jakob-reesalu (Author)

Update:
To get logging I've now run my sql-exporters from the command line. One weird thing I've seen when running this way is that everything seems to lock up until I press the up/down arrow (it might be that any key press works) in the command-line window.

For example: I refresh the /metrics page and it just gets stuck loading. Then I bring up the command window and press up or down arrow, and suddenly I see logging about collected metrics; at that point the /metrics page stops loading and presents the metrics.

So, coming back from the weekend I had logs showing "error gathering metrics" until I pressed the up/down arrows in the command window, after which the exporter logged that metrics were gathered:

     [Just lots of errors for most of the weekend]
...
ts=2022-11-14T05:58:17.379Z caller=klog.go:84 level=debug func=Infof msg="Error gathering metrics: 3 error(s) occurred:\n* [from Gatherer #1] [collector=\"cleanup_jobs\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"index_fragmentation_checks\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"table_counts\"] context deadline exceeded"
ts=2022-11-14T06:58:16.894Z caller=klog.go:84 level=debug func=Infof msg="Error gathering metrics: 3 error(s) occurred:\n* [from Gatherer #1] [collector=\"table_counts\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"index_fragmentation_checks\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"cleanup_jobs\"] context deadline exceeded"
ts=2022-11-14T07:58:16.410Z caller=klog.go:84 level=debug func=Infof msg="Error gathering metrics: 3 error(s) occurred:\n* [from Gatherer #1] [collector=\"table_counts\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"cleanup_jobs\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"index_fragmentation_checks\"] context deadline exceeded"




     [Here I press an up or down arrow; suddenly we get "Collecting fresh metrics...". Some errors at first, but then it all starts working.]




ts=2022-11-14T08:11:46.823Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"table_counts\"] Collecting fresh metrics: min_interval=21600.000s cache_age=234005.553s"
ts=2022-11-14T08:11:46.823Z caller=klog.go:84 level=debug func=Infof msg="Error gathering metrics: 2 error(s) occurred:\n* [from Gatherer #1] [collector=\"table_counts\", query=\"<table1>s_count\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"table_counts\", query=\"<table2>s_count\"] context deadline exceeded"
ts=2022-11-14T08:11:46.823Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"index_fragmentation_checks\"] Collecting fresh metrics: min_interval=3600.000s cache_age=230405.467s"
ts=2022-11-14T08:11:46.824Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"cleanup_jobs\"] Collecting fresh metrics: min_interval=21600.000s cache_age=234005.553s"
ts=2022-11-14T08:11:46.824Z caller=klog.go:84 level=debug func=Infof msg="Error gathering metrics: 2 error(s) occurred:\n* [from Gatherer #1] [collector=\"table_counts\"] context deadline exceeded\n* [from Gatherer #1] [collector=\"index_fragmentation_checks\", query=\"index_fragmentation_<table1>_<table2>_<table3>\"] context deadline exceeded"
ts=2022-11-14T08:13:20.163Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database count]\"collector=\"table_counts\", query=\"<table1>s_count\""
ts=2022-11-14T08:13:23.571Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database count]\"collector=\"table_counts\", query=\"<table2>s_count\""
ts=2022-11-14T08:15:46.414Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database table index avgFragmentationInPercent]\"collector=\"index_fragmentation_checks\", query=\"index_fragmentation_<table1>_<table2>_<table3>\""
ts=2022-11-14T08:15:57.153Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"table_counts\"] Returning cached metrics: min_interval=21600.000s cache_age=318.392s"
ts=2022-11-14T08:15:57.154Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"index_fragmentation_checks\"] Returning cached metrics: min_interval=3600.000s cache_age=318.392s"
ts=2022-11-14T08:15:57.250Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database cleanupPerformed]\"collector=\"cleanup_jobs\", query=\"refreshtoken_cleanup_job\""
ts=2022-11-14T08:15:57.305Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database cleanupPerformed]\"collector=\"cleanup_jobs\", query=\"<table1>s_cleanup_job\""
ts=2022-11-14T08:15:57.885Z caller=klog.go:55 level=debug func=Verbose.Infof msg="returned_columns=\"[database cleanupPerformed]\"collector=\"cleanup_jobs\", query=\"<table2>s_cleanup_job\""
ts=2022-11-14T08:15:57.885Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"cleanup_jobs\"] Returning cached metrics: min_interval=21600.000s cache_age=318.392s"
ts=2022-11-14T08:20:24.757Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"table_counts\"] Returning cached metrics: min_interval=21600.000s cache_age=1327.962s"
ts=2022-11-14T08:20:24.758Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"cleanup_jobs\"] Returning cached metrics: min_interval=21600.000s cache_age=1327.963s"
ts=2022-11-14T08:20:24.758Z caller=klog.go:55 level=debug func=Verbose.Infof msg="[collector=\"index_fragmentation_checks\"] Returning cached metrics: min_interval=3600.000s cache_age=1327.963s"

@jakob-reesalu (Author)

Actually! I'm running in PowerShell and googled this; it turns out it's due to a PowerShell console setting described in this answer: https://serverfault.com/questions/204150/sometimes-powershell-stops-sending-output-until-i-press-enter-why

So that's solved now, and it's not an Sql-Exporter issue :)

@burningalchemist (Owner) commented Aug 24, 2023

Hey @jakob-reesalu, did you eventually figure out the configuration?

I believe in the end it's related to the timeout configured on the Prometheus side. If the connection is cut after 15s, sql_exporter also cancels the request since there's no receiver to respond to.

I'm going to close the issue as stale/solved. Feel free to reopen it. 👍

Cheers!

@jakob-reesalu (Author) commented Aug 24, 2023

Hey man @burningalchemist!

I don't recall exactly what I did in the end. Unfortunately I haven't fully reached the "end", as we're still having issues with the exporter. =/ Right now we get one or maybe a few scrapes in a day, then Prometheus alerts that the exporter is down even though the service is still running on the DB machine. And even though the service is running, the /metrics page shows errors instead of the metrics it previously managed to get. Not sure if our issue relates to what you've added on the start page:

"Configuration
SQL Exporter is deployed alongside the DB server it collects metrics from. If both the exporter and the DB server are on the same host, they will share the same failure domain: they will usually be either both up and running or both down. When the database is unreachable, /metrics responds with HTTP code 500 Internal Server Error, causing Prometheus to record up=0 for that scrape. Only metrics defined by collectors are exported on the /metrics endpoint. SQL Exporter process metrics are exported at /sql_exporter_metrics."

The thing is that the database is up and running in our case, so it doesn't seem to be unreachable.

I'll see if I get time to come back to this in the future.

@burningalchemist (Owner)

@jakob-reesalu Got it, thanks! Yeah, please re-open once you have some time, and maybe we can go over it again; I'm happy to assist. In the meantime I'll think about it. There can be many factors at play when the query is long.
