This repository has been archived by the owner on Dec 17, 2024. It is now read-only.

Bigger upstream merge batch #431

Merged

Conversation

kmoppel-cognite
Collaborator

Seems like a lot of stuff, but there should be no breaking changes. Main things added:

  • Some new "approximate size" metrics that skip the real file-system backed functions
  • Use approximate table / DB sizes automatically for bigger (1TB+) managed Azure Single Server instances due to FS lag
  • New --emergency-pause-triggerfile flag to make it possible to pause all metrics gathering outside of normal automation / deployment
  • New --no-helper-functions flag to ignore all metrics relying on 'get_smth()' functions and try the SU version instead. Tuned for managed instances.
  • New --try-create-listed-exts-if-missing flag to try to auto-create popular extensions outside of the helpers system, which is not applicable on public clouds
  • Better Prometheus mode cache and connection pool handling

kmoppel-cognite and others added 30 commits September 23, 2021 18:21
Gatherer: add a new --min-db-size-mb flag to ignore empty DBs
Some new views are more niche, so don't add them to the default configs though.
Expand only if the value starts with a "$" - otherwise a randomly generated
password containing a "$" in the middle can turn into an invalid password.
This still leaves a hole for cases where a randomly generated password has a
"$" in the 1st position, but it's still better than the current behaviour
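The guard described above is simple enough to sketch. A minimal Python illustration of the idea (pgwatch2 itself is written in Go; the function name and the env-map parameter here are hypothetical):

```python
import os

def expand_env_if_prefixed(value, env=None):
    """Expand a config value from the environment only when the WHOLE
    value starts with '$'. A password that merely contains '$' in the
    middle is returned untouched; a password that starts with '$' is
    the remaining hole mentioned above."""
    env = os.environ if env is None else env
    if value.startswith("$"):
        return env.get(value[1:], value)
    return value
```

An unresolvable reference (no such environment variable) falls back to the literal value rather than an empty string.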
other values. Live queries on dozens of hosts can still cause scrape timeouts
Also add --max-parallel-connections-per-db to make max connections
tunable
Also init connections from main loop only now and bundle setting of
statement_timeout with the real query
…onfig

Also don't check for DB size on scrape if size limiting flag not
enabled.
And downscale max. conns for dormant DBs
via PW2_MAX_PARALLEL_CONNECTIONS_PER_DB
Main idea of the feature is to be able to quickly free the monitored DBs / network of any extra "monitoring effect" load.
In highly automated K8s / IaC environments such a temporary change might involve pull requests, peer reviews, CI/CD etc.,
which can all take too long vs "exec -it pgwatch2-pod -- touch /tmp/pgwatch2-emergency-pause".
NB! After creating the file it can still take up to --servers-refresh-loop-seconds (2 min by default) for the change to take effect!
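The trigger-file mechanics are simple: the gatherer just checks for the file's existence once per refresh loop, which is also where the up-to-2-minute delay comes from. A minimal Python sketch (the real gatherer is Go; the function name here is hypothetical):

```python
import os

def gathering_paused(triggerfile):
    """Return True while the emergency-pause trigger file exists; the
    main loop would then skip all metric fetching. Since this is only
    checked once per servers-refresh loop, pausing is not instant."""
    return os.path.exists(triggerfile)
```

Deleting the file resumes gathering on the next refresh cycle, again subject to the same delay.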
Currently the startup can freeze for long periods on incorrect host info
or in case of real network problems / throttling
Skips metric definitions relying on helper functions. Makes working with
managed instances a bit more fluid, with fewer errors in the logs. Use the SU
(superuser) version of a metric immediately when available, and not only
after the 1st failed call.
Even dormant DBs should be checked "live" in standard (non-async) Prom
mode for "up" state
To put less stress on the monitoring system
This will help in detecting catalog bloat that can massively slow down
session startup
Also correct min. version for the SQL definition
To identify the origin of queries one might see in pg_stat_statements
These will be anyway dropped on Prom scraper side with the message
"Error on ingesting samples that are too old or are too far into the future"
…k_timeout

of 5 seconds. This allows setting longer statement timeouts without
worries
It seems that under some extreme workloads in connection-pooling mode
it was not guaranteed that "set stmt_timeout to X; select ... metrics;"
was executed on a single connection, so timeouts set by other sessions
became effective for the metric query.

Also reduce lock_timeout to 100 ms.
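The fix amounts to shipping the timeout settings and the metric query to the server as one batch, so a pooler cannot split them across connections. A rough Python sketch of building such a batch (function name hypothetical; the actual Go code may differ):

```python
def bundle_with_timeouts(metric_sql, stmt_timeout_ms, lock_timeout_ms=100):
    """Prepend the per-session timeout SETs to the metric query so that
    all three statements travel to the server as a single
    multi-statement string, i.e. on one and the same (pooled)
    connection."""
    return (
        "SET statement_timeout TO {}; "
        "SET lock_timeout TO {}; ".format(int(stmt_timeout_ms),
                                          int(lock_timeout_ms))
        + metric_sql
    )
```

Multi-statement strings are executed on one backend in a single round trip, which is what restores the guarantee that the timeouts apply to the query that follows them.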
For systems with a very slow FS we just use the approximate table size
based on relpages. NB! Can be very out of date if no Autovacuum /
Vacuum has run recently
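The approximation itself is just pg_class.relpages times the block size; a Python sketch of the arithmetic (the real metric does this in SQL against pg_class, and the function name here is hypothetical):

```python
def approx_table_size_bytes(relpages, block_size=8192):
    """Approximate a table's on-disk size from pg_class.relpages
    instead of the file-system backed pg_table_size(). relpages is only
    refreshed by VACUUM / ANALYZE / autovacuum, so the estimate can lag
    badly on tables that haven't been vacuumed recently."""
    return relpages * block_size
```

With PostgreSQL's default 8 kB block size, a table with 1280 relpages is estimated at 10 MiB.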
Worked only in sync mode and non-prom modes
Use approximation due to super slow FS access
To be used instead of 'db_size' on large Azure Single Server instances
No need for TX as sequential execution is guaranteed then
…esult gave 0 result

Currently we "lie" for up to 10 min if no new rows are returned
Main use case is to reduce the annoying 'No such extension
pg_stat_statements' errors when the extension is activated but just
not created. Also deliberately skipping superuser checks here; just try
to create.
mode

Removing ReadOnly flag from go_sql.TxOptions.
@pashagolub
Collaborator

What is the purpose of the /* pgwatch2_generated */ comments in the SQL files? To distinguish pgwatch2 queries from others during monitoring?

Collaborator

@pashagolub pashagolub left a comment


Kudos for a huge amount of work! Have these changes been tested somewhere in the field?

@kmoppel-cognite
Copy link
Collaborator Author

@pashagolub Yes, it's running at Cognite in production for 100+ DBs with no problems...at least so far :)

Collaborator

@pashagolub pashagolub left a comment


OK, my main concern was whether the code base is tested in production, so I'm fine to merge it and work on small issues in separate branches to make life easier for everyone. I don't think we should bloat this PR with more commits.

Thanks for the hard work, @kmoppel-cognite! 🤞

@kmoppel-cognite kmoppel-cognite merged commit 0b3c7e6 into cybertec-postgresql:master Jan 12, 2022
@kmoppel-cognite kmoppel-cognite deleted the cognite-integration branch January 12, 2022 14:04
@kmoppel-cognite kmoppel-cognite restored the cognite-integration branch January 12, 2022 14:40