This repository has been archived by the owner on Dec 17, 2024. It is now read-only.

Bigger upstream merge batch #431

Merged

Conversation

kmoppel-cognite
Collaborator

Seems like a lot of stuff, but there should be no breaking changes. Main things added:

  • Some new "approximate size" metrics that skip the real file-system backed functions
  • Use approximate table / DB sizes automatically for bigger (1TB+) managed Azure Single Server instances due to FS lag
  • New --emergency-pause-triggerfile flag to make it possible to pause all metrics gathering outside of normal automation / deployment
  • New --no-helper-functions flag to ignore all metrics relying on 'get_smth()' functions and try the SU version instead. Tuned for managed instances.
  • New --try-create-listed-exts-if-missing flag to try to auto-create popular extensions outside of the helpers system, which is not applicable on public clouds
  • Better Prometheus mode cache and connection pool handling

kmoppel-cognite and others added 30 commits September 23, 2021 18:21
Gatherer: add a new --min-db-size-mb flag to ignore empty DBs
Some new views are more niche, so don't add them to the default configs though.
Expand only if the value starts with a "$" - otherwise a randomly generated
password containing a "$" in the middle can turn into an invalid password.
This still leaves a hole for cases where a randomly generated password has a
"$" in the 1st position, but it's still better than the current behaviour
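The guard described above is simple enough to sketch. A minimal Python illustration of the idea (pgwatch2 itself is written in Go; the function name and the env-map parameter here are hypothetical):

```python
import os

def expand_env_if_prefixed(value, env=None):
    """Expand a config value from the environment only when the WHOLE
    value starts with '$'. A password that merely contains '$' in the
    middle is returned untouched; a password that starts with '$' is
    the remaining hole mentioned above."""
    env = os.environ if env is None else env
    if value.startswith("$"):
        return env.get(value[1:], value)
    return value
```

An unresolvable reference (no such environment variable) falls back to the literal value rather than an empty string.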
other values. Live queries on dozens of hosts can still cause scrape timeouts
Also add --max-parallel-connections-per-db to make max connections
tunable
Also init connections from main loop only now and bundle setting of
statement_timeout with the real query
…onfig

Also don't check for DB size on scrape if size limiting flag not
enabled.
And downscale max. conns for dormant DBs
via PW2_MAX_PARALLEL_CONNECTIONS_PER_DB
Main idea of the feature is to be able to quickly free the monitored DBs / network of any extra "monitoring effect" load.
In highly automated K8s / IaC environments such a temporary change might involve pull requests, peer reviews, CI/CD etc.,
which can all take too long vs "exec -it pgwatch2-pod -- touch /tmp/pgwatch2-emergency-pause".
NB! After creating the file it can still take up to --servers-refresh-loop-seconds (2 min by default) for the change to take effect!
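The trigger-file mechanics are simple: the gatherer just checks for the file's existence once per refresh loop, which is also where the up-to-2-minute delay comes from. A minimal Python sketch (the real gatherer is Go; the function name here is hypothetical):

```python
import os

def gathering_paused(triggerfile):
    """Return True while the emergency-pause trigger file exists; the
    main loop would then skip all metric fetching. Since this is only
    checked once per servers-refresh loop, pausing is not instant."""
    return os.path.exists(triggerfile)
```

Deleting the file resumes gathering on the next refresh cycle, again subject to the same delay.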
Currently the startup can freeze for long periods on incorrect host info
or in case of real network problems / throttling
Skips metric definitions relying on helper functions. Makes working with
managed instances a bit more fluid, with fewer errors in the logs. Use the SU
(superuser) version of a metric immediately when available, and not only
after the 1st failed call.
Even dormant DBs should be checked "live" in standard (non-async) Prom
mode for "up" state
To put less stress on the monitoring system
This will help in detecting catalog bloat that can massively slow down
session startup
Also correct min. version for the SQL definition
To identify the origin of queries one might see in pg_stat_statements
These will be anyway dropped on Prom scraper side with the message
"Error on ingesting samples that are too old or are too far into the future"
…k_timeout

of 5 seconds. This allows setting longer statement timeouts without
worries
It seems that under some extreme workloads in connection-pooling mode
it was not guaranteed that "set stmt_timeout to X; select ... metrics;"
was executed on a single connection, so timeouts set by other sessions
became effective for the metric query.

Also reduce lock_timeout to 100 ms.
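The fix amounts to shipping the timeout settings and the metric query to the server as one batch, so a pooler cannot split them across connections. A rough Python sketch of building such a batch (function name hypothetical; the actual Go code may differ):

```python
def bundle_with_timeouts(metric_sql, stmt_timeout_ms, lock_timeout_ms=100):
    """Prepend the per-session timeout SETs to the metric query so that
    all three statements travel to the server as a single
    multi-statement string, i.e. on one and the same (pooled)
    connection."""
    return (
        "SET statement_timeout TO {}; "
        "SET lock_timeout TO {}; ".format(int(stmt_timeout_ms),
                                          int(lock_timeout_ms))
        + metric_sql
    )
```

Multi-statement strings are executed on one backend in a single round trip, which is what restores the guarantee that the timeouts apply to the query that follows them.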
For systems with a very slow FS we just use the approximate table size
based on relpages. NB! Can be very out of date if no Autovacuum /
Vacuum has run recently
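The approximation itself is just pg_class.relpages times the block size; a Python sketch of the arithmetic (the real metric does this in SQL against pg_class, and the function name here is hypothetical):

```python
def approx_table_size_bytes(relpages, block_size=8192):
    """Approximate a table's on-disk size from pg_class.relpages
    instead of the file-system backed pg_table_size(). relpages is only
    refreshed by VACUUM / ANALYZE / autovacuum, so the estimate can lag
    badly on tables that haven't been vacuumed recently."""
    return relpages * block_size
```

With PostgreSQL's default 8 kB block size, a table with 1280 relpages is estimated at 10 MiB.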
Worked only in sync mode and non-prom modes
Use approximation due to super slow FS access
To be used instead of 'db_size' on large Azure Single Server instances
No need for TX as sequential execution is guaranteed then
…esult gave 0 result

Currently we "lie" for up to 10 min if no new rows are returned
Main use case is to reduce the annoying 'No such extension
pg_stat_statements' errors when the extension is activated but just
not created. Also deliberately skipping superuser checks here; just try
to create.
mode

Removing ReadOnly flag from go_sql.TxOptions.
@pashagolub
Collaborator

What is the purpose of the /* pgwatch2_generated */ comments in the SQL files? To distinguish pgwatch2 queries from others during monitoring?

Collaborator

@pashagolub pashagolub left a comment


Kudos for a huge amount of work! Have these changes been tested somewhere in the field?

@kmoppel-cognite
Copy link
Collaborator Author

@pashagolub Yes, it's running at Cognite in production for 100+ DBs with no problems...at least so far :)

Collaborator

@pashagolub pashagolub left a comment


OK, my main concern was whether the code base is tested in production, so I'm fine to merge it and work on small issues in separate branches to make life easier for everyone. I don't think we should bloat this PR with more commits.

Thanks for the hard work, @kmoppel-cognite! 🤞

@kmoppel-cognite kmoppel-cognite merged commit 0b3c7e6 into cybertec-postgresql:master Jan 12, 2022
@kmoppel-cognite kmoppel-cognite deleted the cognite-integration branch January 12, 2022 14:04
@kmoppel-cognite kmoppel-cognite restored the cognite-integration branch January 12, 2022 14:40