Health check delay #118


Merged · 10 commits · Aug 11, 2022
Conversation

@zainkabani zainkabani commented Aug 10, 2022

Currently, we introduce latency in transaction mode because we run a health check every time a connection is checked out of the pool.

This PR addresses the issue by only performing a health check if a certain amount of time has elapsed since the server last behaved normally. It also adds ban logic on the client side: if a client hits an error while communicating with a server connection served by the pool (which may not have passed a health check), the replica is banned for that shard.
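A minimal sketch of the idea (using names that appear later in this thread, like last_activity and healthcheck_delay; the surrounding pool and server types are assumptions, not the exact merged code):

    // On checkout: only health check the connection if the server has
    // been quiet for longer than the configured delay (milliseconds).
    let require_healthcheck =
        server.last_activity().elapsed().unwrap().as_millis() > healthcheck_delay;

    if require_healthcheck && server.query(";").await.is_err() {
        // Failed health check: ban the replica so new transactions are
        // routed elsewhere, then try another connection.
        self.ban(address, shard);
        continue;
    }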

src/pool.rs Outdated
.elapsed()
.unwrap()
.as_millis()
> 10000; // TODO: Make this configurable

Contributor Author:
need to make this a config
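For illustration only, a sketch of what wiring this into the config might look like; the field name healthcheck_delay, its placement in the General section, and the default value are assumptions, not the merged implementation:

    use serde::Deserialize;

    #[derive(Deserialize)]
    pub struct General {
        // ... other settings ...
        // How long (in milliseconds) a server may be idle before a
        // checkout triggers a health check.
        #[serde(default = "default_healthcheck_delay")]
        pub healthcheck_delay: u64,
    }

    fn default_healthcheck_delay() -> u64 {
        1_000 // placeholder default for this sketch
    }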

@levkk levkk (Contributor) Aug 10, 2022

You want to replace latest_successful_server_interaction_timestamp with last_healthcheck_timestamp. Also you can probably call it last_healthcheck because Rust is typed and we know it's a timestamp by looking at the variable type declaration (unlike in Ruby/Python).

src/server.rs Outdated
@@ -474,6 +482,8 @@ impl Server {
         // Clear the buffer for next query.
         self.buffer.clear();

+        self.latest_successful_server_interaction_timestamp = SystemTime::now();

Contributor:
Getting system time is relatively expensive (it's a syscall), so we should minimize those as much as possible. For example, pgbouncer caches the value of time for the duration of the loop.
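A runnable sketch of that caching pattern (not pgcat's actual loop): read the clock once per pass and reuse the value for every connection touched in that pass:

    use std::time::SystemTime;

    fn main() {
        let mut last_activity = vec![SystemTime::UNIX_EPOCH; 3];
        for _pass in 0..2 {
            // One syscall per loop iteration...
            let now = SystemTime::now();
            for ts in last_activity.iter_mut() {
                // ...instead of one SystemTime::now() per connection.
                *ts = now;
            }
        }
        println!("{:?}", last_activity);
    }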

src/server.rs Outdated
@@ -564,6 +574,11 @@ impl Server {
     pub fn process_id(&self) -> i32 {
         self.process_id
     }

+    // Get server's latest response timestamp
+    pub fn latest_successful_server_interaction_timestamp(&self) -> SystemTime {

Contributor:
That's a really long name, try last_health_check

@@ -377,7 +386,11 @@ impl ConnectionPool {
/// Ban an address (i.e. replica). It no longer will serve
/// traffic for any new transactions. Existing transactions on that replica
/// will finish successfully or error out to the clients.
-    pub fn ban(&self, address: &Address, shard: usize) {
+    pub fn ban(&self, address: &Address, shard: usize, process_id: i32) {

Contributor:
It's hard to explain why ban would need process_id until you read the function code. Maybe it's better to move the stats handling to the caller?
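For example (a sketch of the suggestion, not the merged code), ban() could keep only the banning concern, and the caller, which already has process_id in scope, could report the stats itself:

    pub fn ban(&self, address: &Address, shard: usize) {
        // mark (shard, address) as banned; no stats calls in here
    }

    // At the call site:
    self.stats.client_disconnecting(process_id, address.id);
    self.ban(address, shard);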

@zainkabani zainkabani commented Aug 10, 2022

Performance improvements:

Baseline (healthcheck logic entirely disabled):

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 2.447 ms
initial connection time = 11.395 ms
tps = 6538.979897 (without initial connection time)

Before PR:

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 4.453 ms
initial connection time = 18.360 ms
tps = 3593.118431 (without initial connection time)

After PR:

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 2.499 ms
initial connection time = 10.946 ms
tps = 6403.132002 (without initial connection time)

@zainkabani zainkabani changed the title from "Server check delay" to "Health check delay" on Aug 10, 2022
src/server.rs Outdated
-            Ok(_) => Ok(()),
+            Ok(_) => {
+                // Successfully sent to server
+                self.last_healthcheck = SystemTime::now();

Contributor:
This is not the last health check time, this is the last time you talked to the server. You want to save this time when you actually health check the server in pool.rs.

@drdrsh drdrsh (Collaborator) Aug 11, 2022

The variable name is misleading, but this is how pgbouncer does these health checks. If I'm reading it correctly, the check is based on the server's latest activity: if the server hasn't had recent activity, it gets checked; otherwise, we assume everything is good and go forward.

https://github.com/pgbouncer/pgbouncer/blob/5c8a7728f79cf37f5fb040f49b327dbe8908d912/include/bouncer.h#L416

https://github.com/pgbouncer/pgbouncer/blob/aafe242fb8fcab38ccd67194d13ea836412f2af8/src/janitor.c#L149-L155

So perhaps we can just rename this to last_activity?
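Restated as a sketch (assumed names, not pgbouncer's or pgcat's code), the rule is driven purely by how long the server has been idle:

    use std::time::SystemTime;

    // Only health check servers that have been idle longer than the
    // configured delay.
    fn needs_healthcheck(last_activity: SystemTime, delay_ms: u128) -> bool {
        match last_activity.elapsed() {
            Ok(idle) => idle.as_millis() > delay_ms,
            // The clock went backwards; err on the side of checking.
            Err(_) => true,
        }
    }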

@levkk levkk (Contributor) Aug 11, 2022

Good catch. Pgbouncer behavior is sometimes a mystery. It's good to understand what it actually does underneath all the configs.

Contributor Author:

Yeah, I originally named it latest_successful_server_interaction_timestamp, which admittedly is long, but changed it based on #118 (comment).

Contributor:

last_activity sounds good to me.

src/pool.rs Outdated

// elapsed() returns an error if the timestamp is ahead of the current system time, which should never happen
let require_healthcheck =
server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;

Contributor:
Suggested change:
-    server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;
+    server.last_activity().elapsed().unwrap().as_millis() > healthcheck_delay;

self.stats.client_disconnecting(process_id, address.id);
self.stats
.checkout_time(now.elapsed().as_micros(), process_id, address.id);
self.ban(address, shard, process_id);

Contributor:

Could you add a debug! a little bit above so we know that the health check is still being performed? It's nice to be able to trace execution when debugging. I tend to organize the log levels as follows (see the sketch after this list):

  • info: good to know under normal operations, e.g. client connected, client disconnected, etc.
  • warn: need to know, something might be wrong.
  • debug: good to know if I'm trying to understand what's really happening underneath.
  • trace: I need to see almost every interaction and byte going through because I think something is terribly wrong.
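As a made-up illustration of that convention, using the log crate's macros (the messages here are invented):

    use log::{debug, info, trace, warn};

    fn illustrate(process_id: i32, address_id: usize) {
        info!("Client {} connected", process_id);
        warn!("Banning replica {} after failed health check", address_id);
        debug!("Running health check on server {}", address_id);
        trace!("Forwarded {} bytes for client {}", 128, process_id);
    }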

Contributor Author:

So right above it there's already an error! log which says the replica is being banned. Does that work?

Contributor:

I was referring to the health check being issued, not the ban, but GitHub doesn't let me comment on random file lines, I think.

Contributor:

Now that we don't always health check, I think it's important to log when we do.

@levkk levkk (Contributor) left a comment

Awesome

@levkk levkk merged commit f963b12 into postgresml:main Aug 11, 2022
jmeagher pushed a commit to jmeagher/pgcat that referenced this pull request Feb 17, 2023
* Prevent clients from sticking to old pools after config update (postgresml#113)

* Re-acquire pool at the beginning of Protocol loop

* Fix query router + add tests for recycling behavior

* create a prometheus exporter on a standard http port (postgresml#107)

* create a hyper server and add option to enable it in config

* move prometheus stuff to its own file; update format

* create metric type and help lookup table

* finish the metric help type map

* switch to a boolean and a standard port

* dont emit unimplemented metrics

* fail if curl returns a non 200

* resolve conflicts

* move log out of config.show and into main

* terminating new line

* upgrade curl

* include unimplemented stats

* Validates pgcat is closed after shutdown python tests (postgresml#116)

* Validates pgcat is closed after shutdown python tests

* Fix pgrep logic

* Moves sigterm step to after cleanup to decouple

* Replace subprocess with os.system for running pgcat

* fix docker compose port allocation for local dev (postgresml#117)

change docker compose port to right prometheus port

* Update CONTRIBUTING.md

* Health check delay (postgresml#118)

* initial commit of server check delay implementation

* fmt

* spelling

* Update name to last_healthcheck and some comments

* Moved server tested stat to after require_healthcheck check

* Make health check delay configurable

* Rename to last_activity

* Fix typo

* Add debug log for healthcheck

* Add address to debug log

* Speed up CI a bit (postgresml#119)

* Sleep for 1s

* use premade image

* quicker

* revert shutdown timeout

* Fix debug log (postgresml#120)

* Make prometheus port configurable (postgresml#121)

* Make prometheus port configurable

* Update circleci config

* Statement timeout + replica imbalance fix (postgresml#122)

* Statement timeout

* send error message too

* Correct error messages

* Fix replica inbalance

* disable stmt timeout by default

* Redundant mark_bad

* revert healthcheck delay

* tests

* set it to 0

* reload config again

* pgcat toml

Co-authored-by: Nicholas Dujay <3258756+dat2@users.noreply.github.com>
Co-authored-by: zainkabani <77307340+zainkabani@users.noreply.github.com>
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
Co-authored-by: Pradeep Chhetri <30620077+chhetripradeep@users.noreply.github.com>