Health check delay #118
Conversation
src/pool.rs (Outdated)
.elapsed()
.unwrap()
.as_millis()
> 10000; // TODO: Make this configurable
need to make this a config
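For illustration only, a minimal sketch of what a configurable delay could look like, assuming a serde-deserialized general config section; the field name `healthcheck_delay` and its default value are assumptions here, not the merged API:

```rust
use serde::Deserialize;

// Hypothetical config section for illustration; the real config struct has
// many more fields and may use different names/defaults.
#[derive(Deserialize)]
pub struct General {
    /// Milliseconds a server may go without activity before a health
    /// check is required when the connection is checked out.
    #[serde(default = "default_healthcheck_delay")]
    pub healthcheck_delay: u64,
}

fn default_healthcheck_delay() -> u64 {
    30_000
}
```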
You want to replace latest_successful_server_interaction_timestamp with last_healthcheck_timestamp. Also, you can probably just call it last_healthcheck, because Rust is typed and we know it's a timestamp by looking at the variable's type declaration (unlike in Ruby/Python).
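As a small illustration of that point (the struct is abridged and other fields are omitted), the type annotation already tells a reader this is a point in time:

```rust
use std::time::SystemTime;

pub struct Server {
    // SystemTime already says "timestamp", so the short name stays unambiguous.
    last_healthcheck: SystemTime,
    // ...other fields omitted...
}
```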
src/server.rs (Outdated)
@@ -474,6 +482,8 @@ impl Server {
// Clear the buffer for next query.
self.buffer.clear();

self.latest_successful_server_interaction_timestamp = SystemTime::now();
Getting system time is relatively expensive (it's a syscall), so we should minimize those as much as possible. For example, pgbouncer caches the value of time for the duration of the loop.
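A minimal sketch of that idea with illustrative names (this is not pgbouncer's or pgcat's actual code): fetch the time once per pass and reuse it for every server touched in that pass.

```rust
use std::time::SystemTime;

struct Server {
    last_activity: SystemTime,
}

// One syscall per loop iteration instead of one per server interaction.
fn tick(servers: &mut [Server]) {
    let now = SystemTime::now(); // single syscall for the whole pass
    for server in servers.iter_mut() {
        // ...process whatever happened on this connection...
        server.last_activity = now; // reuse the cached value
    }
}
```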
src/server.rs (Outdated)
@@ -564,6 +574,11 @@ impl Server {
pub fn process_id(&self) -> i32 {
    self.process_id
}

// Get server's latest response timestamp
pub fn latest_successful_server_interaction_timestamp(&self) -> SystemTime {
That's a really long name, try last_health_check
@@ -377,7 +386,11 @@ impl ConnectionPool {
/// Ban an address (i.e. replica). It no longer will serve
/// traffic for any new transactions. Existing transactions on that replica
/// will finish successfully or error out to the clients.
- pub fn ban(&self, address: &Address, shard: usize) {
+ pub fn ban(&self, address: &Address, shard: usize, process_id: i32) {
It's hard to explain why ban would need process_id until you read the function code. Maybe it's better to move the stats handling to the caller?
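A hedged sketch of that suggestion, using stub types and a hypothetical stats hook standing in for the real ones: ban stays unaware of the client, and the caller records its own stats event before banning.

```rust
// Stub types for illustration; the real Address, Stats and ConnectionPool
// live elsewhere in the codebase and have more fields and methods.
struct Address { id: usize }
struct Stats;
impl Stats {
    // Hypothetical stats hook; stands in for whatever event the caller records.
    fn record_checkout_error(&self, _process_id: i32, _address_id: usize) {}
}
struct ConnectionPool { stats: Stats }

impl ConnectionPool {
    /// Ban an address (i.e. replica); no per-client bookkeeping inside.
    fn ban(&self, _address: &Address, _shard: usize) {
        // ...mark the replica as banned...
    }

    /// The caller handles its own stats, then bans the replica.
    fn on_checkout_failure(&self, address: &Address, shard: usize, process_id: i32) {
        self.stats.record_checkout_error(process_id, address.id);
        self.ban(address, shard);
    }
}
```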
Performance improvements (benchmark figures from the original PR not reproduced here):
* Baseline (healthcheck logic entirely disabled)
* Before PR
* After PR
src/server.rs (Outdated)
- Ok(_) => Ok(()),
+ Ok(_) => {
+     // Successfully sent to server
+     self.last_healthcheck = SystemTime::now();
This is not the last health check time, this is the last time you talked to the server. You want to save this time when you actually health check the server in pool.rs.
The variable name is misleading, but this is how pgbouncer does these health checks. If I am reading it correctly, it is based on the server's latest activity: if the server hasn't had recent activity, it gets checked; otherwise, we assume everything is good and move forward.
So perhaps we can just rename this to last_activity?
Good catch. Pgbouncer behavior is sometimes a mystery. It's good to understand what it actually does underneath all the configs.
Yeah, I originally named it latest_successful_server_interaction_timestamp, which admittedly is long, but changed it based on #118 (comment).
last_activity sounds good to me.
src/pool.rs (Outdated)
// Will return error if timestamp is greater than current system time, which it should never be set to
let require_healthcheck =
    server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;
Suggested change:
- server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;
+ server.last_activity().elapsed().unwrap().as_millis() > healthcheck_delay;
self.stats.client_disconnecting(process_id, address.id);
self.stats
    .checkout_time(now.elapsed().as_micros(), process_id, address.id);
self.ban(address, shard, process_id);
Could you add a debug! a little bit above so we know that the health check is still being performed? It's nice to trace the execution of things when debugging. I tend to organize the log levels as follows:
* info: good to know under normal operations, e.g. client connected, client disconnected, etc.
* warn: need to know, something might be wrong.
* debug: good to know if I'm trying to understand what's really happening underneath.
* trace: I need to see almost every interaction and byte going through because I think something is terribly wrong.
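A hedged sketch of the kind of debug! line being asked for here, using the log crate; the message text and the function/variable names are illustrative, not the exact log the PR ended up adding:

```rust
use log::debug;

// Illustrative only: log right before the probe query is issued, so it is
// visible at debug level whether a checkout triggered a health check.
fn maybe_health_check(server_address: &str, require_healthcheck: bool) {
    if require_healthcheck {
        debug!("Health checking server {}", server_address);
        // ...send the probe query to the server...
    }
    // Otherwise the activity threshold wasn't crossed; skip the probe.
}
```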
So right above it there's already an error! log which says the replica is being banned. Does that work?
I was referring to the health check being issued, not the ban, but GitHub doesn't let me comment on random file lines, I think.
Now that we don't always health check, I think it's important to log when we do.
Awesome
* Prevent clients from sticking to old pools after config update (postgresml#113)
* Re-acquire pool at the beginning of Protocol loop
* Fix query router + add tests for recycling behavior
* create a prometheus exporter on a standard http port (postgresml#107)
* create a hyper server and add option to enable it in config
* move prometheus stuff to its own file; update format
* create metric type and help lookup table
* finish the metric help type map
* switch to a boolean and a standard port
* dont emit unimplemented metrics
* fail if curl returns a non 200
* resolve conflicts
* move log out of config.show and into main
* terminating new line
* upgrade curl
* include unimplemented stats
* Validates pgcat is closed after shutdown python tests (postgresml#116)
* Validates pgcat is closed after shutdown python tests
* Fix pgrep logic
* Moves sigterm step to after cleanup to decouple
* Replace subprocess with os.system for running pgcat
* fix docker compose port allocation for local dev (postgresml#117): change docker compose port to right prometheus port
* Update CONTRIBUTING.md
* Health check delay (postgresml#118)
* initial commit of server check delay implementation
* fmt
* spelling
* Update name to last_healthcheck and some comments
* Moved server tested stat to after require_healthcheck check
* Make health check delay configurable
* Rename to last_activity
* Fix typo
* Add debug log for healthcheck
* Add address to debug log
* Speed up CI a bit (postgresml#119)
* Sleep for 1s
* use premade image
* quicker
* revert shutdown timeout
* Fix debug log (postgresml#120)
* Make prometheus port configurable (postgresml#121)
* Make prometheus port configurable
* Update circleci config
* Statement timeout + replica imbalance fix (postgresml#122)
* Statement timeout
* send error message too
* Correct error messages
* Fix replica inbalance
* disable stmt timeout by default
* Redundant mark_bad
* revert healthcheck delay
* tests
* set it to 0
* reload config again
* pgcat toml

Co-authored-by: Nicholas Dujay <3258756+dat2@users.noreply.github.com>
Co-authored-by: zainkabani <77307340+zainkabani@users.noreply.github.com>
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
Co-authored-by: Pradeep Chhetri <30620077+chhetripradeep@users.noreply.github.com>
Currently, we introduce latency in transaction mode because we run a health check every time a connection is checked out from the pool.
This PR addresses that by only performing a health check once a certain amount of time has elapsed since the last time the server behaved normally. It also adds replica banning logic on the client side for handling errors when communicating with the server connection served by the pool, since that connection may not have passed a health check.
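A condensed, hedged sketch of the checkout-path condition described above, using the names that came out of the review (last_activity, healthcheck_delay); it is not a verbatim excerpt from the PR:

```rust
use std::time::SystemTime;

/// Returns true when the server has been quiet longer than the configured
/// delay (in milliseconds), so a probe query should be issued on checkout.
fn needs_healthcheck(last_activity: SystemTime, healthcheck_delay_ms: u128) -> bool {
    // elapsed() only fails if the stored timestamp is in the future
    // (e.g. the clock moved backwards); treat that case as "check it".
    last_activity
        .elapsed()
        .map(|elapsed| elapsed.as_millis() > healthcheck_delay_ms)
        .unwrap_or(true)
}

fn main() {
    let last_activity = SystemTime::now();
    // With a 30s delay and activity just now, no health check is needed.
    assert!(!needs_healthcheck(last_activity, 30_000));
}
```

If the probe does run and fails, the pool then bans the replica and surfaces the error to the client, as described above.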