Health check delay #118


Merged · 10 commits · Aug 11, 2022
Conversation

@zainkabani zainkabani commented Aug 10, 2022

Currently, we introduce latency in transaction mode because we run a health check every time a connection is checked out of the pool.

This PR addresses the issue by only performing a health check if a certain amount of time has elapsed since the server last behaved normally. It also adds ban logic on the client side: if a client hits an error while communicating with a server connection served by the pool (which may not have passed a health check), the replica is banned for that shard.
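A minimal sketch of the idea (using names that appear later in this thread, like last_activity and healthcheck_delay; the surrounding pool and server types are assumptions, not the exact merged code):

    // On checkout: only health check the connection if the server has
    // been quiet for longer than the configured delay (milliseconds).
    let require_healthcheck =
        server.last_activity().elapsed().unwrap().as_millis() > healthcheck_delay;

    if require_healthcheck && server.query(";").await.is_err() {
        // Failed health check: ban the replica so new transactions are
        // routed elsewhere, then try another connection.
        self.ban(address, shard);
        continue;
    }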

src/pool.rs Outdated
.elapsed()
.unwrap()
.as_millis()
> 10000; // TODO: Make this configurable

Contributor Author:
need to make this a config
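For illustration only, a sketch of what wiring this into the config might look like; the field name healthcheck_delay, its placement in the General section, and the default value are assumptions, not the merged implementation:

    use serde::Deserialize;

    #[derive(Deserialize)]
    pub struct General {
        // ... other settings ...
        // How long (in milliseconds) a server may be idle before a
        // checkout triggers a health check.
        #[serde(default = "default_healthcheck_delay")]
        pub healthcheck_delay: u64,
    }

    fn default_healthcheck_delay() -> u64 {
        1_000 // placeholder default for this sketch
    }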

@levkk levkk (Contributor) Aug 10, 2022

You want to replace latest_successful_server_interaction_timestamp with last_healthcheck_timestamp. Also you can probably call it last_healthcheck because Rust is typed and we know it's a timestamp by looking at the variable type declaration (unlike in Ruby/Python).

src/server.rs Outdated
@@ -474,6 +482,8 @@ impl Server {
         // Clear the buffer for next query.
         self.buffer.clear();

+        self.latest_successful_server_interaction_timestamp = SystemTime::now();

Contributor:
Getting system time is relatively expensive (it's a syscall), so we should minimize those as much as possible. For example, pgbouncer caches the value of time for the duration of the loop.
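A runnable sketch of that caching pattern (not pgcat's actual loop): read the clock once per pass and reuse the value for every connection touched in that pass:

    use std::time::SystemTime;

    fn main() {
        let mut last_activity = vec![SystemTime::UNIX_EPOCH; 3];
        for _pass in 0..2 {
            // One syscall per loop iteration...
            let now = SystemTime::now();
            for ts in last_activity.iter_mut() {
                // ...instead of one SystemTime::now() per connection.
                *ts = now;
            }
        }
        println!("{:?}", last_activity);
    }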

src/server.rs Outdated
@@ -564,6 +574,11 @@ impl Server {
     pub fn process_id(&self) -> i32 {
         self.process_id
     }

+    // Get server's latest response timestamp
+    pub fn latest_successful_server_interaction_timestamp(&self) -> SystemTime {

Contributor:
That's a really long name, try last_health_check

@@ -377,7 +386,11 @@ impl ConnectionPool {
/// Ban an address (i.e. replica). It no longer will serve
/// traffic for any new transactions. Existing transactions on that replica
/// will finish successfully or error out to the clients.
-    pub fn ban(&self, address: &Address, shard: usize) {
+    pub fn ban(&self, address: &Address, shard: usize, process_id: i32) {

Contributor:
It's hard to explain why ban would need process_id until you read the function code. Maybe it's better to move the stats handling to the caller?
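For example (a sketch of the suggestion, not the merged code), ban() could keep only the banning concern, and the caller, which already has process_id in scope, could report the stats itself:

    pub fn ban(&self, address: &Address, shard: usize) {
        // mark (shard, address) as banned; no stats calls in here
    }

    // At the call site:
    self.stats.client_disconnecting(process_id, address.id);
    self.ban(address, shard);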

@zainkabani zainkabani commented Aug 10, 2022

Performance improvements:

Baseline (healthcheck logic entirely disabled):

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 2.447 ms
initial connection time = 11.395 ms
tps = 6538.979897 (without initial connection time)

Before PR:

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 4.453 ms
initial connection time = 18.360 ms
tps = 3593.118431 (without initial connection time)

After PR:

$ pgbench -t 25000 -c 16 -j 2 -S --protocol extended
pgbench (14.4, server 14.2 (Debian 14.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: extended
number of clients: 16
number of threads: 2
number of transactions per client: 25000
number of transactions actually processed: 400000/400000
latency average = 2.499 ms
initial connection time = 10.946 ms
tps = 6403.132002 (without initial connection time)

@zainkabani zainkabani changed the title from "Server check delay" to "Health check delay" on Aug 10, 2022
src/server.rs Outdated
-            Ok(_) => Ok(()),
+            Ok(_) => {
+                // Successfully sent to server
+                self.last_healthcheck = SystemTime::now();

Contributor:
This is not the last health check time, this is the last time you talked to the server. You want to save this time when you actually health check the server in pool.rs.

@drdrsh drdrsh (Collaborator) Aug 11, 2022

The variable name is misleading, but this is how pgbouncer does these health checks. If I'm reading it correctly, the check is based on the server's latest activity: if the server hasn't had recent activity, it gets checked; otherwise, we assume everything is good and go forward.

https://github.com/pgbouncer/pgbouncer/blob/5c8a7728f79cf37f5fb040f49b327dbe8908d912/include/bouncer.h#L416

https://github.com/pgbouncer/pgbouncer/blob/aafe242fb8fcab38ccd67194d13ea836412f2af8/src/janitor.c#L149-L155

So perhaps we can just rename this to last_activity?
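Restated as a sketch (assumed names, not pgbouncer's or pgcat's code), the rule is driven purely by how long the server has been idle:

    use std::time::SystemTime;

    // Only health check servers that have been idle longer than the
    // configured delay.
    fn needs_healthcheck(last_activity: SystemTime, delay_ms: u128) -> bool {
        match last_activity.elapsed() {
            Ok(idle) => idle.as_millis() > delay_ms,
            // The clock went backwards; err on the side of checking.
            Err(_) => true,
        }
    }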

@levkk levkk (Contributor) Aug 11, 2022

Good catch. Pgbouncer behavior is sometimes a mystery. It's good to understand what it actually does underneath all the configs.

Contributor Author:

Yeah, I originally named it latest_successful_server_interaction_timestamp, which admittedly is long, but changed it based on #118 (comment).

Contributor:

last_activity sounds good to me.

src/pool.rs Outdated

// elapsed() returns an error if the timestamp is ahead of the current system time, which should never happen
let require_healthcheck =
server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;

Contributor:
Suggested change:
-    server.last_healthcheck().elapsed().unwrap().as_millis() > healthcheck_delay;
+    server.last_activity().elapsed().unwrap().as_millis() > healthcheck_delay;

self.stats.client_disconnecting(process_id, address.id);
self.stats
.checkout_time(now.elapsed().as_micros(), process_id, address.id);
self.ban(address, shard, process_id);

Contributor:

Could you add a debug! a little bit above so we know that the health check is still being performed? It's nice to be able to trace execution when debugging. I tend to organize the log levels as follows (see the sketch after this list):

  • info: good to know under normal operations, e.g. client connected, client disconnected, etc.
  • warn: need to know, something might be wrong.
  • debug: good to know if I'm trying to understand what's really happening underneath.
  • trace: I need to see almost every interaction and byte going through because I think something is terribly wrong.
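As a made-up illustration of that convention, using the log crate's macros (the messages here are invented):

    use log::{debug, info, trace, warn};

    fn illustrate(process_id: i32, address_id: usize) {
        info!("Client {} connected", process_id);
        warn!("Banning replica {} after failed health check", address_id);
        debug!("Running health check on server {}", address_id);
        trace!("Forwarded {} bytes for client {}", 128, process_id);
    }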

Contributor Author:

So right above it there's already an error! log which says the replica is being banned. Does that work?

Contributor:

I was referring to the health check being issued, not the ban, but GitHub doesn't let me comment on random file lines, I think.

Contributor:

Now that we don't always health check, I think it's important to log when we do.

@levkk levkk (Contributor) left a comment

Awesome

@levkk levkk merged commit f963b12 into postgresml:main Aug 11, 2022
jmeagher pushed a commit to jmeagher/pgcat that referenced this pull request Feb 17, 2023
* Prevent clients from sticking to old pools after config update (postgresml#113)

* Re-acquire pool at the beginning of Protocol loop

* Fix query router + add tests for recycling behavior

* create a prometheus exporter on a standard http port (postgresml#107)

* create a hyper server and add option to enable it in config

* move prometheus stuff to its own file; update format

* create metric type and help lookup table

* finish the metric help type map

* switch to a boolean and a standard port

* dont emit unimplemented metrics

* fail if curl returns a non 200

* resolve conflicts

* move log out of config.show and into main

* terminating new line

* upgrade curl

* include unimplemented stats

* Validates pgcat is closed after shutdown python tests (postgresml#116)

* Validates pgcat is closed after shutdown python tests

* Fix pgrep logic

* Moves sigterm step to after cleanup to decouple

* Replace subprocess with os.system for running pgcat

* fix docker compose port allocation for local dev (postgresml#117)

change docker compose port to right prometheus port

* Update CONTRIBUTING.md

* Health check delay (postgresml#118)

* initial commit of server check delay implementation

* fmt

* spelling

* Update name to last_healthcheck and some comments

* Moved server tested stat to after require_healthcheck check

* Make health check delay configurable

* Rename to last_activity

* Fix typo

* Add debug log for healthcheck

* Add address to debug log

* Speed up CI a bit (postgresml#119)

* Sleep for 1s

* use premade image

* quicker

* revert shutdown timeout

* Fix debug log (postgresml#120)

* Make prometheus port configurable (postgresml#121)

* Make prometheus port configurable

* Update circleci config

* Statement timeout + replica imbalance fix (postgresml#122)

* Statement timeout

* send error message too

* Correct error messages

* Fix replica inbalance

* disable stmt timeout by default

* Redundant mark_bad

* revert healthcheck delay

* tests

* set it to 0

* reload config again

* pgcat toml

Co-authored-by: Nicholas Dujay <3258756+dat2@users.noreply.github.com>
Co-authored-by: zainkabani <77307340+zainkabani@users.noreply.github.com>
Co-authored-by: Lev Kokotov <levkk@users.noreply.github.com>
Co-authored-by: Pradeep Chhetri <30620077+chhetripradeep@users.noreply.github.com>