Allow configuring routing decision when no shard is selected #578

drdrsh · 2023-09-04T13:28:06Z

This an alternative implementation to #567 based on @jmeagher suggestions.

The TL;DR for the change is that we allow QueryRouter to set the active shard to None. This signals to the Pool::get method that we have no shard selected. The get method follows a no_shard_specified_behavior config to know how to route the query.

Original PR description
Ruby-pg library makes a startup query to SET client_encoding to ... if Encoding.default_internal value is set (Code). This query is troublesome because we cannot possibly attach a routing comment to it. PgCat, by default, will route that query to the default shard.

Everything is fine until shard 0 has issues, Clients will all be attempting to send this query to shard0 which increases the connection latency significantly for all clients, even those not interested in shard0

This PR introduces no_shard_specified_behavior that defines the behavior in case we have routing-by-comment enabled but we get a query without a comment. The allowed behaviors are

random: Picks a shard at random
random_healthy: Picks a shard at random favoring shards with the least number of recent connection/checkout errors
shard_<number>: e.g. shard_0, shard_4, etc. picks a specific shard, everytime

In order to achieve this, this PR introduces an error_count on the Address Object that tracks the number of errors since the last checkout and uses that metric to sort shards by error count before making a routing decision.
I didn't want to use address stats to avoid introducing a routing dependency on internal stats (We might do that in the future but I prefer to avoid this for the time being.

I also made changes to the test environment to replace Ruby's TOML reader library, It appears to be abandoned and does not support mixed arrays (which we use in the config toml), and it also does not play nicely with single-quoted regular expressions. I opted for using yj which is a CLI tool that can convert from toml to JSON and back. So I refactor the tests to use that library.

levkk · 2023-09-04T18:15:47Z

Interesting. Curious, why not just always randomize the shard? Originally I don't remember having a reason for picking shard 0 as the default, maybe just for easier testing? But overall, we don't have a good reason for sticking to shard 0 when no shard is selected.

Having just shard 0 creates a single point of failure and a hot spot for random queries that shouldn't be executing in the first place anyway, like the ruby-pg query.

jmeagher · 2023-09-04T18:20:40Z

src/client.rs

@@ -1010,14 +1010,14 @@ where
                // SET SHARD TO
                Some((Command::SetShard, _)) => {
                    // Selected shard is not configured.
-                    if query_router.shard() >= pool.shards() {
+                    if query_router.shard().unwrap_or(0) >= pool.shards() {


This should never be None with the current query router implementation but to be safe this should raise a client error if it is None.

jmeagher · 2023-09-04T18:23:40Z

src/pool.rs

+                    if shard >= self.shards() {
+                        return Err(Error::InvalidShardId(shard));
+                    } else {
+                        self.addresses[shard].iter().collect()


Should also range check on addresses here. It panics with the comment based sharding if an invalid shard is passed

Do you mean like -1?

Mostly for when shard is higher than the addresses list. With a manually built query and the comment based sharding passing in a shard that's too high results in a panic instead of a client error.

jmeagher · 2023-09-04T18:25:56Z

src/pool.rs

+        // random healthy routing mode
+        if sort_by_error_count {
+            candidates.sort_by(|a, b| {
+                b.error_count


Just checking that I'm understanding the logic here for shards with errors. They will still be eligible but will be deprioritized right? So they can be cleared if a shard specific query comes in or if all other shards have problems.

Yes. That is exactly right.

drdrsh · 2023-09-04T19:58:52Z

@levkk I have seen the concept of the "default shard" come up in several documentation (including that of Rails) so it is not an outlandish idea to have a default shard and have that be 0.

The reason I went with making this configurable is to avoid breaking this behavior for users. We will probably be using a random routing scheme for unrouted queries and we will experiment with the random_healthy later in the future.

levkk · 2023-09-04T20:32:24Z

I think configuration here makes sense. I worry that we're adding a lot of obscure configuration settings which somewhat deviates from the original intent of keeping the operation of pgcat as simple as possible, but maybe with all the different Postgres setups out there, that goal was not realistic. Overall, if we keep the default values good, I don't see any issues with adding more config options that may not be immediately useful to the typical user.

I'll read through the PR in a bit, please feel free to merge and use it, if you need it asap.

levkk · 2023-09-04T22:07:31Z

src/client.rs

                        // Set the shard back to what it was.
                        query_router.set_shard(current_shard);

                        error_response(
                            &mut self.write,
                            &format!(
-                                "shard {} is more than configured {}, staying on shard {} (shard numbers start at 0)",
+                                "shard {:?} is more than configured {}, staying on shard {:?} (shard numbers start at 0)",


We shouldn't leak debug rendering into client responses. I'd prefer we unwrap here if that's the path we're taking with Option.

levkk · 2023-09-04T22:09:11Z

src/pool.rs

-            .iter()
-            .filter(|address| address.role == role)
-            .collect();
+        let mut effective_shard_id: Option<usize> = shard;


let effective_shard_id = if self.shards() == 1 { Some(0) } else { shard };

levkk · 2023-09-04T22:10:31Z

src/pool.rs

+                    }
+                }
+            },
+        };


You can use .filter instead of having a mutable reference. Mutable references require the code reader to double-check where this reference is mutated in the code, and often is a signal for potentially buggy code. In this case, the code is fine, and it triggered my spidey sense unnecessarily.

The mutable reference is not new
https://github.com/postgresml/pgcat/blob/main/src/pool.rs#L610-L613

We need it to be able to shuffle the candidate list

levkk · 2023-09-04T22:12:14Z

src/pool.rs

+                    sort_by_error_count = true;
+                    self.addresses.iter().flatten().collect()
+                }
+                NoShardSpecifiedHandling::Shard(shard) => {


I'm a bit confused by the naming here. How can NoShardSpecifiedHandling then specify the shard it wants the pool to connect to?

The way I think about it is that the user can either

Specify a shard they want to connect to using SET ... commands or using the comment or some other query parsing mechanism

They don't select a shard, in which case we need to define a default behavior, this could be either random, healthy_random, or some default shard.
I am open for suggestions for the name though

Maybe default_shard?

I renamed the config to default_shard

levkk · 2023-09-04T22:13:31Z

src/pool.rs

+
+        // The branch should only be hit if no shard is specified and we are using
+        // random healthy routing mode
+        if sort_by_error_count {


Should be either shuffle or sort, not both. The reader will be confused whether this is intentional or not, and if it was, it wouldn't be obvious as to why.

I refactored the code to make it clear what we are doing.
So, we shuffle first because we want to randomize the selected instance even if we end up resorting (e.g. if the sorting metric had the same value for all instances)

levkk · 2023-09-04T22:14:42Z

src/config.rs

@@ -101,6 +104,9 @@ pub struct Address {

    /// Address stats
    pub stats: Arc<AddressStats>,
+
+    /// Number of errors encountered since last successful checkout
+    pub error_count: Arc<AtomicU64>,


Can we show this in admin? This would be a really cool metric to expose.

Shouldn't be too difficult to expose it. It gets reset with every successful checkout so I am not sure how useful it would be.

Ah I see. Not so useful then, unless the replica is banned? Could be useful to see how many errors it took before it got successfully banned. Alternatively, we don't reset that number? If it's an i64, it can be incremented ...forever without worrying about overflows.

I thought about not resetting it too but that meant errors that occurred 2 days ago can affect routing decisions today

Good point.

levkk · 2023-09-11T16:34:27Z

src/client.rs

@@ -1234,7 +1241,7 @@ where
                                    {{ \
                                        pool_name: {}, \
                                        username: {}, \
-                                        shard: {}, \
+                                        shard: {:?}, \


Debug output going to the client here.

levkk

LGTM!

drdrsh added 7 commits September 4, 2023 08:27

Allow configuring routing decision when no shard is selected

75836da

fmt + comment

425a216

use a tagged image

7f69133

use pgml ci image

b48c08e

redisable

3ab8424

add config.md

f6ae499

revert

39674bc

drdrsh marked this pull request as ready for review September 4, 2023 16:43

drdrsh requested a review from levkk September 4, 2023 16:43

jmeagher reviewed Sep 4, 2023

View reviewed changes

levkk reviewed Sep 4, 2023

View reviewed changes

clean up code a bit

433bf6e

levkk reviewed Sep 11, 2023

View reviewed changes

levkk approved these changes Sep 11, 2023

View reviewed changes

drdrsh merged commit 0b01d70 into main Sep 11, 2023

levkk deleted the mostafa_default-to-random branch September 11, 2023 20:31

nothrow mentioned this pull request Nov 7, 2024

Wrong configuration key for no_shard_specified_behavior in default pgcat.toml #848

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow configuring routing decision when no shard is selected #578

Allow configuring routing decision when no shard is selected #578

drdrsh commented Sep 4, 2023 •

edited

Loading

levkk commented Sep 4, 2023 •

edited

Loading

jmeagher Sep 4, 2023

levkk Sep 4, 2023

jmeagher Sep 4, 2023

drdrsh Sep 6, 2023

jmeagher Sep 6, 2023

jmeagher Sep 4, 2023

drdrsh Sep 4, 2023

drdrsh commented Sep 4, 2023

levkk commented Sep 4, 2023

levkk Sep 4, 2023

levkk Sep 4, 2023

levkk Sep 4, 2023 •

edited

Loading

drdrsh Sep 7, 2023

levkk Sep 4, 2023

drdrsh Sep 7, 2023 •

edited

Loading

drdrsh Sep 7, 2023

drdrsh Sep 7, 2023

levkk Sep 4, 2023

drdrsh Sep 7, 2023

levkk Sep 4, 2023

drdrsh Sep 7, 2023

levkk Sep 8, 2023 •

edited

Loading

drdrsh Sep 8, 2023

levkk Sep 8, 2023

levkk Sep 11, 2023

levkk left a comment

Allow configuring routing decision when no shard is selected #578

Allow configuring routing decision when no shard is selected #578

Conversation

drdrsh commented Sep 4, 2023 • edited Loading

levkk commented Sep 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drdrsh commented Sep 4, 2023

levkk commented Sep 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

levkk Sep 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drdrsh Sep 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

levkk Sep 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

levkk left a comment

Choose a reason for hiding this comment

drdrsh commented Sep 4, 2023 •

edited

Loading

levkk commented Sep 4, 2023 •

edited

Loading

levkk Sep 4, 2023 •

edited

Loading

drdrsh Sep 7, 2023 •

edited

Loading

levkk Sep 8, 2023 •

edited

Loading