Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring routing decision when no shard is selected #578

Merged
merged 8 commits into from
Sep 11, 2023

Conversation

drdrsh
Copy link
Collaborator

@drdrsh drdrsh commented Sep 4, 2023

This an alternative implementation to #567 based on @jmeagher suggestions.

The TL;DR for the change is that we allow QueryRouter to set the active shard to None. This signals to the Pool::get method that we have no shard selected. The get method follows a no_shard_specified_behavior config to know how to route the query.

Original PR description
Ruby-pg library makes a startup query to SET client_encoding to ... if Encoding.default_internal value is set (Code). This query is troublesome because we cannot possibly attach a routing comment to it. PgCat, by default, will route that query to the default shard.

Everything is fine until shard 0 has issues, Clients will all be attempting to send this query to shard0 which increases the connection latency significantly for all clients, even those not interested in shard0

This PR introduces no_shard_specified_behavior that defines the behavior in case we have routing-by-comment enabled but we get a query without a comment. The allowed behaviors are

random: Picks a shard at random
random_healthy: Picks a shard at random favoring shards with the least number of recent connection/checkout errors
shard_<number>: e.g. shard_0, shard_4, etc. picks a specific shard, everytime

In order to achieve this, this PR introduces an error_count on the Address Object that tracks the number of errors since the last checkout and uses that metric to sort shards by error count before making a routing decision.
I didn't want to use address stats to avoid introducing a routing dependency on internal stats (We might do that in the future but I prefer to avoid this for the time being.

I also made changes to the test environment to replace Ruby's TOML reader library, It appears to be abandoned and does not support mixed arrays (which we use in the config toml), and it also does not play nicely with single-quoted regular expressions. I opted for using yj which is a CLI tool that can convert from toml to JSON and back. So I refactor the tests to use that library.

@drdrsh drdrsh marked this pull request as ready for review September 4, 2023 16:43
@drdrsh drdrsh requested a review from levkk September 4, 2023 16:43
@levkk
Copy link
Contributor

levkk commented Sep 4, 2023

Interesting. Curious, why not just always randomize the shard? Originally I don't remember having a reason for picking shard 0 as the default, maybe just for easier testing? But overall, we don't have a good reason for sticking to shard 0 when no shard is selected.

Having just shard 0 creates a single point of failure and a hot spot for random queries that shouldn't be executing in the first place anyway, like the ruby-pg query.

src/client.rs Outdated
@@ -1010,14 +1010,14 @@ where
// SET SHARD TO
Some((Command::SetShard, _)) => {
// Selected shard is not configured.
if query_router.shard() >= pool.shards() {
if query_router.shard().unwrap_or(0) >= pool.shards() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never be None with the current query router implementation but to be safe this should raise a client error if it is None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

src/pool.rs Outdated
if shard >= self.shards() {
return Err(Error::InvalidShardId(shard));
} else {
self.addresses[shard].iter().collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also range check on addresses here. It panics with the comment based sharding if an invalid shard is passed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like -1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly for when shard is higher than the addresses list. With a manually built query and the comment based sharding passing in a shard that's too high results in a panic instead of a client error.

src/pool.rs Outdated
// random healthy routing mode
if sort_by_error_count {
candidates.sort_by(|a, b| {
b.error_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking that I'm understanding the logic here for shards with errors. They will still be eligible but will be deprioritized right? So they can be cleared if a shard specific query comes in or if all other shards have problems.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is exactly right.

@drdrsh
Copy link
Collaborator Author

drdrsh commented Sep 4, 2023

@levkk I have seen the concept of the "default shard" come up in several documentation (including that of Rails) so it is not an outlandish idea to have a default shard and have that be 0.

The reason I went with making this configurable is to avoid breaking this behavior for users. We will probably be using a random routing scheme for unrouted queries and we will experiment with the random_healthy later in the future.

@levkk
Copy link
Contributor

levkk commented Sep 4, 2023

I think configuration here makes sense. I worry that we're adding a lot of obscure configuration settings which somewhat deviates from the original intent of keeping the operation of pgcat as simple as possible, but maybe with all the different Postgres setups out there, that goal was not realistic. Overall, if we keep the default values good, I don't see any issues with adding more config options that may not be immediately useful to the typical user.

I'll read through the PR in a bit, please feel free to merge and use it, if you need it asap.

src/client.rs Outdated
// Set the shard back to what it was.
query_router.set_shard(current_shard);

error_response(
&mut self.write,
&format!(
"shard {} is more than configured {}, staying on shard {} (shard numbers start at 0)",
"shard {:?} is more than configured {}, staying on shard {:?} (shard numbers start at 0)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't leak debug rendering into client responses. I'd prefer we unwrap here if that's the path we're taking with Option.

src/pool.rs Outdated
.iter()
.filter(|address| address.role == role)
.collect();
let mut effective_shard_id: Option<usize> = shard;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let effective_shard_id = if self.shards() == 1 {
    Some(0)
} else {
    shard
};

}
}
},
};
Copy link
Contributor

@levkk levkk Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use .filter instead of having a mutable reference. Mutable references require the code reader to double-check where this reference is mutated in the code, and often is a signal for potentially buggy code. In this case, the code is fine, and it triggered my spidey sense unnecessarily.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutable reference is not new
https://github.com/postgresml/pgcat/blob/main/src/pool.rs#L610-L613

We need it to be able to shuffle the candidate list

src/pool.rs Outdated
sort_by_error_count = true;
self.addresses.iter().flatten().collect()
}
NoShardSpecifiedHandling::Shard(shard) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by the naming here. How can NoShardSpecifiedHandling then specify the shard it wants the pool to connect to?

Copy link
Collaborator Author

@drdrsh drdrsh Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I think about it is that the user can either

  • Specify a shard they want to connect to using SET ... commands or using the comment or some other query parsing mechanism
  • They don't select a shard, in which case we need to define a default behavior, this could be either random, healthy_random, or some default shard.
    I am open for suggestions for the name though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe default_shard?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the config to default_shard

src/pool.rs Outdated

// The branch should only be hit if no shard is specified and we are using
// random healthy routing mode
if sort_by_error_count {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be either shuffle or sort, not both. The reader will be confused whether this is intentional or not, and if it was, it wouldn't be obvious as to why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the code to make it clear what we are doing.
So, we shuffle first because we want to randomize the selected instance even if we end up resorting (e.g. if the sorting metric had the same value for all instances)

@@ -101,6 +104,9 @@ pub struct Address {

/// Address stats
pub stats: Arc<AddressStats>,

/// Number of errors encountered since last successful checkout
pub error_count: Arc<AtomicU64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we show this in admin? This would be a really cool metric to expose.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be too difficult to expose it. It gets reset with every successful checkout so I am not sure how useful it would be.

Copy link
Contributor

@levkk levkk Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Not so useful then, unless the replica is banned? Could be useful to see how many errors it took before it got successfully banned. Alternatively, we don't reset that number? If it's an i64, it can be incremented ...forever without worrying about overflows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about not resetting it too but that meant errors that occurred 2 days ago can affect routing decisions today

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@@ -1234,7 +1241,7 @@ where
{{ \
pool_name: {}, \
username: {}, \
shard: {}, \
shard: {:?}, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug output going to the client here.

Copy link
Contributor

@levkk levkk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@drdrsh drdrsh merged commit 0b01d70 into main Sep 11, 2023
@levkk levkk deleted the mostafa_default-to-random branch September 11, 2023 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants