-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow configuring routing decision when no shard is selected #578
Conversation
Interesting. Curious, why not just always randomize the shard? Originally I don't remember having a reason for picking shard 0 as the default, maybe just for easier testing? But overall, we don't have a good reason for sticking to shard 0 when no shard is selected. Having just shard 0 creates a single point of failure and a hot spot for random queries that shouldn't be executing in the first place anyway, like the ruby-pg query. |
src/client.rs
Outdated
@@ -1010,14 +1010,14 @@ where | |||
// SET SHARD TO | |||
Some((Command::SetShard, _)) => { | |||
// Selected shard is not configured. | |||
if query_router.shard() >= pool.shards() { | |||
if query_router.shard().unwrap_or(0) >= pool.shards() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never be None with the current query router implementation but to be safe this should raise a client error if it is None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
src/pool.rs
Outdated
if shard >= self.shards() { | ||
return Err(Error::InvalidShardId(shard)); | ||
} else { | ||
self.addresses[shard].iter().collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also range check on addresses here. It panics with the comment based sharding if an invalid shard is passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly for when shard is higher than the addresses list. With a manually built query and the comment based sharding passing in a shard that's too high results in a panic instead of a client error.
src/pool.rs
Outdated
// random healthy routing mode | ||
if sort_by_error_count { | ||
candidates.sort_by(|a, b| { | ||
b.error_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking that I'm understanding the logic here for shards with errors. They will still be eligible but will be deprioritized right? So they can be cleared if a shard specific query comes in or if all other shards have problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That is exactly right.
@levkk I have seen the concept of the "default shard" come up in several documentation (including that of Rails) so it is not an outlandish idea to have a default shard and have that be 0. The reason I went with making this configurable is to avoid breaking this behavior for users. We will probably be using a |
I think configuration here makes sense. I worry that we're adding a lot of obscure configuration settings which somewhat deviates from the original intent of keeping the operation of pgcat as simple as possible, but maybe with all the different Postgres setups out there, that goal was not realistic. Overall, if we keep the default values good, I don't see any issues with adding more config options that may not be immediately useful to the typical user. I'll read through the PR in a bit, please feel free to merge and use it, if you need it asap. |
src/client.rs
Outdated
// Set the shard back to what it was. | ||
query_router.set_shard(current_shard); | ||
|
||
error_response( | ||
&mut self.write, | ||
&format!( | ||
"shard {} is more than configured {}, staying on shard {} (shard numbers start at 0)", | ||
"shard {:?} is more than configured {}, staying on shard {:?} (shard numbers start at 0)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't leak debug rendering into client responses. I'd prefer we unwrap here if that's the path we're taking with Option
.
src/pool.rs
Outdated
.iter() | ||
.filter(|address| address.role == role) | ||
.collect(); | ||
let mut effective_shard_id: Option<usize> = shard; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let effective_shard_id = if self.shards() == 1 {
Some(0)
} else {
shard
};
} | ||
} | ||
}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use .filter
instead of having a mutable reference. Mutable references require the code reader to double-check where this reference is mutated in the code, and often is a signal for potentially buggy code. In this case, the code is fine, and it triggered my spidey sense unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutable reference is not new
https://github.com/postgresml/pgcat/blob/main/src/pool.rs#L610-L613
We need it to be able to shuffle the candidate list
src/pool.rs
Outdated
sort_by_error_count = true; | ||
self.addresses.iter().flatten().collect() | ||
} | ||
NoShardSpecifiedHandling::Shard(shard) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the naming here. How can NoShardSpecifiedHandling
then specify the shard it wants the pool to connect to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I think about it is that the user can either
- Specify a shard they want to connect to using
SET ...
commands or using the comment or some other query parsing mechanism - They don't select a shard, in which case we need to define a default behavior, this could be either random, healthy_random, or some default shard.
I am open for suggestions for the name though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe default_shard
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed the config to default_shard
src/pool.rs
Outdated
|
||
// The branch should only be hit if no shard is specified and we are using | ||
// random healthy routing mode | ||
if sort_by_error_count { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be either shuffle or sort, not both. The reader will be confused whether this is intentional or not, and if it was, it wouldn't be obvious as to why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the code to make it clear what we are doing.
So, we shuffle first because we want to randomize the selected instance even if we end up resorting (e.g. if the sorting metric had the same value for all instances)
@@ -101,6 +104,9 @@ pub struct Address { | |||
|
|||
/// Address stats | |||
pub stats: Arc<AddressStats>, | |||
|
|||
/// Number of errors encountered since last successful checkout | |||
pub error_count: Arc<AtomicU64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we show this in admin? This would be a really cool metric to expose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be too difficult to expose it. It gets reset with every successful checkout so I am not sure how useful it would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Not so useful then, unless the replica is banned? Could be useful to see how many errors it took before it got successfully banned. Alternatively, we don't reset that number? If it's an i64, it can be incremented ...forever without worrying about overflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about not resetting it too but that meant errors that occurred 2 days ago can affect routing decisions today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
@@ -1234,7 +1241,7 @@ where | |||
{{ \ | |||
pool_name: {}, \ | |||
username: {}, \ | |||
shard: {}, \ | |||
shard: {:?}, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug output going to the client here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This an alternative implementation to #567 based on @jmeagher suggestions.
The TL;DR for the change is that we allow QueryRouter to set the active shard to
None
. This signals to thePool::get
method that we have no shard selected. Theget
method follows ano_shard_specified_behavior
config to know how to route the query.Original PR description
Ruby-pg library makes a startup query to
SET client_encoding to ...
ifEncoding.default_internal
value is set (Code). This query is troublesome because we cannot possibly attach a routing comment to it. PgCat, by default, will route that query to the default shard.Everything is fine until shard 0 has issues, Clients will all be attempting to send this query to shard0 which increases the connection latency significantly for all clients, even those not interested in shard0
This PR introduces
no_shard_specified_behavior
that defines the behavior in case we have routing-by-comment enabled but we get a query without a comment. The allowed behaviors areIn order to achieve this, this PR introduces an
error_count
on theAddress
Object that tracks the number of errors since the last checkout and uses that metric to sort shards by error count before making a routing decision.I didn't want to use address stats to avoid introducing a routing dependency on internal stats (We might do that in the future but I prefer to avoid this for the time being.
I also made changes to the test environment to replace Ruby's TOML reader library, It appears to be abandoned and does not support mixed arrays (which we use in the config toml), and it also does not play nicely with single-quoted regular expressions. I opted for using yj which is a CLI tool that can convert from toml to JSON and back. So I refactor the tests to use that library.