Liveness: Client trusts all received pongs #7
Comments
I can confirm that only responding to client pings in normal state solves this issue. Patch:

diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index 702e376..2ae5884 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -488,6 +488,14 @@ pub fn Replica(
             if (message.header.client > 0) {
                 assert(message.header.replica == 0);
+                // Clients implicitly trust pong responses by all replicas.
+                // This may cause clients to learn a premature view from a
+                // replica doing an unsuccessful view change, denying them the
+                // ability to send requests to the main cluster. By only
+                // replying to client's ping messages when our state is normal
+                // we prevent the client learning premature view numbers.
+                if (self.status == .view_change) return;
+
                 self.send_header_to_client(message.header.client, pong);
             } else if (message.header.replica == self.replica) {
                 log.warn("{}: on_ping: ignoring (self)", .{self.replica});
Pong responses from unsuccessful view changes may cause the client to prematurely learn a view number, and prevent it from participating in a stable cluster. Fixes: tigerbeetle#7
We must only ever send our view number to a client via a pong message if we are in normal status. Otherwise, we may be partitioned from the cluster with a newer view number and leak this view to the client, which would then pass it to the cluster in subsequent requests; the cluster would ignore these requests with their newer view number, locking out the client. The principle here is that we must never send view numbers for views that have not yet started.
Reported-by: @ThreeFx
Refs: tigerbeetle/viewstamped-replication-made-famous#7
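To make the lockout concrete, here is a minimal, self-contained sketch of the behaviour described above. This is not the actual replica code: the Header fields, the on_request handler, and the view numbers are simplified, hypothetical stand-ins.

const std = @import("std");

// Hypothetical, stripped-down request header: only the fields needed to
// illustrate the lockout.
const Header = struct {
    view: u32,
    client: u128,
};

const Replica = struct {
    view: u32,

    // A request stamped with a view this replica has not started yet is
    // simply dropped. A client that adopted view+1 from a partitioned
    // replica's pong therefore never gets an answer from the quorum that
    // is still in view.
    fn on_request(self: *Replica, header: Header) void {
        if (header.view > self.view) {
            std.log.warn("ignoring request from client {}: view {} > {}", .{
                header.client,
                header.view,
                self.view,
            });
            return;
        }
        // ... normal request processing would continue here ...
    }
};

pub fn main() void {
    var replica = Replica{ .view = 4 }; // the quorum is still in view 4
    replica.on_request(.{ .view = 5, .client = 1 }); // view 5 leaked by an isolated replica: ignored
}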
Congrats @ThreeFx on finding another really interesting liveness bug! Your report was excellent, and we also appreciate how you reduced the state space down to a single replica starting a view change, leaking this view number to the client through a pong message, before crashing. This is such a simple test case. The impact of the issue is also pernicious, as it's not a clean crash of any replica or client in the cluster. The updated fix you suggested—to only send a pong message to the client in normal status—is nice and clean and we have pushed the fix (please would you verify that this does not result in further related issues). We have decided to award you a $500 liveness bounty. Well done!!! Thanks to you, Coil will also match a further $50 to the Zig Software Foundation in recognition of their awesome work.
Also stoked to hear that you're having so much fun. We are too, receiving your reports! :)
a760a372277c2ea327b390989ae9d28a241a4ca0
Description and Impact
Client pings are answered by isolated replicas (which are not part of any quorum). This may lead to the client learning a view number higher than the one the quorum agrees on, and the quorum subsequently ignores the client's requests, since they are stamped with a view the quorum has not yet reached.
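For illustration, here is a hypothetical, simplified sketch of the client-side behaviour this issue is about. The Client and Pong types below are assumptions for the example, not the actual client code.

const std = @import("std");

// Hypothetical pong header: only the fields needed for the example.
const Pong = struct {
    view: u32,
    replica: u8,
};

const Client = struct {
    view: u32 = 0,

    // The client adopts any newer view it sees, regardless of whether the
    // sender is part of a quorum or stuck in an unsuccessful view change.
    // Once it has adopted a premature view, its requests carry that view
    // and are ignored by the cluster.
    fn on_pong(self: *Client, pong: Pong) void {
        if (pong.view > self.view) {
            std.log.info("client: adopting view {} from replica {}", .{ pong.view, pong.replica });
            self.view = pong.view;
        }
    }
};

pub fn main() void {
    var client = Client{};
    client.on_pong(.{ .view = 4, .replica = 0 }); // pong from a replica in the quorum's view
    client.on_pong(.{ .view = 5, .replica = 2 }); // pong from the isolated replica: premature view
    std.debug.print("client view = {}\n", .{client.view}); // 5: the client is now locked out
}

With the replica-side fix from the patch above, the second pong is never sent in the first place, so the client stays in the quorum's view.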
Steps to Reproduce the Bug
Run ./vopr.sh 5271112275961929105 -OReleaseSafe; it does not finish within 100_000_000 ticks. The same seed can be replayed in debug mode with ./vopr.sh 5271112275961929105 (the premature view number reaches the client in on_pong).

Note that despite me isolating only replica 2 from the cluster, this can already happen with a "simple" full node crash (and no packet drops): the isolated replica starts a view change and bumps its view number to view+1, then answers the client's ping before crashing. The client learns view+1 in on_pong, while the quorum is still in view.
Suggested Fix
I see three reasonable approaches here; one of them is to not respond to pings in non-normal operation, since the replica is not guaranteed to return to a normal state. (A minimal sketch of this check follows below.)
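A minimal sketch of that check, with placeholder names rather than the actual replica.zig types:

const std = @import("std");

// Placeholder status enum; the real replica has its own status type.
const Status = enum { normal, view_change, recovering };

// Only answer a client ping (and thus only reveal our view number) while
// our status is normal, i.e. while our view has actually started. Checking
// against .normal rather than .view_change also covers any other
// non-normal status.
fn should_answer_client_ping(status: Status) bool {
    return status == .normal;
}

pub fn main() void {
    std.debug.print("normal: {}\n", .{should_answer_client_ping(.normal)}); // true
    std.debug.print("view_change: {}\n", .{should_answer_client_ping(.view_change)}); // false
}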
The Story Behind the Bug
I've implemented network partitioning and am playing around with 5 replicas and one client. 5 replicas is an interesting case, since it allows me to isolate one replica completely without (theoretically) compromising either correctness or liveness.
Songs Enjoyed During the Production of This Issue
Liquicity Yearmix 2020
Literature
No response
The Last Word
I'm having a lot of fun :)