Liveness: Client trusts all received pongs #7

Closed
ThreeFx opened this issue Sep 14, 2021 · 4 comments

Comments

@ThreeFx
Contributor

ThreeFx commented Sep 14, 2021

Description and Impact

Client pings are answered by isolated replicas (which are not part of any quorum). This may lead to the client learning a view number higher than the one the quorum agrees on; the quorum then ignores the client's subsequent requests, since they carry a view it has not yet started.
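
For illustration, here is a minimal, self-contained Zig sketch of the behaviour described above (hypothetical names, not the actual tigerbeetle client code): the client adopts whatever newer view a pong carries, even if that view never started on a quorum.

const std = @import("std");
const assert = std.debug.assert;

// Hypothetical, simplified client state; not tigerbeetle's Client type.
const Client = struct {
    view: u32 = 0,

    // The problematic behaviour: any pong carrying a newer view is trusted,
    // even if that view never started on a quorum of replicas.
    fn on_pong(self: *Client, pong_view: u32) void {
        if (pong_view > self.view) self.view = pong_view;
    }
};

pub fn main() void {
    var client = Client{ .view = 7 };

    // An isolated replica that began (but never completed) a view change
    // replies with view + 1. The client now stamps its requests with a view
    // the quorum has not reached, and the quorum ignores those requests.
    client.on_pong(8);
    assert(client.view == 8); // The client is now "ahead" of the cluster.
}

The patch below reproduces this by isolating a single replica in the VOPR's network simulator.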

Steps to Reproduce the Bug

  1. Apply the below patch
  2. Run ./vopr.sh 5271112275961929105 -OReleaseSafe
    • no progress is made after operation 20
    • it times out after 100_000_000 ticks
  3. Run ./vopr.sh 5271112275961929105
    • on line ~9.8k, the client accepts a newer view number from the isolated replica 2 (search for on_pong:)
    • this view number is included in subsequent requests, causing the quorum of replicas 0, 1, 3, and 4 to ignore the client's messages
    • the client has no retry logic, so no recovery takes place and we stall forever

Note that although I isolate only replica 2 from the cluster, this can already happen with a "simple" full node crash (and no packet drops):

  1. Client sends ping to all replicas.
  2. Replica 2 fires its view change status timeout
  3. Replica 2 increases its view number
  4. Client ping is delivered to Replica 2
  5. Replica 2 responds with view+1
  6. Replica 2 crashes before sending out view_change messages
  7. Client accepts view+1 in on_pong
  8. Cluster stays at view

diff --git a/src/config.zig b/src/config.zig
index ed6ac8f..ea47420 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -13,7 +13,7 @@ pub const replicas_max = 5;
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 1;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..b6fd4a9 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -55,8 +55,8 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
@@ -72,15 +72,26 @@ pub fn main() !void {
         .seed = prng.random.int(u64),
         .network_options = .{
             .packet_simulator_options = .{
+                .replica_count = replica_count,
+                .client_count = client_count,
                 .node_count = node_count,
+
                 .seed = prng.random.int(u64),
+
                 .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
                 .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+
+                .partition_mode = .isolate_single,
+                .partition_probability = 100,
+                .unpartition_probability = 0,
+                .partition_stability = 10,
+
+                .path_maximum_capacity = 250, // + prng.random.uintLessThan(u8, 20),
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
-                .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .path_clog_probability = 0, //prng.random.uintLessThan(u8, 2),
+
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .packet_replay_probability = 0, //prng.random.uintLessThan(u8, 50),
             },
         },
         .storage_options = .{
@@ -118,6 +129,10 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          partition_mode = {},
+        \\          partition_probability = {}%,
+        \\          unpartition_probability = {}%,
+        \\          partition_stability = {} ticks,
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -142,6 +157,11 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
 
+        cluster.options.network_options.packet_simulator_options.partition_mode,
+        cluster.options.network_options.packet_simulator_options.partition_probability,
+        cluster.options.network_options.packet_simulator_options.unpartition_probability,
+        cluster.options.network_options.packet_simulator_options.partition_stability,
+
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
         cluster.options.storage_options.write_latency_min,
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index 0ec2bc2..3ef918e 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -12,8 +12,16 @@ pub const PacketSimulatorOptions = struct {
     packet_loss_probability: u8,
     packet_replay_probability: u8,
     seed: u64,
+
+    replica_count: u8,
+    client_count: u8,
     node_count: u8,
 
+    partition_mode: PartitionMode,
+    partition_stability: u32,
+    partition_probability: u8,
+    unpartition_probability: u8,
+
     /// The maximum number of in-flight packets a path can have before packets are randomly dropped.
     path_maximum_capacity: u8,
 
@@ -27,11 +35,18 @@ pub const Path = struct {
     target: u8,
 };
 
+pub const PartitionMode = enum {
+    uniform_size,
+    uniform_partition,
+    isolate_single,
+};
+
 /// A fully connected network of nodes used for testing. Simulates the fault model:
 /// Packets may be dropped.
 /// Packets may be delayed.
 /// Packets may be replayed.
 pub const PacketStatistics = enum(u8) {
+    dropped_due_to_partition,
     dropped_due_to_congestion,
     dropped,
     replay,
@@ -58,6 +73,11 @@ pub fn PacketSimulator(comptime Packet: type) type {
         stats: [@typeInfo(PacketStatistics).Enum.fields.len]u32 = [_]u32{0} **
             @typeInfo(PacketStatistics).Enum.fields.len,
 
+        is_partitioned: bool,
+        partition: []bool,
+        replicas: []u8,
+        stability: u32,
+
         pub fn init(allocator: *std.mem.Allocator, options: PacketSimulatorOptions) !Self {
             assert(options.one_way_delay_mean >= options.one_way_delay_min);
             var self = Self{
@@ -71,8 +91,17 @@ pub fn PacketSimulator(comptime Packet: type) type {
                 ),
                 .options = options,
                 .prng = std.rand.DefaultPrng.init(options.seed),
+
+                .is_partitioned = false,
+                .stability = options.partition_stability,
+                .partition = try allocator.alloc(bool, @as(usize, options.replica_count)),
+                .replicas = try allocator.alloc(u8, @as(usize, options.replica_count)),
             };
 
+            for (self.replicas) |_, i| {
+                self.replicas[i] = @intCast(u8, i);
+            }
+
             for (self.paths) |*queue| {
                 queue.* = std.PriorityQueue(Data).init(allocator, Self.order_packets);
                 try queue.ensureCapacity(options.path_maximum_capacity);
@@ -142,9 +171,69 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return min + @floatToInt(u64, @intToFloat(f64, mean - min) * self.prng.random.floatExp(f64));
         }
 
+        fn random_choice(self: *Self, probability: u8) bool {
+            return self.prng.random.uintLessThan(u8, 100) < probability;
+        }
+
+        fn partition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = true;
+            self.stability = self.options.partition_stability;
+
+            switch (self.options.partition_mode) {
+                .uniform_size => {
+                    const sz = self.prng.random.uintAtMost(u8, self.options.replica_count);
+                    self.prng.random.shuffle(u8, self.replicas);
+                    for (self.replicas) |r, i| {
+                        self.partition[r] = i < sz;
+                    }
+                },
+                .uniform_partition => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = self.random_choice(50);
+                    }
+                },
+                .isolate_single => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = false;
+                    }
+                    const n = self.prng.random.uintLessThan(u8, self.options.replica_count);
+                    self.partition[n] = true;
+                },
+            }
+        }
+
+        fn unpartition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = false;
+            self.stability = self.options.partition_stability;
+
+            for (self.replicas) |_, i| {
+                self.partition[i] = false;
+            }
+        }
+
         pub fn tick(self: *Self) void {
             self.ticks += 1;
 
+            if (self.stability > 0) {
+                self.stability -= 1;
+            } else {
+                if (self.is_partitioned) {
+                    if (self.random_choice(self.options.unpartition_probability)) {
+                        self.unpartition_network();
+                        log.alert("unpartitioned network: partition={d}", .{self.partition});
+                    }
+                } else {
+                    if (self.random_choice(self.options.partition_probability)) {
+                        self.partition_network();
+                        log.alert("partitioned network: partition={d}", .{self.partition});
+                    }
+                }
+            }
+
             var from: u8 = 0;
             while (from < self.options.node_count) : (from += 1) {
                 var to: u8 = 0;
@@ -157,6 +246,15 @@ pub fn PacketSimulator(comptime Packet: type) type {
                         if (data.expiry > self.ticks) break;
                         _ = queue.remove();
 
+                        if (self.is_partitioned) {
+                            if (from < self.options.replica_count and to < self.options.replica_count and self.partition[from] != self.partition[to]) {
+                                self.stats[@enumToInt(PacketStatistics.dropped_due_to_partition)] += 1;
+                                log.alert("dropped packet (different partitions): from={} to={}", .{ from, to });
+                                data.packet.deinit(path);
+                                continue;
+                            }
+                        }
+
                         if (self.should_drop()) {
                             self.stats[@enumToInt(PacketStatistics.dropped)] += 1;
                             log.alert("dropped packet from={} to={}.", .{ from, to });

Suggested Fix

I see three reasonable approaches here:

  1. Have the client disconnect from and reconnect to the cluster after some number of failed requests (see the sketch after this list).
    • This is a good option in general, as it increases the cluster's robustness to similar issues (the client believing it is ahead of the cluster).
    • This is probably what I would implement, since it is the simplest possible solution to this exact issue.
  2. Forbid replicas from responding to (client) pings during non-normal operation, since a replica in that state is not guaranteed to return to normal.
    • I'm not sure about the implications of this one, but it seems sensible given that the client fully trusts the pong messages it receives.
  3. Perform random view changes during regular operation.
    • This guarantees that the cluster will not "stagnate" at one view and will eventually catch up to whatever view the client has received.
    • I'm not sure about the wider implications of this one.
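
To make approach 1 concrete, here is a minimal Zig sketch (hypothetical names and threshold, not tigerbeetle's client API): after a number of consecutive request timeouts the client forgets its cached view, so a prematurely learned view cannot wedge it forever.

const std = @import("std");
const assert = std.debug.assert;

// Hypothetical, simplified client; not tigerbeetle's actual client type.
const Client = struct {
    view: u32 = 0,
    consecutive_timeouts: u32 = 0,

    // Assumed threshold; the real value would need tuning.
    const reconnect_after_timeouts = 3;

    fn on_request_timeout(self: *Client) void {
        self.consecutive_timeouts += 1;
        if (self.consecutive_timeouts >= reconnect_after_timeouts) self.reconnect();
    }

    // "Reconnecting" here just means discarding the possibly premature view;
    // the next pong from the quorum re-establishes the correct one.
    fn reconnect(self: *Client) void {
        self.view = 0;
        self.consecutive_timeouts = 0;
    }
};

pub fn main() void {
    // A client stuck with a view the quorum never reached.
    var client = Client{ .view = 8 };
    var i: u32 = 0;
    while (i < Client.reconnect_after_timeouts) : (i += 1) client.on_request_timeout();
    assert(client.view == 0); // The premature view has been discarded.
}

The fix that was ultimately adopted (see the patch in the next comment) takes approach 2 instead, but a client-side backstop like this would also guard against other ways of getting ahead of the cluster.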

The Story Behind the Bug

I've implemented network partitioning and am playing around with 5 replicas and one client. 5 replicas is an interesting case, since it allows me to isolate one replica completely without (theoretically) compromising either correctness or liveness.

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Literature

No response

The Last Word

I'm having a lot of fun :)

@ThreeFx
Contributor Author

ThreeFx commented Sep 14, 2021

I can confirm that only responding to client pings in normal state solves this issue. Patch:

diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index 702e376..2ae5884 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -488,6 +488,14 @@ pub fn Replica(
             if (message.header.client > 0) {
                 assert(message.header.replica == 0);
 
+                // Clients implicitly trust pong responses by all replicas.
+                // This may cause clients to learn a premature view from a
+                // replica doing an unsuccessful view change, denying them the
+                // ability to send requests to the main cluster. By only
+                // replying to client's ping messages when our state is normal
+                // we prevent the client learning premature view numbers.
+                if (self.status == .view_change) return;
+
                 self.send_header_to_client(message.header.client, pong);
             } else if (message.header.replica == self.replica) {
                 log.warn("{}: on_ping: ignoring (self)", .{self.replica});

ThreeFx added a commit to ThreeFx/viewstamped-replication-made-famous that referenced this issue Sep 14, 2021
Pong responses from unsuccessful view changes may cause the client to
prematurely learn a view number, and prevent it from participating in a
stable cluster.

Fixes: tigerbeetle#7
jorangreef added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 15, 2021
We must only ever send our view number to a client via a pong message
if we are in normal status. Otherwise, we may be partitioned from the
cluster with a newer view number and leak this view to the client,
which would then include it in subsequent requests; the cluster would
ignore those requests with their newer view number, locking out the
client. The principle here is that we must never send view numbers
for views that have not yet started.

Reported-by: @ThreeFx
Refs: tigerbeetle/viewstamped-replication-made-famous#7
@jorangreef jorangreef reopened this Sep 15, 2021
@jorangreef
Member

Congrats @ThreeFx on finding another really interesting liveness bug!

Your report was excellent, and we also appreciate how you reduced the state space down to a single replica starting a view change, leaking this view number to the client through a pong message, before crashing. This is such a simple test case.

The impact of the issue is also pernicious, as it's not a clean crash of any replica or client in the cluster.

The updated fix you suggested—to only send a pong message to the client in normal status—is nice and clean and we have pushed the fix (please would you verify that this does not result in further related issues).

We have decided to award you with a $500 liveness bounty. Well done!!!

Thanks to you, Coil will also match a further $50 to the Zig Software Foundation in recognition of their awesome work.

@jorangreef
Member

Also stoked to hear that you're having so much fun. We are too, receiving your reports! :)

@ThreeFx
Contributor Author

ThreeFx commented Sep 15, 2021

a760a372277c2ea327b390989ae9d28a241a4ca0
