Liveness: Client trusts all received pongs #7

Closed
ThreeFx opened this issue Sep 14, 2021 · 4 comments

Comments

@ThreeFx
Contributor

ThreeFx commented Sep 14, 2021

Description and Impact

Client pings are answered by isolated replicas (which are not part of any quorum). This may lead to the client learning a view number higher than the one the quorum agrees on; the quorum then ignores the client's subsequent requests, since they carry a view it has not yet started.
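
For illustration, here is a minimal, self-contained Zig sketch of the behaviour described above (hypothetical names, not the actual tigerbeetle client code): the client adopts whatever newer view a pong carries, even if that view never started on a quorum.

const std = @import("std");
const assert = std.debug.assert;

// Hypothetical, simplified client state; not tigerbeetle's Client type.
const Client = struct {
    view: u32 = 0,

    // The problematic behaviour: any pong carrying a newer view is trusted,
    // even if that view never started on a quorum of replicas.
    fn on_pong(self: *Client, pong_view: u32) void {
        if (pong_view > self.view) self.view = pong_view;
    }
};

pub fn main() void {
    var client = Client{ .view = 7 };

    // An isolated replica that began (but never completed) a view change
    // replies with view + 1. The client now stamps its requests with a view
    // the quorum has not reached, and the quorum ignores those requests.
    client.on_pong(8);
    assert(client.view == 8); // The client is now "ahead" of the cluster.
}

The patch below reproduces this by isolating a single replica in the VOPR's network simulator.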

Steps to Reproduce the Bug

  1. Apply the below patch
  2. Run ./vopr.sh 5271112275961929105 -OReleaseSafe
    • no progress is made after operation 20
    • it times out after 100_000_000 ticks
  3. Run ./vopr.sh 5271112275961929105
    • on line ~9.8k, the client accepts a newer view number from the isolated replica 2 (search for on_pong:)
    • this view number is included in subsequent requests, causing the quorum of replicas 0, 1, 3, and 4 to ignore the client's messages
    • the client has no retry logic, so no recovery takes place and we stall forever

Note that although I isolate only replica 2 from the cluster, this can already happen with a "simple" full node crash (and no packet drops):

  1. Client sends ping to all replicas.
  2. Replica 2 fires its view change status timeout
  3. Replica 2 increases its view number
  4. Client ping is delivered to Replica 2
  5. Replica 2 responds with view+1
  6. Replica 2 crashes before sending out view_change messages
  7. Client accepts view+1 in on_pong
  8. Cluster stays at view

diff --git a/src/config.zig b/src/config.zig
index ed6ac8f..ea47420 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -13,7 +13,7 @@ pub const replicas_max = 5;
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 1;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..b6fd4a9 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -55,8 +55,8 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
@@ -72,15 +72,26 @@ pub fn main() !void {
         .seed = prng.random.int(u64),
         .network_options = .{
             .packet_simulator_options = .{
+                .replica_count = replica_count,
+                .client_count = client_count,
                 .node_count = node_count,
+
                 .seed = prng.random.int(u64),
+
                 .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
                 .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+
+                .partition_mode = .isolate_single,
+                .partition_probability = 100,
+                .unpartition_probability = 0,
+                .partition_stability = 10,
+
+                .path_maximum_capacity = 250, // + prng.random.uintLessThan(u8, 20),
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
-                .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .path_clog_probability = 0, //prng.random.uintLessThan(u8, 2),
+
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .packet_replay_probability = 0, //prng.random.uintLessThan(u8, 50),
             },
         },
         .storage_options = .{
@@ -118,6 +129,10 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          partition_mode = {},
+        \\          partition_probability = {}%,
+        \\          unpartition_probability = {}%,
+        \\          partition_stability = {} ticks,
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -142,6 +157,11 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
 
+        cluster.options.network_options.packet_simulator_options.partition_mode,
+        cluster.options.network_options.packet_simulator_options.partition_probability,
+        cluster.options.network_options.packet_simulator_options.unpartition_probability,
+        cluster.options.network_options.packet_simulator_options.partition_stability,
+
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
         cluster.options.storage_options.write_latency_min,
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index 0ec2bc2..3ef918e 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -12,8 +12,16 @@ pub const PacketSimulatorOptions = struct {
     packet_loss_probability: u8,
     packet_replay_probability: u8,
     seed: u64,
+
+    replica_count: u8,
+    client_count: u8,
     node_count: u8,
 
+    partition_mode: PartitionMode,
+    partition_stability: u32,
+    partition_probability: u8,
+    unpartition_probability: u8,
+
     /// The maximum number of in-flight packets a path can have before packets are randomly dropped.
     path_maximum_capacity: u8,
 
@@ -27,11 +35,18 @@ pub const Path = struct {
     target: u8,
 };
 
+pub const PartitionMode = enum {
+    uniform_size,
+    uniform_partition,
+    isolate_single,
+};
+
 /// A fully connected network of nodes used for testing. Simulates the fault model:
 /// Packets may be dropped.
 /// Packets may be delayed.
 /// Packets may be replayed.
 pub const PacketStatistics = enum(u8) {
+    dropped_due_to_partition,
     dropped_due_to_congestion,
     dropped,
     replay,
@@ -58,6 +73,11 @@ pub fn PacketSimulator(comptime Packet: type) type {
         stats: [@typeInfo(PacketStatistics).Enum.fields.len]u32 = [_]u32{0} **
             @typeInfo(PacketStatistics).Enum.fields.len,
 
+        is_partitioned: bool,
+        partition: []bool,
+        replicas: []u8,
+        stability: u32,
+
         pub fn init(allocator: *std.mem.Allocator, options: PacketSimulatorOptions) !Self {
             assert(options.one_way_delay_mean >= options.one_way_delay_min);
             var self = Self{
@@ -71,8 +91,17 @@ pub fn PacketSimulator(comptime Packet: type) type {
                 ),
                 .options = options,
                 .prng = std.rand.DefaultPrng.init(options.seed),
+
+                .is_partitioned = false,
+                .stability = options.partition_stability,
+                .partition = try allocator.alloc(bool, @as(usize, options.replica_count)),
+                .replicas = try allocator.alloc(u8, @as(usize, options.replica_count)),
             };
 
+            for (self.replicas) |_, i| {
+                self.replicas[i] = @intCast(u8, i);
+            }
+
             for (self.paths) |*queue| {
                 queue.* = std.PriorityQueue(Data).init(allocator, Self.order_packets);
                 try queue.ensureCapacity(options.path_maximum_capacity);
@@ -142,9 +171,69 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return min + @floatToInt(u64, @intToFloat(f64, mean - min) * self.prng.random.floatExp(f64));
         }
 
+        fn random_choice(self: *Self, probability: u8) bool {
+            return self.prng.random.uintLessThan(u8, 100) < probability;
+        }
+
+        fn partition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = true;
+            self.stability = self.options.partition_stability;
+
+            switch (self.options.partition_mode) {
+                .uniform_size => {
+                    const sz = self.prng.random.uintAtMost(u8, self.options.replica_count);
+                    self.prng.random.shuffle(u8, self.replicas);
+                    for (self.replicas) |r, i| {
+                        self.partition[r] = i < sz;
+                    }
+                },
+                .uniform_partition => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = self.random_choice(50);
+                    }
+                },
+                .isolate_single => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = false;
+                    }
+                    const n = self.prng.random.uintLessThan(u8, self.options.replica_count);
+                    self.partition[n] = true;
+                },
+            }
+        }
+
+        fn unpartition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = false;
+            self.stability = self.options.partition_stability;
+
+            for (self.replicas) |_, i| {
+                self.partition[i] = false;
+            }
+        }
+
         pub fn tick(self: *Self) void {
             self.ticks += 1;
 
+            if (self.stability > 0) {
+                self.stability -= 1;
+            } else {
+                if (self.is_partitioned) {
+                    if (self.random_choice(self.options.unpartition_probability)) {
+                        self.unpartition_network();
+                        log.alert("unpartitioned network: partition={d}", .{self.partition});
+                    }
+                } else {
+                    if (self.random_choice(self.options.partition_probability)) {
+                        self.partition_network();
+                        log.alert("partitioned network: partition={d}", .{self.partition});
+                    }
+                }
+            }
+
             var from: u8 = 0;
             while (from < self.options.node_count) : (from += 1) {
                 var to: u8 = 0;
@@ -157,6 +246,15 @@ pub fn PacketSimulator(comptime Packet: type) type {
                         if (data.expiry > self.ticks) break;
                         _ = queue.remove();
 
+                        if (self.is_partitioned) {
+                            if (from < self.options.replica_count and to < self.options.replica_count and self.partition[from] != self.partition[to]) {
+                                self.stats[@enumToInt(PacketStatistics.dropped_due_to_partition)] += 1;
+                                log.alert("dropped packet (different partitions): from={} to={}", .{ from, to });
+                                data.packet.deinit(path);
+                                continue;
+                            }
+                        }
+
                         if (self.should_drop()) {
                             self.stats[@enumToInt(PacketStatistics.dropped)] += 1;
                             log.alert("dropped packet from={} to={}.", .{ from, to });

Suggested Fix

I see three reasonable approaches here:

  1. Have the client disconnect from and reconnect to the cluster after some number of failed requests (see the sketch after this list).
    • This is a good option in general, as it increases the cluster's robustness to similar issues (the client believing it is ahead of the cluster).
    • This is probably what I would implement, since it is the simplest possible solution to this exact issue.
  2. Forbid replicas from responding to (client) pings during non-normal operation, since a replica in that state is not guaranteed to return to normal.
    • I'm not sure about the implications of this one, but it seems sensible given that the client fully trusts the pong messages it receives.
  3. Perform random view changes during regular operation.
    • This guarantees that the cluster will not "stagnate" at one view and will eventually catch up to whatever view the client has received.
    • I'm not sure about the wider implications of this one.
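
To make approach 1 concrete, here is a minimal Zig sketch (hypothetical names and threshold, not tigerbeetle's client API): after a number of consecutive request timeouts the client forgets its cached view, so a prematurely learned view cannot wedge it forever.

const std = @import("std");
const assert = std.debug.assert;

// Hypothetical, simplified client; not tigerbeetle's actual client type.
const Client = struct {
    view: u32 = 0,
    consecutive_timeouts: u32 = 0,

    // Assumed threshold; the real value would need tuning.
    const reconnect_after_timeouts = 3;

    fn on_request_timeout(self: *Client) void {
        self.consecutive_timeouts += 1;
        if (self.consecutive_timeouts >= reconnect_after_timeouts) self.reconnect();
    }

    // "Reconnecting" here just means discarding the possibly premature view;
    // the next pong from the quorum re-establishes the correct one.
    fn reconnect(self: *Client) void {
        self.view = 0;
        self.consecutive_timeouts = 0;
    }
};

pub fn main() void {
    // A client stuck with a view the quorum never reached.
    var client = Client{ .view = 8 };
    var i: u32 = 0;
    while (i < Client.reconnect_after_timeouts) : (i += 1) client.on_request_timeout();
    assert(client.view == 0); // The premature view has been discarded.
}

The fix that was ultimately adopted (see the patch in the next comment) takes approach 2 instead, but a client-side backstop like this would also guard against other ways of getting ahead of the cluster.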

The Story Behind the Bug

I've implemented network partitioning and am playing around with 5 replicas and one client. 5 replicas is an interesting case, since it allows me to isolate one replica completely without (theoretically) compromising either correctness or liveness.

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Literature

No response

The Last Word

I'm having a lot of fun :)

@ThreeFx
Contributor Author

ThreeFx commented Sep 14, 2021

I can confirm that only responding to client pings in normal state solves this issue. Patch:

diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index 702e376..2ae5884 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -488,6 +488,14 @@ pub fn Replica(
             if (message.header.client > 0) {
                 assert(message.header.replica == 0);
 
+                // Clients implicitly trust pong responses by all replicas.
+                // This may cause clients to learn a premature view from a
+                // replica doing an unsuccessful view change, denying them the
+                // ability to send requests to the main cluster. By only
+                // replying to client's ping messages when our state is normal
+                // we prevent the client learning premature view numbers.
+                if (self.status == .view_change) return;
+
                 self.send_header_to_client(message.header.client, pong);
             } else if (message.header.replica == self.replica) {
                 log.warn("{}: on_ping: ignoring (self)", .{self.replica});

ThreeFx added a commit to ThreeFx/viewstamped-replication-made-famous that referenced this issue Sep 14, 2021
Pong responses from unsuccessful view changes may cause the client to
prematurely learn a view number, and prevent it from participating in a
stable cluster.

Fixes: tigerbeetle#7
jorangreef added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 15, 2021
We must only ever send our view number to a client via a pong message
if we are in normal status. Otherwise, we may be partitioned from the
cluster with a newer view number and leak this view to the client,
which would then include it in subsequent requests; the cluster would
ignore those requests with their newer view number, locking out the
client. The principle here is that we must never send view numbers
for views that have not yet started.

Reported-by: @ThreeFx
Refs: tigerbeetle/viewstamped-replication-made-famous#7
@jorangreef jorangreef reopened this Sep 15, 2021
@jorangreef
Member

Congrats @ThreeFx on finding another really interesting liveness bug!

Your report was excellent, and we also appreciate how you reduced the state space down to a single replica starting a view change, leaking this view number to the client through a pong message, before crashing. This is such a simple test case.

The impact of the issue is also pernicious, as it's not a clean crash of any replica or client in the cluster.

The updated fix you suggested—to only send a pong message to the client in normal status—is nice and clean and we have pushed the fix (please would you verify that this does not result in further related issues).

We have decided to award you with a $500 liveness bounty. Well done!!!

Thanks to you, Coil will also match a further $50 to the Zig Software Foundation in recognition of their awesome work.

@jorangreef
Member

Also stoked to hear that you're having so much fun. We are too, receiving your reports! :)

@ThreeFx
Contributor Author

ThreeFx commented Sep 15, 2021

a760a372277c2ea327b390989ae9d28a241a4ca0
