When you write a unit test, you write some code and check that it executes
in the way you expect. For example, to test a Reversed([]byte) []byte
function, you might check that
Reversed([]byte{})
is[]byte{}
,Reversed([]byte{1})
is[]byte{1}
,Reversed([]byte{1, 2})
is[]byte{2, 1}
, and so on.
Unit tests are thus limited to only test the behaviors and corner cases that you can think of and have the patience to write down.
With randomized property-based tests, on the other hand, you specify a
property of your code that should always hold and then test the property on
millions of randomly generated examples. For example, we can test that the
Reversed
function is involutive—i.e. that Reversed(Reversed(x)) == x
—by writing a fuzz test that tests the property on randomly
generated byte slices:
func FuzzReversedIsInvolutive(f *testing.F) {
f.Fuzz(func(t *testing.T, want []byte) {
got := Reversed(Reversed(want))
if !slices.Equal(want, got) {
t.Fatalf("got %v, want %v", got, want)
}
})
}
We can test Service Weaver applications in the same way and check that properties of an application always hold, even when we run random operations on random inputs in the face of random failures and random interleavings. For distributed systems, this kind of randomized property-based testing is especially valuable, as many failure inducing corner cases are extremely pathological and hard to think of. Even well studied and heavily scrutinized protocols tend to have subtle bugs.
In this document, we describe three ways to introduce randomized property-based testing to Service Weaver. This doc borrows heavily from Testing Distributed Systems by Andrey Satarin.
The various approaches differ in the following ways.
- What properties can be tested?
- What kind of failures are supported?
- How replayable, readable, and minimal are failing executions?
- How concurrently do things run?
- How complex are the APIs?
With Jepsen-style testing, we run a Service Weaver application with components in their own containers. We execute random client requests against the system, with requests allowed to execute in parallel. We periodically inject failures into the system. For example, we can
- kill a container,
- partition containers from one another,
- suspend a process for some time,
- skew a container's clock,
- corrupt a disk, and so on.
A user has to specify which operations to run, how to generate inputs to these operations, which types of failures to inject, and what properties to check.
Pros. The benefit of Jepsen-style testing is that failures are relatively real. We don't have to simulate a process failing, for example. We can just straight up fail it. Injected failures will never sufficiently capture the full set of failures that can happen in the wild, of course, but they are better than some other approaches which we'll discuss in a moment. Client requests also run in parallel, which allows for some interleavings that other approaches don't cover.
Cons. The disadvantage of Jepsen-style testing is that failure inducing executions are hard to understand, replay, and minimize. An execution can be very long, consisting of thousands of parallel client requests. Replaying and minimizing an execution is impossible due to the non-determinism introduced by executing requests and failures in parallel (this is a problem even if the code itself is deterministic). Running an app across a set of containers can also be slow.
With deterministic simulation, we run a Service Weaver application all within a single process. We execute random client requests concurrently, but not in parallel. Specifically, a client request executes on its own, but when the client request calls a component method, its execution is suspended and control is passed to a centralized scheduler. The scheduler can then
- deliver a pending method call,
- deliver a pending method return,
- deliver an error to a pending method call, or
- start running a new client request.
In this way, execution runs step-by-step, with the simulator deciding exactly which step to run next. Moreover, assuming the application being tested is deterministic, a simulator will repeatedly produce the exact same execution when given the same seed.
Note that failures are implicit in the execution model. To simulate a network partition, messages between the two sides of the partition are not delivered, and a replica failure is simulated as a single-node partition.
As with Jepsen-style tests, a user has to specify which operations to run, how to generate inputs to these operations, which types of failures to inject, and what properties to check. You can find a simple prototype of a deterministic simulator here.
Pros. The benefit of deterministic simulation is that executions are easy to understand, replay, and minimize. Once a bug is found and fixed, failure inducing executions can be replayed to check that the bug fix works. Executions can also be minimized, making it much easier to understand the root cause of the failure.
Deterministic simulation can also enumerate all possible message interleavings, including those that would very rarely occur naturally in a Jepsen-style test.
Deterministic simulation is also quite fast.
Cons. The disadvantage of deterministic simulation is that, unlike Jepsen-style tests, it cannot uncover bugs that arise from parallel execution or realistic failures. For example, consider a bug that only manifests when a component crashes at a very particular point in the middle of its execution. Because deterministic simulation only suspends a component at method call boundaries, it won't uncover this bug.
If the code in a Service Weaver application is not deterministic, then deterministic simulation is less effective. A failing execution may not always fail, and minimization may act erratically.
Implementing deterministic simulation is also more complex than implementing Jepsen-style tests, as the simulator has to cleverly inject itself into the code at every method call.
Chaos testing a Service Weaver app is similar to Jepsen testing it. We run the application in a set of containers, run a random workload against it, and fail things periodically. The difference is that Jepsen-style testing tests properties that should always hold, no matter how many failures there are or how severe the failures are. Chaos testing, on the other hand, tends to test that certain metrics don't degrade too much in the presence of a more restricted type of failure. For example, we could test that latency doesn't dip too much in the presence of any one replica failing.
Chaos testing is supposed to be done in production, but it can also be done locally with simulated workloads.
Pros and Cons. The pro and con of chaos testing is that we cannot test complex properties (e.g., linearizability) or complex failures. This makes chaos testing easier to reason about but also less expressive.
I propose we introduce deterministic simulation style testing to weavertest. Later, we may be able to implement Jepsen-style testing as well using the same or very similar API. Both deterministic testing and Jepsen testing require roughly the same things from the user (which operations to run, how to generate arguments to these operations, which failures to inject, which properties to check).