Tests for the Redpanda distributed queue.
You'll need a Jepsen environment: see [https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment](Jepsen's setup documentation) for details. In short, this means a cluster of 3+ Debian Buster DB nodes to run instances of Redpanda, and a control node to run this test harness. The control node needs a JDK, JNA, Leiningen, Gnuplot, and Graphviz. On the control node, plan on ~24G of memory and about 100 GB of disk, if you want to run a few hundred ~1000 second tests back to back.
Once you have a Jepsen cluster set up, you'll need a copy of this test harness.
That can be a tarball or git clone of this repository, or you can compile this
test to a fat jar using lein uberjar
, copy that jar to your control node,
and invoke it using java -jar <jar-file> <test args ...>
. In these docs,
we'll assume you've got a copy of this directory and are invoking tests via
lein run
.
For these examples, we'll assume you have a file ~/nodes
which has your DB
node hostnames, one per line, and that your user with sudo access on each node
is named `admin. This is the setup you'd get out of the box from the Jepsen
Cloudformation deployment on the AWS
Marketplace. All
these commands should be run in the top-level directory of this repository.
To run a very brief test, just to make sure everything works:
lein run test --nodes-file ~/nodes --username admin
This will likely print something like:
...
:valid? true}
Everything looks good! ヽ(‘ー`)ノ
To observe inconsistent offsets and/or duplicates in Redpanda 21.10.1, try:
lein run test --nodes-file ~/nodes --username admin -s --nemesis pause,kill --time-limit 300 --test-count 5
-s
asks for safer options than the defaults: it turns on idempotence, sets
auto-offset-reset=earliest, etc. --nemesis pause,kill
asks the nemesis to
pause and kill random nodes throughout the test. We also increase the time of
each test to 300 seconds, and run up to 5 tests in a row, stopping as soon as
the test detects an error. This might emit something like:
:workload {:valid? false
:duplicate {:count 1,
:errs ({:key 8, :value 711, :count 2})},
Or we could run the test against version 21.11.2, with process crashes and membership changes, and for significantly longer. This should, after a few hours, demonstrate lost/stale messages, reported as unseen
:
lein run test --nodes-file ~/nodes --username admin -s --no-ww-deps --nemesis kill,membership --time-limit 1000 --test-count 10 --version 21.11.2-1-f58e69b6
To start a web server which lets you browse the results of these tests on port 8080, run:
lein run serve
lein run test
runs a single test--possibly multiple times in a row.
lein run test-all
runs a whole slew of different tests, including multiple
workloads, versions, combinations of nemeses, with and without transactions.
You can constrain any of these--for example, to compare behavior under network
partitions across two different versions, both with and without transactions:
lein run test-all --nodes-file ~/nodes --username admin --nemesis partition --time-limit 300 --versions 21.10.1-1-e7b6714a,21.11.2-1-f58e69b6
lein run serve
runs a web server to browse results of tests from the local
store/
directory. There's nothing magic about this directory--you can tar it
up and copy it around, and it'll still work fine. You can copy one store
directory's contents into another and that'll work too, which might be helpful
for CI or farming out tests to lots of clusters.
All three of these commands provide docs for all of their options: try lein run test -h
for help on running a single test.
In each test directory in store/
you'll find several files. For starters, there's one directory for every DB node, which contains:
data.tar.bz2
: a tarball of that node's Redpanda data directoryredpanda.log
: the stdout/stderr logs of the Redpanda servertcpdump.pcap
: If running with--tcpdump
, a packet capture file of this node's kafka-protocol network traffic.
Three files show the chronology of the test. jepsen.log
has the full output
of the test run--everything printed to the console. history.edn
is a
machine-readable file with all the logical operations Jepsen performed during
the test. history.txt
is the same, but as a tab-aligned text file.
latency-raw.png
and latency-quantiles.png
show the latency of each
operation over time. Colors denote successful (ok), failed (fail), or unknown
(info) operations. Shape denotes the logical function :f
of that
operation--for example, a poll
, subscribe
, txn
. Nemesis activity is shown as colored bars along the top of the plot, and vertical lines whenever the nemesis changes something.
rate.png
shows the rate of requests per second, over time, with the same shapes and colors as the latency plot.
unseen.png
shows the number of messages which were committed but not read by
any poller over time, broken out by key (partition-topic). Two final vertical
lines show the start and stop of the final polling period, when the cluster has
healed and we're trying to read all writes. This graph should go to zero
relatively quickly during the recovery window: if it doesn't, that suggests
some writes have been lost or are severely delayed.
realtime-lag
plots shows how far behind the most recent committed write each
consumer is, over time. These come in several families, aggregated by key, by
worker thread (which maps to one client at any given time), or by neither.
Whenever we subscribe or assign a fresh topic, this might jump up, but it
should head down as consumers catch up to the most recent values. You can use
these plots to identify whether a single key or single thread has gotten stuck,
and whether other threads or keys are making progress.
test.fressian
and test.jepsen
have binary representations of the test in
its entirety.
elle/
shows cycles between transactions, both as text files and as svg graphs. These are categorized by anomaly type--e.g. G0, G1c, etc.
orders
appears whenever we see Weird Things Happen on some key. It contains
one SVG file for each key that did something odd, and shows the view of each
ok
operation into that particular key's offsets. Time flows top to bottom,
and offsets are arranged left to right. Each number is the value of the message
at that particular offset. Hovering over any row shows more information. This
is helpful for understanding when there are reorderings or message loss.
results.edn
has the results of the checker.
results.edn
is a map with several keys. At each layer, the :valid?
key says
whether the results (or part thereof) were considered valid, or if an anomaly was detected.
The stats
part of the analysis provides overall statistics on the number of
operations, how many were successful (ok), failed (fail), or indeterminate
(info). These are also broken down by function, so you can see the behavior of
(e.g.) just poll
operations.
{:stats {:valid? true,
:count 14772,
:ok-count 14653,
:fail-count 0,
:info-count 119,
:by-f {:assign {:valid? true,
:count 1901,
:ok-count 1901,
:fail-count 0,
:info-count 0},
:crash {:valid? false,
:count 8,
:ok-count 0,
:fail-count 0,
:info-count 8},
:debug-topic-partitions {:valid? true,
:count 8,
:ok-count 8,
:fail-count 0,
:info-count 0},
:poll {:valid? true,
:count 6551,
:ok-count 6551,
:fail-count 0,
:info-count 0},
:send {:valid? true,
:count 6304,
:ok-count 6193,
:fail-count 0,
:info-count 111}}},
clock
and perf
are trivially true; they generate the clock, latency, and
rate plots as a side effect.
:clock {:valid? true},
:perf {:latency-graph {:valid? true},
:rate-graph {:valid? true},
:valid? true},
ex
tracks exceptions thrown by Jepsen Clients during the test. When
exceptions appear here, you may want to add them to the error handling in that
particular client. It is, in general, always safe to allow them to
throw---explicitly handling these errors improves test specificity and
performance.
assert
looks for assertion errors thrown by the Redpanda server, by parsing
logfiles.
workload
contains workload-specific results.
The queue workload includes two important keys: :error-types
, which shows all
"interesting" behaviors observed during the test, and :bad-error-types
, which
are those we think are specifically illegal given the test being run. For
example:
:error-types (:G0
:duplicate
:int-nonmonotonic-poll
:int-poll-skip
:poll-skip),
:bad-error-types (:duplicate
:int-nonmonotonic-poll
:int-poll-skip
:poll-skip)},
Here we detected a G0 anomaly, but because those happen normally in the Kafka transaction model, we didn't flag it as a "bad" error. The duplicate, internal nonmonotonic polls and poll skips, and external poll skips, caused this test to fail.
Each error type has a corresponding key in :errors
part of the workload
results. Examples of each type of error follow. For simplicity, these examples
are drawn from the test suite's internal tests; they use keys like :x
rather
than integers, but should otherwise be structurally alike to those reported by
the real test harness.
:inconsistent-offsets
signifies that a single offset in some key's log
contained multiple values. The offset-value mapping, called a version order
,
is derived from both send and poll operations. This example tells us that key
:x
, at offset 0, contained both values 1
and 2
. :index
is a dense
offset, without gaps.
{:inconsistent-offsets
({:key :x, :offset 0, :index 0, :values #{1 2}})}
:G1a
reports cases where an operation definitely failed, but one of its
writes appeared in a poll operation. For instance, this error shows that on key
:x
, value 2
was written by a send operation which failed, but that send was
later observed by :reader
.
{:G1a
({:key :x,
:value 2,
:writer
{:index 1,
:time 1,
:process 0,
:type :fail,
:f :send,
:value [[:send :x 2] [:send :y 3]]},
:reader
{:index 3,
:time 3,
:process 1,
:type :ok,
:f :poll,
:value [[:poll {:x [[0 2]]}]]}})}
:lost-write
finds cases where a known-successful send occurs at offset a, and
some offset b (such that a < b) is polled, and the message at offset a never
appears to any poll. This example shows that on key :x
, value :a
was lost.
It was written at index 0 in the log for :x
, and the highest index which was
observed in some poll was 2. The writer of value :a
is given, and so is the
reader which polled index 2. Because we expect pollers (across all consumers,
at least) observe values without gaps, this read of offset 2 implies we should
also have read offset 0--and yet no such read was found in this history.
{:lost-write
({:key :x,
:value :a,
:index 0,
:max-read-index 2,
:writer
{:index 1,
:time 1,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [0 :a]]]},
:max-read
{:index 7,
:time 7,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[2 :c]]}]]}}
Note that this analysis is sophisticated enough to reason about inconsistent offsets conservatively. Not all parts of the checker do this, but lost-writes is careful to keep track of multiple indexes for a given message value, and multiple values at a given index.
The lost-write checker also helps verify transactional atomicity: reading one part of a transaction lets the checker prove that all the other writes must have been written also---even if the writing transaction was itself indeterminate.
:poll-skip
finds places where a poller unexpectedly jumps over some offsets in
the log during two successive calls to poll
performed by different
operations on the same client, and there was no call to assign
or subscribe
between those polls which would have caused the consumer to forget the offset
it was tracking. This example shows that on key :x
, a single consumer jumped
2 indexes forward in the log, skipping over value :c
. The two operations are
shown: the first polled :a
and :b
at indexes 1 and 2, and the second polled
:d
at index 4. The checker here knew that there existed a value :c
at index
3 between these two polls, which went unobserved. :delta
is the number of
indexes between the two polls--if pollers read in perfect order, we'd expect
this to always be 1.
{:poll-skip
({:key :x,
:delta 2,
:skipped (:c),
:ops
[{:index 1,
:time 1,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[1 :a] [2 :b]]}]]}
{:index 7,
:time 7,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[4 :d]]}]]}]})}
:int-poll-skip
finds the same thing, but inside a single operation. This is
helpful for detecting anomalies that could occur inside a single transaction.
For instance, this error shows that on key x
, a single transaction polled values :a
and :d
in sequence, which skipped over :b
.
{:int-poll-skip
({:key :x,
:values [:a :d],
:delta 2,
:skipped (:b),
:op
{:index 3,
:time 3,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[1 :a] [4 :d]]}]]}})}
:nonmonotonic-poll
finds cases where a single consumer, without changing its
assign/subscribe mapping for some key, performed subsequent calls to poll
and
observed values in the second poll which started at or before the previous poll
finished. For instance, this error shows two successive operations on the same consumer which polled values :c
then :b
, jumping back one index.
{:nonmonotonic-poll
({:key :x,
:values [:c :b],
:delta -1,
:ops
[{:index 3,
:time 3,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[1 :a] [2 :b] [3 :c]]}]]}
{:index 7,
:time 7,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[2 :b] [3 :c] [4 :d]]}]]}]})}
:int-nonmonotonic-poll
does the same, but inside a single operation. For
instance, this error shows that on key :x
, a single process went from reading offset 3 (:c
) to offset 1 (:a
), jumping two indices backwards in the log.
{:int-nonmonotonic-poll
({:key :x,
:values [:c :a],
:delta -2,
:op
{:index 3,
:time 3,
:process 0,
:type :ok,
:f :poll,
:value [[:poll {:x [[3 :c] [1 :a]]}]]}})}
:int-send-skip
looks for a single transaction which performs two subsequent
sends to the same key, and some other offset lands between those writes. This
is another way to detect G0 cycles, where writes interleave with one another.
In this example, a single transaction wrote :a
and :c
to key :x
, skipping
over message :b
from another transaction. This shows a lack of write
isolation between Kafka/Redpanda transactions, and appears to be normal
behavior.
{:int-send-skip
({:key :x,
:values [:a :c],
:delta 2,
:skipped (:b),
:op
{:index 1,
:time 1,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [1 :a]] [:send :x :c]]}})},
:nonmonotonic-send
finds cases where two subsequent operations on the same
producer sent values to the same key, and the latter message wound up at an
offset prior to the former message. For instance, this case shows that on key
:x
, two calls to send
on the same producer, split across two different
operations, wrote offsets out of order. The second operation's first send of
:a
landed three indices before the first operation's final send of :d
.
{:nonmonotonic-send
({:key :x,
:values [:d :a],
:delta -3,
:ops
[{:index 1,
:time 1,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [3 :c]] [:send :x [4 :d]]]}
{:index 5,
:time 11,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [1 :a]] [:send :x [2 :b]]]}]})}
:int-nonmonotonic-send
is the same thing, but inside a single transaction.
Here, two calls to send within a single transaction received offsets out of
order on key :x
.
{:int-nonmonotonic-send
({:key :x,
:values [:c :a],
:delta -1,
:op
{:index 1,
:time 1,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [3 :c]] [:send :x [1 :a]]]}}
:duplicate
occurs when a single value appears at multiple offsets in some key's log. Since we only ever insert unique values, and do not (above the level of the Kafka producer's internal retries) ever retry, we expect that each value appear at most once per key. This example shows that on key :x
, value :a
appeared at two distinct offsets.
{:duplicate ({:key :x, :value :a, :count 2})}
:unseen
reports the number of messages which were successfully committed but
have not appeared in any poll, as of the last poll in the history. This checker
cannot distinguish between a lost write vs one which is simply very delayed. To
mitigate this weakness, we try really hard to read everything at the end of a
test---but that's still not a guarantee that :unseen
errors are truly lost.
Sure, they texted "omw" three hours ago and still haven't shown up to Show
Tunes at Sidetrack, but they might not be dead. Maybe they're watching a sixth
episode of Golden Girls and trying to figure out which high tops to wear. We
don't judge.
This example includes the time (in nanoseconds since the start of the test) of
the final unseen inference, a map :unseen
of keys to the number of unseen
messages on each key, and a map :messages
of keys to the specific messages
unseen. This test failed to observe one committed message on key 6, and 5 on
key 23.
{:unseen {:time 1256465220303,
:unseen {6 1, 23 5},
:messages {6 (311),
23 (360 361 362 363 365)}}}
:G0
finds write cycles: a cluster of transactions such that each wrote some
message both before and after every other transaction in the cluster. A data
structure representation is included in the workload. Here, for instance, a
pair of transactions had a cycle where T1 wrote before T2 on key :x
, and
vice-versa on key :y
. The :cycle
key shows the operations involved: [T1,
T2, T1]. The :steps
field explains the relationships between successive pairs
in that cycle. The first relationship was a write-write (:ww
) dependency on key :x
, where the first transaction wrote message :a
and the second wrote message :b
.
{:G0
[{:cycle
[{:index 2,
:time 2,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [0 :a]] [:send :y [1 :a]]]}
{:index 3,
:time 3,
:process 1,
:type :ok,
:f :send,
:value [[:send :x [1 :b]] [:send :y [0 :b]]]}
{:index 2,
:time 2,
:process 0,
:type :ok,
:f :send,
:value [[:send :x [0 :a]] [:send :y [1 :a]]]}],
:steps
({:type :ww,
:key :x,
:value :a,
:value' :b,
:a-mop-index 0,
:b-mop-index 0}
{:type :ww,
:key :y,
:value :b,
:value' :a,
:a-mop-index 1,
:b-mop-index 1}),
:type :G0}]}
This can be a bit hard to understand from the data structure representation,
but you'll find corresponding plain-English and visual diagrams explaining this
cycle in elle/G0.txt
and elle/g0/
.
:G1c
finds circular information flow: clusters where a cycle exists composed
of write-read and write-write dependencies. As in G0, write-write dependencies
are inferred from offsets written. Write-read dependencies are inferred
whenever one transaction polls another's sent messages. This G1c is comprised
entirely of write-read dependencies---note the :wr
types in each step. The
first transaction sent :a
to key :x
, which was read by the second
transaction. The second transaction sent :b
to key :y
, which was read by
the first. Like G0, textual and visual explanations of this anomaly are
available in the elle/
directory.
{:G1c
[{:cycle
[{:index 2,
:time 2,
:process 0,
:type :ok,
:f :txn,
:value [[:send :x [0 :a]] [:poll {:y [[0 :b]]}]]}
{:index 3,
:time 3,
:process 1,
:type :ok,
:f :txn,
:value [[:send :y [0 :b]] [:poll {:x [[0 :a]]}]]}
{:index 2,
:time 2,
:process 0,
:type :ok,
:f :txn,
:value [[:send :x [0 :a]] [:poll {:y [[0 :b]]}]]}],
:steps
({:type :wr, :key :x, :value :a, :a-mop-index 0, :b-mop-index 1}
{:type :wr, :key :y, :value :b, :a-mop-index 0, :b-mop-index 1}),
:type :G1c}]}
Note that both G0 and G1c cycles involving write-write edges may or may not be
"real" depending on how you interpret Kafka's producer.send
semantics. If you
prefer to ignore these write dependencies, pass --no-ww-deps
to the test, and
it'll only infer write-read edges.
Queue tests also generate an additional file, consume-counts.edn
. This file
attempts---perhaps poorly---to tell whether a history offered "exactly once
semantics". It looks at all successful operations which performed a poll
operation while using subscribe
(not assign
, which we expect leads to
duplicate polls!), and counts the number of times each value was polled. Under
exactly-once semantics, I think this number should always be 1, but we were
never able to get this to work.
This file has two keys. :distribution
is a map of counts (i.e. how many times
a record was polled) to the number of times that count occurred. :dup-counts
is a map of keys (topic-partitions) to values to the number of times that value
was polled, for any values which were polled multiple times. For examples, this test had 17133 messages which were polled once, and 174 which were polled twice. Key 2 had a single duplicate, message 79, which was saw twice.
{:distribution {1 17133, 2 174},
:dup-counts
{2 {79 2},
3
{111 2,
112 2,
113 2,
114 2,
...}
...}}
project.clj
defines this test's version, dependencies, JVM options, and entry
point; it's read by Leiningen. Source code for the test suite lives in src/
.
Tests for that testing code live in test/
. In both of these directories,
folder structure maps to namespaces: the file src/jepsen/redpanda/core.clj
defines the jepsen.redpanda.core
namespace. The store/
directory stores the
results of any tests you might run.
The top-level namespace for this test is jepsen.redpanda.core
, which defines
CLI options and constructs tests to run, then passes them to Jepsen for
execution. jepsen.redpanda.db.redpanda
and jepsen.redpanda.db.kafka
defines
database setup and teardown for Redpanda and Kafka, respectively.
jepsen.redpanda.client
is for working with Kafka clients.
jepsen.redpanda.nemesis
handles fault injection: most notably, the cluster
membership state machine. Workloads live in jepsen.redpanda.workload.queue
and jepsen.redpanda.workload.list-append`.
This test comprises two workloads.
The main workload, queue
, performs both transactional and non-transactional
sends and polls, mixed with calls to assign
or subscribe
. At the end of the
test, it tries to read everything that was written via a series of final polls.
It has a sophisticated family of analyzers which look for duplicate writes,
inconsistent offsets, places where consumers or producers jump forward or
backwards in offsets, aborted reads, and some basic cycles like G0 and G1c.
The second workload, list-append
, is much simpler: it performs sends much
like queue
, but for any read, attempts to read the entire topic-partition
based on the most recent offset.
Both workloads spread their operations across a rotating pool of
topic-partitions, creating new topics once existing topics reach a certain
threshold of writes. You can choose which workload is run via -w queue
or -w list-append
.
--nemesis pause,kill
performs randomized process pauses and kill -9. The
other fault types are clock
, which jitters clocks around, partition
, which
partitions the network between DB nodes (but not between clients and servers!),
and membership
, which adds and politely removes nodes from the cluster.
Note that clock skew tests will only work on nodes which have real clocks---Docker and LXC can't change the system clock.
Note also that the membership nemesis in the main branch only works with new
APIs introduced after 21.11.2; tests using --nemesis membership
will still
run with versions 21.11.2, but won't actually remove nodes, and will complain a
lot. Use git checkout compat-21.11.2
for the last version of the test suite
which ran with 21.11.2's membership APIs. Note that this membership nemesis may
drive the cluster into unsafe regimes, so watch its activity in jepsen.log
carefully when verifying a failure.
--nemesis-interval 5
sets the mean interval between nemesis operations to 5
seconds; see latency.png
to get a sense of how this affects availability.
``-db-targetscontrols how DB-node related faults choose their targets;
--db-targets all`, for instance, kills and pauses every node at once,
whereas `--db-targets one` only kills or pauses a single node at a time.
`--partition-targets` controls how Jepsen chooses the topology of network
partitions.
This test harness also comes with its own tests--mainly for the queue
workload's various analyzers. These tests live in test/
, and can be run via
lein test
---not to be confused with lein run test
, which runs the Jepsen
test itself. Use these if you wind up changing the checkers somehow, to make
sure they still detect the anomalies they ought to. These tests can also be
helpful in understanding why the various analyzer functions work the way they
do, and what they consider an anomaly.
Sometimes you need to explore a test's history in more detail. lein repl
will
spawn a Clojure repl with all the test suite's code available. To start with,
you might want a few namespaces and functions available:
jepsen.redpanda.core=> (require '[jepsen.store :as s] '[jepsen.checker :refer [check]] '[jepsen.redpanda.workload.queue :as q] '[jepsen.redpanda.client :as c])
We can load a test from disk using jepsen.store/test
. It can take a path to a
particular test's directory in store/
. As a shortcut, clicking the title of a
test directory in the web interface will copy this path as a string, so you can
paste it right into the REPL.
jepsen.redpanda.core=> (def t (s/test "/home/aphyr/redpanda/store/2022-01-19.deb queue subscribe acks=all retries=1000 aor=earliest default-r=3 auto-topics=false idem=true pause/20220120T220302.000-0500"))
Was this test valid?
jepsen.redpanda.core=> (:valid? (:results t))
false
No! Why not?
jepsen.redpanda.core=> (->> t :results :workload pprint)
... eight billion lines ...
:time 63860531157,
:process 131,
:index 7195}}]}}
Well, that's a lot to read. What keys are in this map?
jepsen.redpanda.core=> (->> t :results :workload keys)
(:valid? :unseen :poll-skip :info-txn-causes :worst-realtime-lag :int-send-skip :lost-write :nonmonotonic-poll :error-types :int-poll-skip :int-nonmonotonic-poll)
Right, let's look at the error types:
jepsen.redpanda.core=> (->> t :results :workload :error-types)
[:int-nonmonotonic-poll :int-poll-skip :int-send-skip :lost-write :nonmonotonic-poll :poll-skip :unseen]
All kinds of cool stuff here. How many lost-write errors?
jepsen.redpanda.core=> (->> t :results :workload :lost-write :count)
4
What was the first?
jepsen.redpanda.core=> (->> t :results :workload :lost-write :errs first pprint)
{:key 54,
:value 436,
:index 401,
:max-read-index 977,
:writer
{:type :ok,
:f :txn,
:value [[:send 54 [1114 436]] [:poll {}] [:poll {}]],
:time 800986095314,
:process 648,
:rebalance-log
[{:type :revoked, :keys [55 4]} {:type :revoked, :keys [53]}],
:index 91528},
:max-read
{:type :ok,
:f :txn,
:value
[[:poll ...]
[:poll ...]
[:send 56 [1725 661]]
[:send 55 [2245 800]]],
:time 828187300771,
:process 590,
:index 95996}}
OK, so something went wrong on key 54, around value 436. What operation wrote that? The result claims it has the writer here, but just to double-check, let's look at the history ourselves.
jepsen.redpanda.core=> (->> t :history (q/writes-of-key-value 54 436) pprint)
({:type :invoke,
:f :txn,
:value [[:send 54 436] [:poll] [:poll]],
:time 791552742537,
:process 648}
{:type :ok,
:f :txn,
:value [[:send 54 [1114 436]] [:poll {}] [:poll {}]],
:time 800986095314,
:process 648,
:rebalance-log
[{:type :revoked, :keys [55 4]} {:type :revoked, :keys [53]}]})
Here's the invocation and the completion of the operation which wrote key 54. Sure enough, it appears to have suceeded! Was it ever read?
jepsen.redpanda.core=> (->> t :history (q/reads-of-key-value 54 436) pprint)
()
Nothing ever read it. What about nearby messages? We know this write ostensibly went to offset 1114--let's look at everything that interacted with the local neighborhood of, say, five offsets before and after 1114.
jepsen.redpanda.core=> (->> t :history (q/around-key-offset 54 1114 5) pprint)
({:type :ok,
:f :txn,
:value ([:send 54 [1117 427]]),
:time 791596972858,
:process 778}
{:type :ok,
:f :txn,
:value ([:send 54 [1110 434]]),
:time 791613396199,
:process 584}
{:type :ok,
:f :txn,
:value ([:send 54 [1114 436]]),
:time 800986095314,
:process 648,
:rebalance-log
[{:type :revoked, :keys [55 4]} {:type :revoked, :keys [53]}]}
{:type :ok,
:f :txn,
:value ([:poll {54 ([1110 434] [1117 427])}]),
:time 828187300771,
:process 590}
... lots more polls ...
{:f :poll,
:value ([:poll {54 ([1110 434] [1117 427])}]),
:poll-ms 1000,
:time 1039009647614,
:process 868,
:type :ok})
So we successfully wrote 427 to offset 1117, 434 to offset 1110, and 436 to offset 1114. Yet somehow when we went to read anything in this neighborhood, we only observed values 434 and 427---no 436! This very much looks to be a lost write! Note that around-key-offset has trimmed the polls and sends inside each operation in order to show us only those parts relevant to this particular key and region of offsets. The real poll operations here have hundreds of messages each, so this is much easier to read.
You'll find several more functions for slicing and dicing history operations in
jepsen.redpanda.workload.queue
.
If you're running in containers Redpanda may fail to start, citing a need to
increase /proc/sys/fs/aio-max-nr
on the host OS--individual containers can't
alter it themselves. Try
sudo sh -c 'echo 10000000 > /proc/sys/fs/aio-max-nr'
Copyright © 2021, 2022 Jepsen. LLC
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.