storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy
All commits from #32165 except the last one.
This change introduces O(1) btree cloning and a new copy-on-write scheme,
essentially giving the btree an immutable API (for which I took inspiration
from https://docs.rs/crate/im/). This is made efficient by the second part
of the change: a new garbage collection policy for btrees. Nodes are now
reference counted atomically and freed into global `sync.Pool`s when they
are no longer referenced.
One of the main ideas in #31997 is to treat the btrees backing the command
queue as immutable structures. In doing so, we adopt a copy-on-write scheme.
Trees are cloned under lock and then accessed concurrently. When future
writers want to modify the tree, they can do so by cloning any nodes that
they touch. This commit provides this functionality in a much more elegant
manner than 6994347. Rather than giving each node a "copy-on-write context",
we give each node a reference count. We then use the following rules:
1. trees with exclusive ownership (refcount == 1) over a node can modify
it in-place.
2. trees without exclusive ownership over a node must clone the node
in order to modify it. Once cloned, the tree will now have exclusive
ownership over that node. When cloning the node, the reference count
of all of the node's children must be incremented.
By following these two simple rules, we end up with a really nice property:
trees gain more and more "ownership" as they make modifications, meaning
that subsequent modifications are much less likely to need to clone nodes.
Essentially, we transparently incorporate the idea of local mutations
(e.g. Clojure's transients or Haskell's ST monad) without any external
API needed.
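The two rules, and the ownership property they produce, can be sketched in
Go roughly as follows. This is a minimal illustration and not the code from
this commit: the `node` layout, the fixed two-level `tree`, the names `mut`,
`clone`, `decRef`, `Set` and `Get`, and the `clones` counter are all
assumptions made for the example. Node splitting and merging are omitted.

```go
package main

import "sync/atomic"

// node is a hypothetical, simplified node: the root uses children,
// the leaves use vals.
type node struct {
	ref      int32 // number of trees/parents that reference this node
	vals     []int
	children []*node
}

// clones counts node copies so the effect of ownership can be observed.
// It is not atomic; the demo is single-goroutine.
var clones int

// clone copies n. The copy shares n's children, so each child gains a
// reference (the last sentence of rule 2).
func (n *node) clone() *node {
	clones++
	c := &node{ref: 1}
	c.vals = append([]int(nil), n.vals...)
	c.children = append([]*node(nil), n.children...)
	for _, ch := range c.children {
		atomic.AddInt32(&ch.ref, 1)
	}
	return c
}

// decRef drops one reference. Whoever takes the count to zero also
// releases the node's references to its children.
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return
	}
	for _, ch := range n.children {
		ch.decRef()
	}
}

// mut returns a node that is safe to modify through the slot *p.
func mut(p **node) *node {
	if atomic.LoadInt32(&(*p).ref) == 1 {
		return *p // rule 1: exclusive ownership, modify in place
	}
	c := (*p).clone() // rule 2: shared, so copy first
	(*p).decRef()     // give up our reference to the shared node
	*p = c
	return c
}

// tree is a fixed two-level structure: one root over a row of leaves.
type tree struct{ root *node }

func newTree(leaves, width int) *tree {
	root := &node{ref: 1}
	for i := 0; i < leaves; i++ {
		leaf := &node{ref: 1, vals: make([]int, width)}
		root.children = append(root.children, leaf)
	}
	return &tree{root: root}
}

// Clone is O(1): it shares the root and bumps its reference count.
func (t *tree) Clone() *tree {
	atomic.AddInt32(&t.root.ref, 1)
	return &tree{root: t.root}
}

// Set writes v, calling mut on every node along the path it touches.
func (t *tree) Set(leaf, slot, v int) {
	root := mut(&t.root)
	mut(&root.children[leaf]).vals[slot] = v
}

func (t *tree) Get(leaf, slot int) int {
	return t.root.children[leaf].vals[slot]
}
```

Under these assumptions, `Clone` copies no nodes, the first `Set` on the
clone copies the root and one leaf, and a second `Set` that touches the same
leaf copies nothing, because the clone now owns both nodes exclusively. The
original tree is never modified by writes to the clone.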
Even better, reference counting internal nodes ties directly into the
new GC policy, which allows us to recycle old nodes and make the copy-on-write
scheme zero-allocation in almost all cases. When a node's reference count
drops to 0, we simply toss it into a `sync.Pool`. We keep two separate
pools: one for leaf nodes and one for non-leaf nodes. Recycling nodes like
this wasn't possible with the previous "copy-on-write context" approach.
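The release path described above might look something like the following
sketch. It is an illustration under stated assumptions, not this commit's
code: the single `node` struct with a `leaf` flag, the pool variable names,
and the helpers `newNode`, `incRef` and `decRef` are all hypothetical.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// node is a hypothetical, simplified node. One struct serves both kinds
// here; only the pool it is recycled into differs.
type node struct {
	ref      int32
	leaf     bool
	keys     []int
	children []*node
}

// Two global pools, one per node kind.
var (
	leafPool     = sync.Pool{New: func() interface{} { return &node{leaf: true} }}
	interiorPool = sync.Pool{New: func() interface{} { return &node{} }}
)

// newNode takes a node from the matching pool. The pool only calls New
// (and so only allocates) when it has no recycled node to hand out.
func newNode(leaf bool) *node {
	pool := &interiorPool
	if leaf {
		pool = &leafPool
	}
	n := pool.Get().(*node)
	n.ref = 1
	return n
}

// incRef records one more owner of n.
func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef drops one reference. The caller that takes the count to zero
// releases the node's children and puts the node back in its pool.
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return // still referenced elsewhere
	}
	for _, c := range n.children {
		c.decRef()
	}
	leaf := n.leaf
	*n = node{leaf: leaf} // clear state so no stale pointers are retained
	if leaf {
		leafPool.Put(n)
	} else {
		interiorPool.Put(n)
	}
}
```

Note that `sync.Pool` makes no promise that a later `Get` returns a
previously pooled node, since pooled items may be dropped at any time, so
reuse is an optimization rather than a guarantee.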
The atomic reference counting does have an effect on benchmarks, but
it's not a big one (single- or double-digit nanoseconds per operation) and
is negligible compared to the speedup observed in #32165.
```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```
We can see how important the GC and memory re-use policy is by comparing
the following few benchmarks. Specifically, notice the difference in
operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime`
between the tests that `Reset` old clones (allowing nodes to be freed into
`sync.Pool`s) and the tests that don't `Reset` old clones.
```
name time/op
BTreeDeleteInsert/count=16-4 198ns ±28%
BTreeDeleteInsert/count=128-4 375ns ± 3%
BTreeDeleteInsert/count=1024-4 577ns ± 2%
BTreeDeleteInsert/count=8192-4 798ns ± 1%
BTreeDeleteInsert/count=65536-4 2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4 173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4 379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4 584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4 800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4 2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4 535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4 1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4 2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4 2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4 5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4 240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4 610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4 1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4 1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4 3.52µs ±18%
name alloc/op
BTreeDeleteInsert/count=16-4 0.00B
BTreeDeleteInsert/count=128-4 0.00B
BTreeDeleteInsert/count=1024-4 0.00B
BTreeDeleteInsert/count=8192-4 0.00B
BTreeDeleteInsert/count=65536-4 0.00B
BTreeDeleteInsertCloneOnce/count=16-4 0.00B
BTreeDeleteInsertCloneOnce/count=128-4 0.00B
BTreeDeleteInsertCloneOnce/count=1024-4 0.00B
BTreeDeleteInsertCloneOnce/count=8192-4 0.00B
BTreeDeleteInsertCloneOnce/count=65536-4 1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4 288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4 897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4 1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4 1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4 2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4 0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4 0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4 0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4 0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4 0.00B
name allocs/op
BTreeDeleteInsert/count=16-4 0.00
BTreeDeleteInsert/count=128-4 0.00
BTreeDeleteInsert/count=1024-4 0.00
BTreeDeleteInsert/count=8192-4 0.00
BTreeDeleteInsert/count=65536-4 0.00
BTreeDeleteInsertCloneOnce/count=16-4 0.00
BTreeDeleteInsertCloneOnce/count=128-4 0.00
BTreeDeleteInsertCloneOnce/count=1024-4 0.00
BTreeDeleteInsertCloneOnce/count=8192-4 0.00
BTreeDeleteInsertCloneOnce/count=65536-4 0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4 1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4 2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4 3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4 3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4 4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4 0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4 0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4 0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4 0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4 0.00
```
Release note: None