
Commit 8edd89d

storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy
All commits from #32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change: a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pool`s when they are no longer referenced.

One of the main ideas in #31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch. This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rules:

1. Trees with exclusive ownership (refcount == 1) over a node can modify it in place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.

By following these simple rules, we end up with a really nice property: trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without needing any external API.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and makes the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools: one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double-digit ns) and is negligible compared to the speedup observed in #32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```
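To make the intended usage concrete, here is a minimal sketch of the clone-under-lock pattern described above. This is illustrative only: the `snapshotAndMutate` helper and the caller-provided mutex are assumptions for the example; `Clone`, `Set`, `Delete`, and `Reset` are the btree methods touched by this change.

```go
// Sketch: a writer snapshots the tree under lock, then keeps mutating the
// live tree while readers consume the immutable snapshot without the lock.
func snapshotAndMutate(mu *sync.Mutex, live *btree, add, del *cmd) btree {
    mu.Lock()
    defer mu.Unlock()

    snap := live.Clone() // O(1): only bumps the root's reference count.

    // Rule 1: nodes owned exclusively by `live` (refcount == 1) are mutated
    // in place. Rule 2: nodes shared with `snap` are cloned lazily on first
    // modification, after which `live` owns the copies.
    live.Set(add)
    live.Delete(del)

    // snap is now safe for concurrent reads; callers Reset() it when they are
    // done so unreferenced nodes return to the leaf/node sync.Pools.
    return snap
}
```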
We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.

```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Release note: None
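For context, the `reset=true` variant referenced above corresponds roughly to a loop shaped like the following. This is a paraphrase of the benchmark's structure, not the actual test code from #32165; the function and variable names are illustrative.

```go
// Paraphrased shape of BTreeDeleteInsertCloneEachTime: clone on every
// iteration, mutate the clone, and (optionally) Reset the previous clone so
// its exclusively-owned nodes drop to refcount 0 and return to the pools.
func deleteInsertCloneEachTime(b *testing.B, tr btree, cmds []*cmd, reset bool) {
    prev := tr.Clone()
    for i := 0; i < b.N; i++ {
        next := prev.Clone()
        c := cmds[i%len(cmds)]
        next.Delete(c)
        next.Set(c)
        if reset {
            // Without this, retired clones keep their node references alive
            // until the Go GC collects them, so every iteration must allocate
            // fresh nodes (the reset=false rows above).
            prev.Reset()
        }
        prev = next
    }
}
```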
1 parent 5505d43 commit 8edd89d

2 files changed: +267 −24 lines

pkg/storage/cmdq/interval_btree.go

+143 −24
```diff
@@ -18,6 +18,8 @@ import (
     "bytes"
     "sort"
     "strings"
+    "sync"
+    "sync/atomic"
     "unsafe"

     "github.com/cockroachdb/cockroach/pkg/roachpb"
@@ -106,21 +108,121 @@ func upperBound(c *cmd) keyBound {
 }

 type leafNode struct {
-    max   keyBound
+    ref   int32
     count int16
     leaf  bool
+    max   keyBound
     cmds  [maxCmds]*cmd
 }

-func newLeafNode() *node {
-    return (*node)(unsafe.Pointer(&leafNode{leaf: true}))
-}
-
 type node struct {
     leafNode
     children [maxCmds + 1]*node
 }

+func leafToNode(ln *leafNode) *node {
+    return (*node)(unsafe.Pointer(ln))
+}
+
+func nodeToLeaf(n *node) *leafNode {
+    return (*leafNode)(unsafe.Pointer(n))
+}
+
+var leafPool = sync.Pool{
+    New: func() interface{} {
+        return new(leafNode)
+    },
+}
+
+var nodePool = sync.Pool{
+    New: func() interface{} {
+        return new(node)
+    },
+}
+
+func newLeafNode() *node {
+    n := leafToNode(leafPool.Get().(*leafNode))
+    n.leaf = true
+    n.ref = 1
+    return n
+}
+
+func newNode() *node {
+    n := nodePool.Get().(*node)
+    n.ref = 1
+    return n
+}
+
+// mut creates and returns a mutable node reference. If the node is not shared
+// with any other trees then it can be modified in place. Otherwise, it must be
+// cloned to ensure unique ownership. In this way, we enforce a copy-on-write
+// policy which transparently incorporates the idea of local mutations, like
+// Clojure's transients or Haskell's ST monad, where nodes are only copied
+// during the first time that they are modified between Clone operations.
+//
+// When a node is cloned, the provided pointer will be redirected to the new
+// mutable node.
+func mut(n **node) *node {
+    if atomic.LoadInt32(&(*n).ref) == 1 {
+        // Exclusive ownership. Can mutate in place.
+        return *n
+    }
+    // If we do not have unique ownership over the node then we
+    // clone it to gain unique ownership. After doing so, we can
+    // release our reference to the old node.
+    c := (*n).clone()
+    (*n).decRef(true /* recursive */)
+    *n = c
+    return *n
+}
+
+// incRef acquires a reference to the node.
+func (n *node) incRef() {
+    atomic.AddInt32(&n.ref, 1)
+}
+
+// decRef releases a reference to the node. If requested, the method
+// will recurse into child nodes and decrease their refcounts as well.
+func (n *node) decRef(recursive bool) {
+    if atomic.AddInt32(&n.ref, -1) > 0 {
+        // Other references remain. Can't free.
+        return
+    }
+    // Clear and release node into memory pool.
+    if n.leaf {
+        ln := nodeToLeaf(n)
+        *ln = leafNode{}
+        leafPool.Put(ln)
+    } else {
+        // Release child references first, if requested.
+        if recursive {
+            for i := int16(0); i <= n.count; i++ {
+                n.children[i].decRef(true /* recursive */)
+            }
+        }
+        *n = node{}
+        nodePool.Put(n)
+    }
+}
+
+// clone creates a clone of the receiver with a single reference count.
+func (n *node) clone() *node {
+    var c *node
+    if n.leaf {
+        c = newLeafNode()
+        *nodeToLeaf(c) = *nodeToLeaf(n)
+    } else {
+        c = newNode()
+        *c = *n
+        // Increase refcount of each child.
+        for i := int16(0); i <= c.count; i++ {
+            c.children[i].incRef()
+        }
+    }
+    c.ref = 1
+    return c
+}
+
 func (n *node) insertAt(index int, c *cmd, nd *node) {
     if index < int(n.count) {
         copy(n.cmds[index+1:n.count+1], n.cmds[index:n.count])
@@ -246,7 +348,7 @@ func (n *node) split(i int) (*cmd, *node) {
     if n.leaf {
         next = newLeafNode()
     } else {
-        next = &node{}
+        next = newNode()
     }
     next.count = n.count - int16(i+1)
     copy(next.cmds[:], n.cmds[i+1:n.count])
@@ -286,7 +388,7 @@ func (n *node) insert(c *cmd) (replaced, newBound bool) {
         return false, n.adjustUpperBoundOnInsertion(c, nil)
     }
     if n.children[i].count >= maxCmds {
-        splitcmd, splitNode := n.children[i].split(maxCmds / 2)
+        splitcmd, splitNode := mut(&n.children[i]).split(maxCmds / 2)
         n.insertAt(i, splitcmd, splitNode)

         switch cmp := cmp(c, n.cmds[i]); {
@@ -299,7 +401,7 @@ func (n *node) insert(c *cmd) (replaced, newBound bool) {
             return true, false
         }
     }
-    replaced, newBound = n.children[i].insert(c)
+    replaced, newBound = mut(&n.children[i]).insert(c)
     if newBound {
         newBound = n.adjustUpperBoundOnInsertion(c, nil)
     }
@@ -316,7 +418,7 @@ func (n *node) removeMax() *cmd {
         n.adjustUpperBoundOnRemoval(out, nil)
         return out
     }
-    child := n.children[n.count]
+    child := mut(&n.children[n.count])
     if child.count <= minCmds {
         n.rebalanceOrMerge(int(n.count))
         return n.removeMax()
@@ -336,12 +438,12 @@ func (n *node) remove(c *cmd) (out *cmd, newBound bool) {
         }
         return nil, false
     }
-    child := n.children[i]
-    if child.count <= minCmds {
+    if n.children[i].count <= minCmds {
         // Child not large enough to remove from.
         n.rebalanceOrMerge(i)
         return n.remove(c)
     }
+    child := mut(&n.children[i])
     if found {
         // Replace the cmd being removed with the max cmd in our left child.
         out = n.cmds[i]
@@ -389,8 +491,8 @@ func (n *node) rebalanceOrMerge(i int) {
         //   v
         //   a
         //
-        left := n.children[i-1]
-        child := n.children[i]
+        left := mut(&n.children[i-1])
+        child := mut(&n.children[i])
         xCmd, grandChild := left.popBack()
         yCmd := n.cmds[i-1]
         child.pushFront(yCmd, grandChild)
@@ -428,8 +530,8 @@ func (n *node) rebalanceOrMerge(i int) {
         //   v
         //   a
         //
-        right := n.children[i+1]
-        child := n.children[i]
+        right := mut(&n.children[i+1])
+        child := mut(&n.children[i])
         xCmd, grandChild := right.popFront()
         yCmd := n.cmds[i]
         child.pushBack(yCmd, grandChild)
@@ -464,7 +566,9 @@ func (n *node) rebalanceOrMerge(i int) {
         if i >= int(n.count) {
            i = int(n.count - 1)
         }
-        child := n.children[i]
+        child := mut(&n.children[i])
+        // Make mergeChild mutable, bumping the refcounts on its children if necessary.
+        _ = mut(&n.children[i+1])
         mergeCmd, mergeChild := n.removeAt(i)
         child.cmds[child.count] = mergeCmd
         copy(child.cmds[child.count+1:], mergeChild.cmds[:mergeChild.count])
@@ -474,6 +578,7 @@ func (n *node) rebalanceOrMerge(i int) {
         child.count += mergeChild.count + 1

         child.adjustUpperBoundOnInsertion(mergeCmd, mergeChild)
+        mergeChild.decRef(false /* recursive */)
     }
 }

@@ -547,25 +652,39 @@ type btree struct {
     length int
 }

-// Reset removes all cmds from the btree.
+// Reset removes all cmds from the btree. In doing so, it allows memory
+// held by the btree to be recycled. Failure to call this method before
+// letting a btree be GCed is safe in that it won't cause a memory leak,
+// but it will prevent btree nodes from being efficiently re-used.
 func (t *btree) Reset() {
-    t.root = nil
+    if t.root != nil {
+        t.root.decRef(true /* recursive */)
+        t.root = nil
+    }
     t.length = 0
 }

-// Silent unused warning.
-var _ = (*btree).Reset
+// Clone clones the btree, lazily.
+func (t *btree) Clone() btree {
+    c := *t
+    if c.root != nil {
+        c.root.incRef()
+    }
+    return c
+}

 // Delete removes a cmd equal to the passed in cmd from the tree.
 func (t *btree) Delete(c *cmd) {
     if t.root == nil || t.root.count == 0 {
         return
     }
-    if out, _ := t.root.remove(c); out != nil {
+    if out, _ := mut(&t.root).remove(c); out != nil {
         t.length--
     }
     if t.root.count == 0 && !t.root.leaf {
+        old := t.root
         t.root = t.root.children[0]
+        old.decRef(false /* recursive */)
     }
 }

@@ -575,16 +694,16 @@ func (t *btree) Set(c *cmd) {
     if t.root == nil {
         t.root = newLeafNode()
     } else if t.root.count >= maxCmds {
-        splitcmd, splitNode := t.root.split(maxCmds / 2)
-        newRoot := &node{}
+        splitcmd, splitNode := mut(&t.root).split(maxCmds / 2)
+        newRoot := newNode()
         newRoot.count = 1
         newRoot.cmds[0] = splitcmd
         newRoot.children[0] = t.root
         newRoot.children[1] = splitNode
         newRoot.max = newRoot.findUpperBound()
         t.root = newRoot
     }
-    if replaced, _ := t.root.insert(c); !replaced {
+    if replaced, _ := mut(&t.root).insert(c); !replaced {
         t.length++
     }
 }
```
