<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Exploring the design space of binary search trees</title>
<meta name="description" content="An open-source research project that explores the design space of binary search trees to implement persistent and concurrent linear list data structures. This work includes multiple original contributions to data structures and algorithms, illustrations, console animations, and benchmarks." />
<meta name="keywords" content="binary search trees, avl, red-black, wavl, ravl, lbst, wbst, scapegoat tree" />
<meta property="og:title" content="Exploring the design space of binary search trees" />
<meta property="og:description" content="Open-source research project that explores the design space of binary search trees to implement persistent and concurrent linear list data structures." />
<meta property="og:image" content="https://raw.githubusercontent.com/rtheunissen/bst/main/docs/preview.png" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@rtheunissen_" />
<meta name="twitter:title" content="Exploring the design space of binary search trees" />
<meta name="twitter:description" content="Open-source research project that explores the design space of binary search trees to implement persistent and concurrent linear list data structures." />
<meta name="twitter:image" content="https://raw.githubusercontent.com/rtheunissen/bst/main/docs/preview.png" />
<style>
{{ inline "docs/katex/katex.min.css" }}
{{ inline "docs/index.css" }}
</style>
</head>
<body>
<!-- This is the template for inline citations. -->
{{ define "reference" }}{{ $reference := reference . }}<a class="citation" href="#reference-{{ $reference.Index }}">[{{ $reference.Index }}]</a>{{ end }}
<article>
<h1 id="title">Exploring the design space of binary search trees</h1>
<small>
Rudi Theunissen <br> Last updated: {{ date }}
</small>
<div class="noprint">
<a href="https://github.com/rtheunissen/bst"><small>Repository</small></a> /
<a href="/bst/docs/benchmarks/svg/operations/AVLRedBlack/Insert/"><small>Benchmarks</small></a> /
<a href="/bst/docs/animations/"><small>Animations</small></a> /
<a href="https://news.ycombinator.com/item?id=37130200"><small>Discussion</small></a>
</div>
<section id="abstract">
<p>
The study of <em>binary search trees</em> has been a fundamental topic in computer science going back to the 1950s.
Anyone who has studied computer science or prepared for a technical programming interview is familiar with them, and many leading courses on algorithms and data structures teach them, including those at Princeton,{{ template "reference" "2023_princeton" }} Stanford,{{ template "reference" "2023_stanford" }} and UC Berkeley.{{ template "reference" "2023_uc_berkeley" }}
These courses usually teach the same selection of algorithms in a similar way, though never exactly the same.
The motivation for this project, in part, was to innovate the way we teach fundamental algorithms and data structures, specifically binary search trees.
Speaking from experience, some students might not realize how interesting this space can be because they are too distracted by seemingly arbitrary rules and acronyms.
</p>
<p>
Over the years, some binary search tree algorithms have come along that are both more intuitive and more efficient than the ones we usually learn about, but good implementations are often difficult to find and therefore difficult to teach.
This project attempts to explore as many of these algorithms as possible to produce a reference that is easy to teach and learn from.
Being mostly light on theory, we focus instead on experimentation and measurement.
A few of the algorithms appear to be implemented for the first time, and some original ideas are also considered.
</p>
<p>
<a href="#introduction">Part 1</a> briefly covers foundational ideas like memory allocation, linear lists, position, recursion, traversal, balance, persistence, and concurrency.
This introduction explores a pedagogical approach to binary search trees as a natural extension of linked lists.
Since linked lists are usually covered before binary search trees, the idea to implement the same interface as before but now with a different structure is a nice segue, cohesive and consistent.
</p>
<p>
<a href="#restoring-balance">Part 2</a> covers definitions of balance and explores various algorithms to globally balance a binary search tree.
Topics to cover include tree rotations, partitioning, complexity, and the binary logarithm.
This part is an opportunity to become more familiar with the binary tree structure and its measurements.
</p>
<p>
<a href="#self-balancing-trees">Part 3</a> explores self-balancing trees, which is where most of the research and discussion usually happens.
There are many strategies to consider, including rank-balanced trees, weight-balanced trees, randomized trees, finger trees, and self-adjusting trees.
Some <em>weak</em> and <em>relaxed</em> variants are also implemented and evaluated.
There are multiple benchmarks and measurements to determine which strategies are better than others in different situations, and whether there might be a case to prefer other algorithms to the ones typically found in textbooks and lectures.
</p>
</section>
<div class="page-break"></div>
<section id="introduction">
<h2><span>Part 1</span>Introduction</h2>
<section id="memory">
<h3>Abstract memory</h3>
<p>
A computer program uses memory to keep information in mind while working on a problem, similar to the human concept of working memory.
When more memory is needed, a program can ask an <em>allocator</em> to reserve some, and in return receive a unique address to access that memory.
All the memory that was allocated by a program is usually freed by the time the program exits, but during its lifetime there can be many requests to allocate, read, write, and free memory.
</p>
<p>
These memory interactions all have an energy cost because work must be done to track allocated memory, resolve addresses, and write to memory.
A program is considered efficient when it meets its functional requirements using only a small amount of energy relative to the complexity of the problem it solves.
Some problems can not avoid frequent interaction with memory, but an inefficient algorithm wastes energy by interacting with memory more often than it needs to.
</p>
<p>
This field of research relates to the organization of information in memory, focusing on the <em>organization</em> of the information rather than the information itself.
The goal is to design structures in memory to organize information in useful ways, such as to be ordered, searchable, or countable.
The information and the structures that organize it usually exist together in memory, and programs can use these structures to organize information as needed, rather than interacting with memory directly.
</p>
<p>
There are structures that organize information all around us, like filing cabinets and books and vinyl records.
Consider that an empty notebook contains no information, but the structure is there for information to exist.
A step-by-step cooking recipe might be written as a numbered list to suggest to the reader to follow the steps in order, one at a time.
The paper on which the recipe is written is the memory, the recipe is the information, the numbered list is the structure.
</p>
<p>
Using computer memory as a design material might seem intangible, but every bit of information exists physically on hardware all the same.
The computer, however, does not know anything about semantics and relationships — there is only one large pool of memory to interact with, and our task is to provide structures as a layer of abstraction between the program and the memory.
</p>
</section>
<div class="page-break"></div>
<section id="linear-lists">
<h3>Linear lists</h3>
<p>
This project explores specifically the implementation of a <em>linear list</em>, which is a one-dimensional ordered sequence of a known length, indexed by position.
Lists are fundamental and are often introduced as the first abstract data type when learning about data structures.
So many things are lists... search results, queues, events or transactions occurring over time — even the text of this article is a sequence of characters arranged in a specific order. The same characters in different positions would be gibberish, so the position of each character in the sequence is important.
</p>
<p>
For a data structure to qualify as a list, it must support the operations that describe the list data type.
A program using a given list implementation might not know exactly how the structure works internally, only that it behaves like a list.
</p>
<p>
The following operations describe a reasonable list data type:
</p>
<ul>
<li><em>Select</em> produces the value at a given position.</li>
<li><em>Update</em> replaces the value at a given position.</li>
<li><em>Insert</em> introduces a new value at a given position.</li>
<li><em>Delete</em> removes the value at a given position.</li>
<li><em>Join</em> combines two lists into a new list.</li>
<li><em>Split</em> separates a list at a given position.</li>
</ul>
<section id="dynamic-arrays">
<h4>Arrays</h4>
<p>
Perhaps the most common implementation of a list in practice is the <em>array</em>, which is a block of adjacent fixed-size memory cells, allocated all at once, and indexed directly as an offset from the front of the block.
The memory allocated by an array is the number of cells multiplied by the information capacity of each cell.
The array is the structure and the values stored in each cell are the information.
</p>
<figure>
{{ $figure := figure "array" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: An array with a size of 7 and a capacity of 8.
</p>
</figure>
<p>
To illustrate this, we can use a sheet of grid paper to represent an allocated array where every square is a cell in which a single character could be stored.
The sheet of grid paper models the array and the grid squares that have characters written in them form the information as a sequence.
<!-- Keep in mind however, there can be no spaces between the characters of the sequence within the array because the memory address of a cell is determined by its offset from the start of the array.-->
<!-- Starting with a blank sheet as an empty list, we might insert one character at a time from the top left corner, across the page by column and down the page by row, as we start filling up the memory of the array.-->
<!-- However, we could also start in the bottom right and spiral inward towards the center.-->
<!-- We are free to organize the memory however we like as long as we continue to maintain the behavior of a list.-->
The sequence of characters recorded on the sheet of grid paper has semantic meaning to the person or program asking us to record them, but the structure itself is not aware of that meaning.
</p>
<p>
Data structures usually offer theoretical bounds for every operation to indicate how long it might take or how much memory it might allocate, usually proportional to some measurable property of the structure, such as its size.
Some data structures are particularly efficient for certain operations, but less so for others.
</p>
<p>
Arrays are great for accessing a value by position because there are no empty spaces between values in the array, which allows the memory address to be calculated as an offset from the address of the array itself.
This is called <em>constant time</em>, because the cost of the calculation is the same regardless of the size of the array.
Inserting a value at the end of the array is also very efficient: simply write into the next empty cell.
</p>
<p>
Consider however the case where a value is inserted somewhere within the array, or in the worst-case at the beginning of the array.
Since there can be no empty cells between values, space must be created by moving every value from that position one step towards the back of the array.
The cost of this movement of memory is <em>linear</em> in the size of the array, because the upper-bound on the amount of memory to move is exactly the size of the array.
Deletion is similarly costly because it leaves an empty cell, which requires space to be compacted in exactly the opposite way.
</p>
<p>
What makes an array <em>dynamic</em> is the ability to increase its capacity by reallocating the entire sequence into a larger block of memory, thereby copying every value from the original array to the new block.
Doing so is linear in the size of the array but occurs infrequently enough to amortize the cost, allowing many low-cost operations to make up for infrequent expensive ones.
Common ways to avoid this cost are to predict a good upper-bound for the initial capacity to avoid reallocation entirely, or to always double the previous capacity to effectively amortize the cost of reallocation.
</p>
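<p>
As a sketch of the doubling strategy, in the pseudocode style used throughout this project — the <em>Array</em> structure and <em>grow</em> function here are illustrative, not part of the reference implementation:
</p>
<pre>
<strong>append</strong> (a *Array, x Data) {
    if a.size == a.capacity {
        if a.capacity == 0 {
            grow(a, 1)              // allocate the first cell
        } else {
            grow(a, 2 * a.capacity) // reallocate and copy every value
        }
    }
    a.cells[a.size] = x
    a.size = a.size + 1
}</pre>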
</section>
<section id="linked-lists">
<h4>Linked lists</h4>
<p>
Found in every textbook on algorithms and data structures is the <em>linked list</em>, which consists of linked <em>nodes</em> that each contain one value and a pointer to the next node in the sequence.
The structure provides a pointer to the first node in the sequence, thereby allowing sequential access by following one link at a time from there.
The last node of the list points to an empty node, illustrated as <i>{{ inline "docs/plots/figures/leaf.svg" }}</i>.
</p>
<p>
To insert a new value at a given position within the list, start at the first node and follow links one node at a time until the number of links followed equals the search position.
At this point we can allocate a node for the new value, point the previous node to the new node, and point the new node to the current node.
</p>
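<p>
A sketch of this insertion algorithm, assuming an illustrative singly-linked <em>List</em> structure with a pointer to its first node:
</p>
<pre>
<strong>insert</strong> (list *List, i Position, x Data) {
    n = &Node{x: x}
    if i == 0 {
        n.next = list.head      // the new node becomes the first node
        list.head = n
        return
    }
    p = list.head
    for j = 1; j < i; j++ {     // follow links until the previous node
        p = p.next
    }
    n.next = p.next             // point the new node to the current node
    p.next = n                  // point the previous node to the new node
}</pre>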
<figure>
{{ $figure := figure "linked_list" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: A linked list of size 7, illustrating the empty node at the end as <i>{{ inline "docs/plots/figures/leaf.svg" }}</i>.
</p>
</figure>
<!-- <p>-->
<!-- Imagine a very long line of people in a queue, and everyone is asked to point to the next person in line after them to maintain order.-->
<!-- As people join the queue they are given a random position between zero and the number of people currently in line, which indicates the number of people they may cut in front of.-->
<!-- This gives everyone some chance of ending up towards the front even when they arrive late.-->
<!-- No one in line knows their actual position, since others may have joined the line behind them or ahead of them since they arrived.-->
<!-- The only way to operate this structure as a list is to start at the front of the line and count people one by one.-->
<!-- </p>-->
<!-- <p>-->
<!-- All the operations of a linked list require that we follow a number of links equal to the given search position.-->
<!-- The cost is therefore always linear in the size of the list, and there is no way to avoid that because the only way to reach a specific node is to follow the path that leads to it from the front.-->
<!-- One benefit perhaps is that there is no need to allocate a large amount of memory all at once or to reallocate the structure because memory is allocated individually for each value as needed.-->
<!-- </p>-->
</section>
</section>
<div class="page-break"></div>
<section id="binary-search-trees">
<h3>Binary search trees</h3>
<p>
Instead of starting at the front of the list, what if we started the search at a node in the middle?
Nodes to the right of this median point to the next node in the sequence, as they did before, but nodes to the left of the median now point to the previous node towards the start of the sequence.
The node in the middle should always know its position in the sequence since all operations pass through it.
</p>
<p>
Applying the same strategy recursively to the left and right sides of the median node produces a structure known as a <em>binary search tree</em>.
Every node now has two links, one to its left subtree and another to its right subtree.
A node might link to an empty subtree, known as a <em>leaf</em> node, but the pointers themselves are still there because they are part of the allocated node structure.
There is an empty subtree at the start of the sequence, between every node, and at the end of the sequence.
</p>
<figure>
{{ $figure := figure "binary_search_tree" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: An illustration of the transformation from a linked list to a binary search tree.
The median node now has <em>two</em> pointers: one going left and another going right.
The initial median node, in this case <i>{{ inline "docs/plots/figures/d.svg" }}</i>, is at the top of the tree and is commonly referred to as the <em>root</em> node.
The horizontal coordinates of the nodes stay the same as the nodes line up vertically.
</p>
</figure>
<!-- <p>-->
<!-- The number of leaves is therefore always one more than the number of nodes, which suggests that about half of the pointers in a binary search tree point to nothing.-->
<!-- This is similar to the worst-case overhead in dynamic arrays after a reallocation that doubled the size of the array.-->
<!-- One way to avoid this is to use a special type of node without pointers at the bottom of the tree, and upgrade it to an extended structure with pointers as needed.-->
<!-- </p>-->
<figure>
{{ $figure := figure "binary_search_tree_large" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>:
A complete binary tree, not showing leaf nodes.
This tree seems large but has only 127 nodes — what would a tree of <span class="numeric">1,000,000</span> nodes and <span class="numeric">∼20</span> levels look like?
Notice the repeating patterns and that the number of nodes per horizontal level is twice the number of nodes of the level above.
</p>
</figure>
<p>
The fundamental rule that defines a binary search tree is that the nodes of the left subtree occur sequentially before the root node, and the nodes of the right subtree occur sequentially after the root node.
This creates a total linear ordering of all values, and the same sequence as the original linked list can be produced from the tree structure.
We do this with an <em>inorder tree traversal</em>, which visits every node in sequential order.
</p>
<pre>
<strong>inorder</strong> (p *Node, visit func(*Node)) {
if p == nil {
return
}
inorder(p.l, visit)
visit(p)
inorder(p.r, visit)
}
</pre>
<p class="caption">
<!-- This algorithm visits every node in the tree in sequential order: starting at the root, perform a traversal of the left subtree, then visit the root, then perform a traversal of the right subtree.-->
<!-- The first node visited by this algorithm is therefore the left-most node of the tree and the last node is the right-most node.-->
Starting at the root, perform an inorder traversal of the left subtree, then visit the root, then perform an inorder traversal of the right subtree.
The first node visited by this algorithm is therefore the left-most node of the tree, and the last node visited is the right-most node.
As an exercise, try to apply this algorithm manually to the tree at the bottom of Figure {{ (figure "binary_search_tree").Index }} starting at the root node <i>{{ inline "docs/plots/figures/d.svg" }}</i>.
<!-- To illustrate this, draw a binary search tree on a grid where the x-coordinate is the position of the node in the sequence, and the y-coordinate is the length of the longest path.-->
<!-- Dragging a vertical line across the page from left to right will intersect the nodes in sequential order.-->
<!-- The only address we have in hand at the start of the traversal is the root, so to get to the left-most node we need to follow along the <em>left spine</em> of the tree first, eventually coming back up to the root before descending to the left-most node of the right subtree, and so forth.-->
</p>
<section id="leaf-insertion">
<h4>Leaf insertion</h4>
<p>
The leaves of a binary search tree are where new nodes are attached to the structure.
To insert a new value at a given position, start by searching for the leaf corresponding to that position, then replace it with a new node allocated for the value, thereby also allocating two new leaves.
This is called <em>leaf insertion</em> and forms the basis of many other insertion algorithms.
</p>
<figure>
{{ $figure := figure "insert_leaf" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>:
The tree on the left shows the search path to insert a new node into the tree, starting from the root node down to a leaf node indicated by <i>{{ inline "docs/plots/figures/leaf.svg" }}</i>.
The tree on the right shows the resulting tree with a new node in place of the previous leaf node.
Keep in mind that the new node would have two leaf nodes.
</p>
</figure>
</section>
<section id="relative-position">
<h4>Relative position</h4>
<p>
We now have a binary tree data structure with nodes pointing to other nodes, but it is not yet clear how to search for a node by position.
Searching through a linked list is simple because we traverse and count exactly one node at a time.
The information obtained by counting is the position of every node along the way, where the <em>relative</em> position of every node is exactly 1.
</p>
<p>
When we follow a branch in a binary search tree, we effectively skip all the nodes of the <em>other</em> branch.
From the root of the tree, the first branch skips about half the nodes, then the second branch skips about half again, then half again until we find the node at the position we are searching for.
</p>
<!-- <p>-->
<!-- To implement binary search, we need to determine whether to branch to the left or the right; do we reject the left half or the right half?-->
<!-- In the context of a list implementation, this determination requires some <em>comparison by position</em> to either seek further forward or backward in the sequence, relative to the current node.-->
<!-- With linked lists, we tracked the current search position by counting every node one at a time, but a binary search tree can skip <em>multiple</em> nodes at a time.-->
<!-- To apply a similar counting strategy to binary trees, we need to know the number of nodes in the left subtree: if we were to reject all the nodes of the left subtree, would we skip too far ahead?-->
<!-- </p>-->
<p>
This is known as <em>binary search</em>, which is a well-known algorithm independent of linked nodes and tree structures.
For example, ask someone to think of a number between 1 and some large number <span class="math">n</span>.
How many guesses would we need to guess it correctly, if after every guess we are told to go lower or higher?
</p>
<p>
The answer is no more than {{ inline "docs/katex/math/inline/log_2n.katex.html" }} by using binary search.
Start in the middle, then move to the middle of the left half if the guess was too high, or to the middle of the right half if the guess was too low.
Repeating this step until we guess correctly closes in on the number by halving the search space every time until there is only one choice left.
<!-- Consider for example a large dictionary where all the definitions are printed in alphabetical order.-->
<!-- How might we search for a particular definition?-->
<!-- Starting on the first page and searching one page at a time would be a <em>linear search</em>, and we will likely take a long time to find the definition we are looking for.-->
<!-- Our intuition might suggest to us that there is a more efficient way to do this.-->
<!-- Starting instead around the middle, though perhaps not the exact middle but somewhere close to it, we determine whether the page we are looking for is before or after the current page.-->
<!-- Because the directory is an <em>ordered</em> sequence, we know that we can reject either the left or right half entirely.-->
<!-- Repeating this step on the resulting half once again divides the search space in half, until eventually we close in on the page we are looking for.-->
</p>
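<p>
A sketch of this guessing strategy, where the assumed <em>hint</em> function returns a negative number to go lower, a positive number to go higher, and zero on a correct guess:
</p>
<pre>
<strong>guess</strong> (n int, hint func(int) int) int {
    lo = 1
    hi = n
    loop {
        m = lo + (hi - lo) / 2  // the middle of the remaining range
        h = hint(m)
        if h < 0 {
            hi = m - 1          // reject the right half
        } else if h > 0 {
            lo = m + 1          // reject the left half
        } else {
            return m
        }
    }
}</pre>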
<p>
To implement binary search in a binary tree, we need to determine whether to branch to the left or to the right; do we skip the left half or the right half?
Branching to the left seeks towards the start of the sequence by skipping all the nodes of the right subtree, and branching to the right seeks towards the end of the sequence by skipping all the nodes of the left subtree.
</p>
<p>
Since the pointers only reference in one direction away from the root, there is no way back up the search path once a branch is taken.
To determine which branch to take requires that we know the position of the current node relative to its tree, which is exactly the number of nodes in the left subtree, or the number of nodes that would be skipped by branching to the right.
</p>
<p>
Branching to the exact median at every step is not necessary for binary search to be effective.
Even a somewhat average approximation of each median might only increase the length of the search path by a few steps.
The subtrees might also not divide neatly down the middle, which is only possible when the size of the tree is exactly {{ inline "docs/katex/math/inline/2^n-1.katex.html" }}.
We therefore can not assume that the current node along the search path is the exact median of its subtree.
The only way to know the number of nodes in the left subtree is to either count them, or store that information as part of the node structure.
</p>
<p>
Perhaps the most common approach to solve this problem is to store the size of every node as part of its structure, which is equal to the number of nodes reachable from the node, including itself.
Adding a new node to the tree requires then that we increment the size field of every node along the path from the root to the new node.
To determine the size of the left subtree, and therefore the relative position of a node, we can follow its left link to read the size field of that node, wherein lies a weakness: we always follow the left link even when the search ends up branching to the right.
Ideally, to avoid unnecessary interaction with memory, the only nodes to read from memory are the nodes along the search path.
This strategy benefits from being symmetric, and being able to derive the size of a tree from the size of its root.
</p>
<p>
Another idea is to store in each node its position relative to its parent.
This approach results in a left node having a negative position equal to one less than the negative size of its right subtree, and a right node having a positive position one more than the size of its left subtree.
Assuming that the root node always has a positive position, the absolute position of any node is then equal to the sum of all relative positions along its path from the root.
This strategy is symmetrical, intuitive, and provides one bonus insight: a node is known to be a left or right descendant based on the sign of its position.
However, there are two downsides to this approach: (i) we require one bit to store the sign of the position, thereby halving the utilization of the size field, and (ii) the resulting algorithms require in some cases additional checks and arguments to indicate and normalize node orientation.
Insertion using this strategy would increment the size field when a search path branches left at a right node, and symmetrically decrement the size when branching right at a left node.
For example, inserting a node at the start of the sequence would increment the size of the root because the size of its left subtree is increasing, then descend along the left-most path without incrementing the size of any other nodes since they all store the size of their respective right subtrees.
Similarly, inserting a node at the end of the sequence would not increment the size of any node at all because all the nodes on the right-most path, including the root, store the size of their left subtrees unaffected by an insertion to the right.
In 2004, Schmücker{{ template "reference" "2004_jorg" }} contributed a list implementation using this approach to the Apache Commons Collections, and it is still part of the source at the time of this writing.
</p>
<p>
Alternatively, we could store specifically the size of the left subtree, as suggested by Knuth{{ template "reference" "1973_knuth_taocp_vol_3_section_6" }} and Crane{{ template "reference" "1972_clark_allan_crane" }}.
Keeping a separate <em>size</em> field in the outer tree structure as well as the reference to the root node then allows us to track the current subtree size at each step along the search path.
The size of the right subtree is then one less than the size of the current subtree minus the size of its left subtree.
This approach allows us to know both the left and right subtree sizes without the need to access either of them.
Intuitively, this pulls the subtree size information one level up in the tree.
Insertion then only increments the size field of a node when branching to the left, because inserting to the right would not affect the size of the left subtree.
This approach therefore reduces memory interaction in exchange for a slightly more complex tree structure and some asymmetry.
</p>
<p>
This project uses the approach where every node stores the size of its left subtree.
</p>
<pre>
<strong>Node</strong> {
x Data
s Size
l *Node
r *Node
}
<strong>Tree</strong> {
root *Node
size Size
}</pre>
<p class="caption">
These are the node and tree structures used by the algorithms in this project.
A node consists of a data field, a size field, and two pointers to other or empty nodes.
A tree consists of a pointer to the root node, if any, and the current size of the tree.
</p>
<pre>
<strong>search</strong> (p *Node, i Position) *Node {
loop {
if i == p.s {
return p
}
if i < p.s {
p = p.l
} else {
i = i - p.s - 1
p = p.r
}
}
}</pre>
<p class="caption">
When the position we are searching for is equal to the number of nodes in the left subtree, we have found the node we were looking for.
Skipping all the nodes of the left subtree would skip ahead to exactly this position in the sequence.
Otherwise, when the position is less than the size of the left subtree, we need to seek towards the start of the sequence because our current position is still too far ahead.
In this case, follow the link to the left and continue the search.
Otherwise, if the position is greater than the size of the left subtree, we know that we can reject the entire left subtree because even if we skipped all those nodes we would still need to seek further ahead in the sequence, and therefore the node we are looking for must be somewhere in the right subtree.
In this case, reduce the search position by the size of the left subtree, including the current node, then descend to the right to continue the search.
</p>
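<p>
To connect leaf insertion with this strategy, here is a recursive sketch that increments the size field only when branching to the left — the project's actual implementation may differ, for example by avoiding recursion:
</p>
<pre>
<strong>insert</strong> (p *Node, i Position, x Data) *Node {
    if p == nil {
        return &Node{x: x}      // replace the leaf with a new node
    }
    if i <= p.s {
        p.s = p.s + 1           // the left subtree is about to grow
        p.l = insert(p.l, i, x)
    } else {
        p.r = insert(p.r, i - p.s - 1, x)
    }
    return p
}</pre>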
</section>
</section>
<section id="path-length">
<h4>Path length</h4>
<p>
The <em>path length</em> of a node is the number of links to follow to reach it from the root of the tree, or one less than the number of nodes along its path from the root.
The total path length is the sum of the path length from the root to every other node, and the <em>average path length</em> is the total path length divided by the number of nodes.
Starting from halfway along some path and turning either left or right effectively halves the maximum path length of that path.
During the transformation of a linked list to a binary search tree, as shown in Figure {{ (figure "binary_search_tree").Index }}, whenever a new branch is created to the middle of a sublist it halves the maximum path length of that sublist.
What is the resulting maximum path length of the tree?
This question is effectively asking how many times the size of the tree can be divided by 2 before reaching 1.
</p>
<p>
This is known as the <em>binary logarithm</em>, written as {{ inline "docs/katex/math/inline/log2.katex.html" }}.
The binary log is frequently encountered in computer science because it relates so closely to binary numbers.
For example, the number of bits required to write an integer <span class="math">n</span> in binary is {{ inline "docs/katex/math/inline/floor_log2n_plus_1.katex.html" }}.
The same reasoning applies to decimal numbers, where the number of digits required to write an integer <span class="math">n</span> in decimal is {{ inline "docs/katex/math/inline/floor_log10n_plus_1.katex.html" }}, or the number of times we can divide by 10 until we get to 1.
A binary number therefore shortens by 1 bit when we divide by 2 and a decimal number shortens by 1 digit when we divide by 10.
</p>
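<p>
As a small sketch, the bit length of an integer can be counted by halving until zero, which for any <span class="math">n &gt; 0</span> is {{ inline "docs/katex/math/inline/floor_log2n_plus_1.katex.html" }}:
</p>
<pre>
<strong>bits</strong> (n int) int {
    count = 0
    for n > 0 {
        n = n / 2           // drop the lowest bit
        count = count + 1
    }
    return count
}</pre>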
</section>
<div class="page-break"></div>
<section id="delete">
<h4>Deletion</h4>
<p>
The general strategy to delete a node from a binary search tree is to search for the node to be deleted, then to replace it by the <em>join</em> of its subtrees.
A valid join function combines the left and right subtrees so that all the nodes in the left subtree still occur before all the nodes in the right subtree, while sharing a common root node.
The problem to solve is to determine from which subtree the root of the join should come, and exactly how to produce it.
</p>
<p>
In 1961, Hibbard{{ template "reference" "1961_hibbard" }} proposed a simple algorithm to join two subtrees: when the right subtree is empty, the result of the join is the left subtree as-is, otherwise when the right subtree is <em>not</em> empty, delete its left-most node and use that node as the joining root.
This algorithm therefore always produces the joining root from the right subtree if possible.
Hibbard’s paper was remarkable in that it contained one of the first formal theorems about algorithms and still features frequently in lectures and textbooks.
</p>
<figure>
{{ $figure := figure "hibbard" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: The root node <i>{{ inline "docs/plots/figures/d.svg" }}</i> is being deleted.
This requires a join of the subtrees rooted at <i>{{ inline "docs/plots/figures/b.svg" }}</i> and <i>{{ inline "docs/plots/figures/f.svg" }}</i> to replace <i>{{ inline "docs/plots/figures/d.svg" }}</i>.
There are two choices for the joining root: the tree in the bottom left shows the result of using the right-most node of the left subtree, therefore <i>{{ inline "docs/plots/figures/c.svg" }}</i>, and the tree in the bottom right shows the result of using the left-most node of the right subtree, therefore <i>{{ inline "docs/plots/figures/e.svg" }}</i>.
Notice that the subtree <i>{{ inline "docs/plots/figures/tree.svg" }}</i> of the joining root is never accessed and replaces the joining root at the bottom of the path.
</p>
</figure>
<p>
Hibbard's algorithm is not symmetric — it never considers whether the left subtree is empty, only the right.
Knuth{{ template "reference" "1973_knuth_taocp_vol_3_section_6" }} noticed this asymmetry and suggested an adjustment to symmetrically use the right subtree as-is when the left subtree is empty.
This algorithm improves the symmetry but is still biased to one side because the joining root is always taken from the same side when neither subtree is empty.
</p>
<p>
In 1983, Eppinger{{ template "reference" "1983_eppinger" }} suggested a symmetric variant of Hibbard's algorithm that randomly prefers the left or the right with equal probability.
Their experiments involve random insertions and deletions and measure the average path length over time.
They found that the average path length actually improves when using their symmetric approach, eventually stabilizing to be better than that of random trees.
</p>
<p>
In 1989, Culberson and Munro{{ template "reference" "1989_culberson_munro" }}{{ template "reference" "1986_culberson" }} published an extensive study into the long-term behavior of random insertions and deletions.
They presented evidence that the use of the Hibbard or Knuth algorithms, when coupled with random insertions, causes the average path length to increase after many iterations, then stabilize.
They also mention the symmetric approach by Eppinger and show that the same idea when applied to Knuth's algorithm yields even better results.
</p>
<p>
Popular textbooks are inconsistent in their approach to deletion in binary search trees.
For example,
in the well-known <em>Introduction to Algorithms</em>, Cormen, Leiserson, Rivest, and Stein{{ template "reference" "2022_clrs_delete" }} teach Knuth's algorithm,
as well as Drozdek{{ template "reference" "2013_drozdek_delete" }} who attributes the algorithm to both Hibbard and Knuth.
Sedgewick and Wayne{{ template "reference" "2011_sedgewick_wayne_delete" }} teach Hibbard's algorithm, but their code appears to implement Knuth's algorithm.
Goodrich, Tamassia and Goldwasser{{ template "reference" "2013_goodrich_tamassia_goldwasser_delete" }} teach Hibbard's algorithm but flip it to always prefer the left subtree instead.
Skiena,{{ template "reference" "2008_skiena_delete" }}
Weiss,{{ template "reference" "2014_weiss_delete" }}
and Manber{{ template "reference" "1989_manber" }} all teach Hibbard's algorithm.
</p>
<p>
So far, none of the algorithms consider subtree size to determine from which side to produce the joining root.
Given that subtree size must be known to search by position, we can make use of this information to determine whether to produce the root from the left or the right subtree.
</p>
<p>
Consider the node to be deleted and its position relative to its median — a node to the left of its median has a larger right subtree, and a node to the right of its median has a larger left subtree.
The difference between these subtree sizes can be thought of as the distance to the ideal branch.
This distance can be reduced slightly by always producing the root from the larger subtree.
</p>
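<p>
A sketch of a join that prefers the larger subtree, where <em>deleteMax</em> and <em>deleteMin</em> are assumed helpers that detach the right-most or left-most node of a subtree, returning the detached node together with the remaining subtree and updating size fields along their search paths:
</p>
<pre>
<strong>join</strong> (l, r *Node, sl, sr Size) *Node {
    if l == nil {
        return r
    }
    if r == nil {
        return l
    }
    if sl > sr {
        p, l = deleteMax(l) // joining root from the larger left subtree
        p.s = sl - 1        // the left subtree lost its right-most node
    } else {
        p, r = deleteMin(r) // joining root from the larger right subtree
        p.s = sl            // the left subtree is unchanged
    }
    p.l = l
    p.r = r
    return p
}</pre>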
<p>
The inverse of this strategy is to instead always prefer the <em>smaller</em> subtree.
While this might seem clearly counter-productive, consider that the path length to the joining root is likely to be shorter in the smaller subtree than in the larger subtree.
Generally, a lower total traversal distance corresponds to fewer interactions with memory since fewer nodes need to be accessed.
In this case, while the average path length of the resulting tree might be longer than other strategies, the join itself might resolve faster by accessing fewer nodes on average.
</p>
<p>
A related idea is a practical combination of preferring the larger subtree and the randomized symmetry of Eppinger's algorithm.
To make the determination, generate a random number no greater than the sum of the two subtree sizes, then prefer the left subtree if that number is less than the size of the left subtree.
When the left subtree is smaller, there is a lower probability that the generated number will be less than its size, and therefore a higher probability that the larger subtree will be chosen.
This results in a lower average path length than the symmetric algorithm of Eppinger, and both the Hibbard and Knuth algorithms.
This is to be expected, because the Eppinger algorithm has a 50% chance of choosing the larger subtree, and the randomized strategy still has some chance of choosing the smaller subtree.
<!-- Another downside of randomization is the cost of generating random numbers, which Eppinger avoids by using a simple alternating bit, but is unavoidable when generating a bounded random number.-->
</p>
<p>
To compare the resulting average path length of each strategy, a random tree of <span class="numeric">1,000</span> nodes was constructed and measured across <span class="numeric">1,000,000,000</span> random insertions and deletions.
The results show that always preferring the larger subtree results in the lowest average path length.
All the evaluated algorithms appear to stabilize with an average path length that is logarithmic in the size of the tree.
</p>
<figure>
{{ $figure := figure "delete" }}
{{ inline $figure.URL }}
</figure>
</section>
<section id="persistence">
<h3>Persistence</h3>
<p>
A <em>persistent</em> data structure can create many independent versions of itself over time.
Any version can be modified to produce a new version without affecting existing versions.
To achieve this, a program can preserve the current version by copying it before making changes to produce a new version.
</p>
<!-- https://en.wikipedia.org/wiki/Persistent<em>data</em>structure-->
<section id="reference-counting">
<h4>Reference counting</h4>
<p>
To avoid copying all the nodes when copying a tree, we can allow trees to share common subtrees in memory.
Some tree far in the future might still reference a node that was allocated for a much earlier version.
We need to keep track of this, so we store in every node a <em>reference count</em> to indicate the number of other trees that also reference that node.
</p>
<p>
When the reference count is zero, it suggests that no other trees share that node, so the tree has ownership and may modify it.
Otherwise, when the reference count is greater than zero, it suggests that at least one other tree shares that node in memory.
In this case, modifying the node would unintentionally also modify the other trees that reference it, so the node must first be detached from its shared state.
This can be done by replacing the node with a copy, setting the reference count of the copy to zero, sharing the left and right subtrees by incrementing their reference counts, and decrementing the reference count of the initial node since it is no longer referenced by this tree.
</p>
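<p>
A sketch of this copy-on-write step, assuming a <em>refs</em> count added to the node structure:
</p>
<pre>
<strong>detach</strong> (p *Node) *Node {
    if p.refs == 0 {
        return p                    // exclusively owned, safe to modify
    }
    c = &Node{x: p.x, s: p.s, l: p.l, r: p.r}   // the copy starts at zero
    if c.l != nil {
        c.l.refs = c.l.refs + 1     // both subtrees are now shared
    }
    if c.r != nil {
        c.r.refs = c.r.refs + 1
    }
    p.refs = p.refs - 1             // this tree no longer references p
    return c
}</pre>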
<!-- https://en.wikipedia.org/wiki/Copy-on-write-->
<!-- https://en.wikipedia.org/wiki/Reference_counting-->
<!-- https://en.wikipedia.org/wiki/Immutable_object-->
<!---->
</section>
<section id="path-copying">
<h4>Path copying</h4>
<p>
Consider now that a node is being modified in the depths of some tree, but that node is shared, so it must first be copied.
How can the tree know about the copy of the node being modified?
How could its root reach it?
When a node replaces another node on the search path, there is a parent node that must be modified to now point to the copy instead, so it too must be copied.
This cascades all the way up to the root.
</p>
<p>
Generally, all paths that lead to a modification must be copied.
Every node has a unique path from the root, so for any operation we can mark the nodes that need to be modified and trace their paths back to the root;
it is these paths that need to be copied so that they all belong to the same new version of the tree that includes the result of the modification.
</p>
<p>
Binary search trees are especially well-suited for persistence because the amount of information to copy is bounded by the maximum path length, which is usually logarithmic in the size of the tree.
When this is the case, for example, a tree of <span class="numeric">1,000,000</span> nodes would only need to copy <span class="numeric">∼20</span> nodes to produce a new, completely independent version.
</p>
<figure>
{{ $figure := figure "persistence" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: An illustration of path copying.
The tree in the top left highlights a search path to insert a new node, the tree in the center left shows the deletion of a node, and the tree in the bottom left shows an insertion along the left-most path.
Notice that the resulting tree in the bottom right is composed from subtrees of previous versions.
</p>
</figure>
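<p>
A sketch of persistent leaf insertion by path copying, where every node along the search path is replaced by a copy — reference counting is left out for brevity:
</p>
<pre>
<strong>insert</strong> (p *Node, i Position, x Data) *Node {
    if p == nil {
        return &Node{x: x}
    }
    c = &Node{x: p.x, s: p.s, l: p.l, r: p.r}   // copy the node on the path
    if i <= c.s {
        c.s = c.s + 1
        c.l = insert(c.l, i, x)
    } else {
        c.r = insert(c.r, i - c.s - 1, x)
    }
    return c                        // the caller points to the copy
}</pre>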
<p>
Other persistent tree data structures such as B-trees{{ template "reference" "1972_bayer_mccreight" }}{{ template "reference" "2006_rodeh" }} and RRB-trees{{ template "reference" "2012_bagwell_rompf" }}{{ template "reference" "2015_stucki_rompf_ureche_bagwell" }}{{ template "reference" "2017_puente" }} pack several values into each node and have more outgoing links per node.
The motivation is usually (i) to reduce memory allocation by not requiring a new node for every value, (ii) to reduce path length by having more branches, and (iii) to read multiple values at a time from memory during search.
In a persistent context, these nodes must be copied no differently, but instead of copying just one value we need to copy all the values in the node.
Search paths may be shorter but there is more information to copy.
</p>
<p>
The number of values to copy can be estimated by {{ inline "docs/katex/math/inline/mlog_b(n).katex.html" }}, where <span class="math">m</span> is the number of values per node, <span class="math">b</span> is the number of branches per node, and <span class="math">n</span> is the number of nodes in the tree.
This is not an exact science of course, because actual path lengths may vary and some nodes with larger capacities might not be full.
This expression therefore only provides an idea of the expected growth of information to copy as size increases.
</p>
<p>
Binary trees use {{ inline "docs/katex/math/inline/m_eq_1.katex.html" }} and {{ inline "docs/katex/math/inline/b_eq_2.katex.html" }}, one value, two branches.
B-trees in the Rust standard library use {{ inline "docs/katex/math/inline/m_eq_11.katex.html" }} and {{ inline "docs/katex/math/inline/b_eq_12.katex.html" }}, and RRB-trees use {{ inline "docs/katex/math/inline/m_eq_32.katex.html" }} and {{ inline "docs/katex/math/inline/b_eq_5.katex.html" }}.
</p>
<figure>
{{ $figure := figure "mlogm" }}
{{ inline $figure.URL }}
</figure>
</section>
<section id="parent-pointers">
<h4>Parent pointers</h4>
<p>
Many implementations of binary search trees use a node structure that includes a third link: the <em>parent</em> pointer, which points back up towards the root.
Path copying requires that all paths that lead to a modification must be copied, but parent pointers create cycles such that all paths lead to all nodes.
Copying every node is not feasible, so parent pointers are not compatible with path copying and therefore not part of our node structure.
Avoiding parent pointers also reduces memory interaction because there are fewer pointers to maintain, and no need to account for the pointer in the node structure.
</p>
<figure>
{{ $figure := figure "parent_pointers" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: An illustration of a binary tree that uses nodes with parent pointers.
</p>
</figure>
</section>
</section>
<section id="concurrency">
<h3>Concurrency</h3>
<p>
A data structure that supports <em>concurrency</em> can run multiple independent operations at the same time.
This is not exactly <em>parallelism</em>, which is where the work of a single operation is divided up to be worked on together.
To illustrate this in a binary search tree, as shown in Figure {{ (figure "concurrency").Index }}, consider one operation that searches for an existing node by position, and another that inserts a new node.
These operations may start at the same time, but only one of them can win the race to the root node and lock others out.
The other operation must wait for the node to be unlocked to gain access.
</p>
<p>
The search operation branches to the left, locks that node, and only then unlocks the root node.
The insert operation is now clear to access the root and lock it.
The second operation may be locked out again if it also branches to the left, but if it branches to the right then the two operations become independent.
</p>
<figure>
{{ $figure := figure "concurrency" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: Two concurrent operations, one is a search and the other is an insertion.
The first operation branches to the left at the root, and the second operation branches to the right at the root.
Assuming that both operations are top-down, as soon as they branch in different directions they become independent, allowing them to operate concurrently.
</p>
</figure>
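<p>
A sketch of this hand-over-hand locking discipline applied to the search algorithm from Part 1, assuming per-node <em>lock</em> and <em>unlock</em> operations and a non-empty tree:
</p>
<pre>
<strong>search</strong> (t *Tree, i Position) *Node {
    p = t.root
    lock(p)
    for i ≠ p.s {
        if i < p.s {
            q = p.l
        } else {
            i = i - p.s - 1
            q = p.r
        }
        lock(q)     // acquire the child before releasing the parent
        unlock(p)
        p = q
    }
    unlock(p)
    return p
}</pre>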
<section id="recursion">
<h4>Recursion</h4>
<p>
Binary tree algorithms often use <em>recursion</em> to implement operations in two phases: a search phase descending from the root, and a maintenance phase ascending back to the root.
Another operation would need to wait for the algorithm to go all the way down then come all the way back up to make final adjustments before unlocking the node.
For this reason, we explore whenever possible the potential to avoid recursion by combining the search and maintenance phases into a single top-down phase in constant space.
Recursive implementations are still valuable because they are often simpler to implement and help to verify top-down variants.
</p>
</section>
</section>
<!-- <section id="access-patterns">-->
<!-- <h3>Access patterns</h3>-->
<!-- <p>-->
<!-- </p>-->
<!-- <section id="distributions">-->
<!-- <h4>Number distributions, benchmarks, and measurements</h4>-->
<!-- <p>-->
<!-- </p>-->
<!-- </section>-->
<!-- </section>-->
</section>
<section id="restoring-balance">
<h2><span>Part 2</span>Restoring Balance</h2>
<p>
Consider what happens to the tree structure when we repeatedly insert nodes at the start of the sequence.
Eventually, the structure becomes <em>unbalanced</em> because too many branches are too far from their median.
A tree is well-balanced when its average path length is low, and a linked list can be thought of as a tree with the worst possible balance.
<!-- Starting from an empty tree and always inserting at the start of the sequence would create precisely a linked list.-->
</p>
<p>
To restore balance, we need a way to lower the average path length without changing the sequence of the nodes.
Some strategies spend a lot of energy to restore perfect balance, while others spend less energy by being more relaxed.
We can evaluate this as a trade-off between the cost of balancing and the benefits of balance.
</p>
<!-- <p>-->
<!-- A node only stores references to other nodes, not the nodes themselves, so to get the information of another node we must read it from memory.-->
<!-- The program must therefore interact with memory whenever it follows a link from one node to another.-->
<!-- Reducing the number of links to follow is therefore a strategy to reduce interaction with memory.-->
<!-- </p>-->
<figure>
{{ $figure := figure "balance" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: An unbalanced tree on the left, and the same tree but balanced on the right.
</p>
</figure>
<section id="rotations">
<h3>Rotations</h3>
<p>
A tree <em>rotation</em> is a local structure adjustment that rearranges links between nodes without changing their order.
The effect of a rotation can be observed by focusing on the root of the rotation and the change in the composition of its subtrees.
</p>
<p>
There are two symmetrical rotations: a right rotation which moves the root node down to the right and the left node up to the right to take its place,
and a left rotation which moves the root node down to the left and the right node up to the left to take its place.
</p>
<figure>
{{ $figure := figure "rotations" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>:
From the left tree to the right tree is a <em>right rotation</em>:
push <i>{{ inline "docs/plots/figures/c.svg" }}</i> down to the right,
pull <i>{{ inline "docs/plots/figures/a.svg" }}</i> up into its place,
move <i>{{ inline "docs/plots/figures/b.svg" }}</i> across to the right.
From the right tree to the left tree is a <em>left rotation</em>:
push <i>{{ inline "docs/plots/figures/a.svg" }}</i> down to the left,
pull <i>{{ inline "docs/plots/figures/c.svg" }}</i> up into its place,
move <i>{{ inline "docs/plots/figures/b.svg" }}</i> across to the left.
Notice that <i>{{ inline "docs/plots/figures/c.svg" }}</i> has both <i>{{ inline "docs/plots/figures/a.svg" }}</i> and <i>{{ inline "docs/plots/figures/b.svg" }}</i>
in its left subtree on the left, then after the right rotation only has <i>{{ inline "docs/plots/figures/b.svg" }}</i>.
Both trees have the same sequence,
<i>{{ inline "docs/plots/figures/a.svg" }}</i> <i>{{ inline "docs/plots/figures/b.svg" }}</i> <i>{{ inline "docs/plots/figures/c.svg" }}</i>.
Subtrees marked <i>{{ inline "docs/plots/figures/tree.svg" }}</i> are not accessed or modified.
</p>
</figure>
<pre>
<strong>rotateR</strong> (p *Node) *Node {
    l = p.l
    p.l = l.r               // the left node's right subtree moves across
    l.r = p                 // the previous root becomes the right child
    p.s = p.s - l.s - 1     // p loses l, and l's left subtree
    return l
}
<strong>rotateL</strong> (p *Node) *Node {
    r = p.r
    p.r = r.l               // the right node's left subtree moves across
    r.l = p                 // the previous root becomes the left child
    r.s = r.s + p.s + 1     // r gains p, and p's left subtree
    return r
}</pre>
<p>
A well-known strategy to restore balance using rotations is the Day-Stout-Warren algorithm, or simply DSW, which was designed by Stout and Warren{{ template "reference" "1986_dsw" }} in 1986 based on work by Day{{ template "reference" "1976_day" }} in 1976.
This algorithm first transforms a tree into a linked list using right rotations, then transforms that linked list into a perfectly balanced tree using left rotations — all done in linear time and constant space.
</p>
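<p>
As an illustration of how these pieces fit together, here is a sketch of DSW built from the rotations above.
The helper names <i>treeToVine</i>, <i>compress</i>, and <i>vineToTree</i> echo the published description, but the code itself is our sketch, not the authors' pseudocode.
Because <i>rotateR</i> and <i>rotateL</i> maintain the size field, the sizes remain correct throughout.
</p>
<pre>
<strong>treeToVine</strong> (root *Node) *Node {
    n = Node{}                    // pseudo-root: the tree hangs off its right link
    n.r = root
    tail = &n
    rest = tail.r
    for rest ≠ nil {
        if rest.l == nil {
            tail = rest           // no left child: step down the vine
            rest = rest.r
        } else {
            rest = rotateR(rest)  // rotate the left child up
            tail.r = rest
        }
    }
    return n.r
}
<strong>compress</strong> (n *Node, count Size) {
    for count > 0 {
        n.r = rotateL(n.r)        // left rotation at every other vine node
        n = n.r
        count = count - 1
    }
}
<strong>vineToTree</strong> (root *Node, s Size) *Node {
    n = Node{}                    // pseudo-root, as in treeToVine
    n.r = root
    top = 2^⌊log2(s + 1)⌋         // size of the largest perfect tree of at most s nodes
    compress(&n, s + 1 - top)     // the first pass places the excess nodes at the bottom
    s = top - 1
    for s > 1 {
        compress(&n, s / 2)       // each further pass halves the length of the vine
        s = s / 2
    }
    return n.r
}</pre>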
</section>
<section id="partitioning">
<h3>Partitioning</h3>
<p>
In 1980, Stephenson{{ template "reference" "1980_stephenson" }} presented an algorithm that always inserts a new node as the root of a tree, commonly referred to as <em>root insertion</em>.
This is done by splitting the tree in two: the nodes that occur before the leaf position and the nodes that occur after it, which then become the left and right subtrees of the new node.
A sketch of root insertion in this style is given at the end of this section.
</p>
<p>
A related algorithm is <em>partition</em>, which rearranges the tree to make the node at a given position the root while preserving the order of all nodes.
To partition a tree, start from the root and follow the standard search algorithm.
Along the search path, when branching to the left, the current node will end up to the right of the partitioned root, and all further nodes along the search path will end up to the left of the current node.
This applies symmetrically when branching to the right.
</p>
<p>
Using this information, the algorithm builds two subtrees top-down, the left partition and the right partition.
When the search branches to the left, attach the current node to the left of the left-most node of the right partition, preserving the right subtree of the current node.
When the search branches to the right, attach the current node to the right of the right-most node of the left partition, preserving its left subtree.
</p>
<p>
When the search reaches the node that is to become the root, attach its current left subtree to the right of the right-most node of the left partition, and attach its current right subtree to the left of the left-most node of the right partition.
Then, set its left subtree to be the root of the left partition, and set its right subtree to be the root of the right partition.
Finally, set this node as the root of the tree.
</p>
<pre>
<strong>partition</strong> (p *Node, i Position) *Node {
    n = Node{s: i}          // sentinel; s remembers the target position,
    l = &n                  // the left partition accumulates at n.r,
    r = &n                  // the right partition at n.l
    for i ≠ p.s {
        if i < p.s {
            r.l = p             // p joins the right partition
            p.s = p.s - i - 1   // keep only the part of its left subtree after i
            r = r.l
            p = p.l
        } else {
            l.r = p             // p joins the left partition
            i = i - p.s - 1     // position relative to p's right subtree
            l = l.r
            p = p.r
        }
    }
    r.l = p.r       // p's subtrees complete the two partitions
    l.r = p.l
    p.l = n.r       // the left partition becomes the left subtree
    p.r = n.l       // the right partition becomes the right subtree
    p.s = n.s       // p now sits at the requested position
    return p
}</pre>
<figure>
{{ $figure := figure "partition" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>:
The tree at the bottom is the result of partitioning the tree at the top at <i>{{ inline "docs/plots/figures/c.svg" }}</i>.
The search path is indicated by arrows, and subtrees marked <i>{{ inline "docs/plots/figures/tree.svg" }}</i> are not accessed or modified.
</p>
</figure>
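<p>
Root insertion can be expressed with the same top-down technique: follow the search path to the leaf position, distributing the nodes along the way over a left and a right partition, then set those partitions as the subtrees of the new node.
The sketch below is ours, not Stephenson's original formulation; the name <i>insertRoot</i> is illustrative, and the new node doubles as the sentinel that collects the two partitions.
</p>
<pre>
<strong>insertRoot</strong> (p *Node, i Position) *Node {
    n = Node{s: i}          // the new node; s records its final position
    l = &n
    r = &n
    for p ≠ nil {
        if i ≤ p.s {
            r.l = p         // p comes after the insertion point:
            p.s = p.s - i   // it joins the right partition
            r = r.l
            p = p.l
        } else {
            l.r = p         // p comes before the insertion point:
            i = i - p.s - 1 // it joins the left partition
            l = l.r
            p = p.r
        }
    }
    l.r = nil               // terminate both partitions
    r.l = nil
    t = n.l                 // the partitions accumulated on swapped sides,
    n.l = n.r               // so exchange them
    n.r = t
    return &n
}</pre>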
</section>
<section id="median-balance">
<h3>Median balance</h3>
<div class="definition">
<p class="underline">Definition</p>
<p class="indent">A node is <em>median-balanced</em> if the size of its left and right subtrees differ by no more than 1.</p>
</div>
<p>
To determine if a node is median-balanced, we can add 1 to the size of its smaller subtree to see if it becomes greater than or equal to the size of its larger subtree.
Without loss of generality, assume that {{ inline "docs/katex/math/inline/x_leq_y.katex.html" }}.
</p>
<p>
{{ inline "docs/katex/math/display/balance_median.katex.html" }}
</p>
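<p>
As a sketch, this check might be implemented as follows, assuming a <i>size</i> helper that returns the number of nodes in a subtree, with 0 for an empty one.
The <i>balance</i> routine below relies on a predicate of this shape.
</p>
<pre>
<strong>balanced</strong> (p *Node) bool {
    x = size(p.l)
    y = size(p.r)
    if y < x {
        x, y = y, x     // let x be the size of the smaller subtree
    }
    return x + 1 ≥ y
}</pre>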
<p>
A median-balanced tree is perfectly balanced because every branch divides the search space as evenly as possible.
In 2017, Muusse{{ template "reference" "2017_muusse" }} published an algorithm to balance a binary search tree that uses <em>partition</em> to replace every node by its underlying median to produce a median-balanced tree.
</p>
<p>
Partitioning a tree around its median distributes the nodes evenly between its left and right subtrees, but there might be nodes deeper within those subtrees that are not median-balanced.
Repeating the partitioning recursively in the left and right subtrees results in a median-balanced tree.
</p>
<pre>
<strong>balance</strong> (p *Node, s Size) *Node {
    if p == nil {
        return p
    }
    if !balanced(p) {
        p = partition(p, s / 2)     // make the median of this subtree its root
    }
    p.l = balance(p.l, size(p.l))   // then balance both subtrees recursively
    p.r = balance(p.r, size(p.r))
    return p
}</pre>
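<p>
To balance an entire tree, the routine is invoked on the root with the total size, as in <i>balance(root, size(root))</i>, with the same assumed <i>size</i> helper as above.
</p>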
<p>
This algorithm has several useful properties:
(i) it is general for any definition of balance based on subtree size,
(ii) it operates top-down in constant space,
(iii) it is particularly efficient when the tree is already somewhat balanced,
(iv) subtrees are balanced independently, so the work could be done in parallel, and
(v) the balancing process could be stopped or paused partway through without invalidating the structure.
</p>
<p>
There are, however, some node arrangements that are not median-balanced yet have the same average path length as if they were.
Consider the following trees that both form the same sequence of 5 nodes with an average path length of 6/5.
</p>
<figure>
{{ $figure := figure "median_balance" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>: Two trees with the same average path length and maximum path length, where one is median-balanced and the other is not.
To calculate average path length, first count the total path length then divide by the size of the tree.
The tree on the left has path lengths A:1 B:0 C:2 D:1 E:2 for a total of 6, has 5 nodes, and is not median-balanced.
The tree on the right has path lengths A:2 B:1 C:0 D:1 E:2 for a total of 6, also has 5 nodes, and is median-balanced.
</p>
</figure>
<p>
The first tree is not median-balanced because 5/2 rounds down to 2, so the median is <i>{{ inline "docs/plots/figures/c.svg" }}</i> at position 2, while the root is currently <i>{{ inline "docs/plots/figures/b.svg" }}</i>.
The median-balancing strategy would partition the tree rooted at <i>{{ inline "docs/plots/figures/b.svg" }}</i> to make <i>{{ inline "docs/plots/figures/c.svg" }}</i> the root, but the average path length would stay the same.
This partitioning step is therefore redundant: it does not improve the balance of the tree.
</p>
</section>
<section id="height-balance">
<h3>Height balance</h3>
<div class="definition">
<p class="underline">Definition</p>
<p class="indent">The <em>height</em> of a tree is the length of the longest path starting from its root.</p>
</div>
<div class="definition">
<p class="underline">Definition</p>
<p class="indent">A node is <em>height-balanced</em> if the height of its left and right subtrees differ by no more than 1.</p>
</div>
<p>
We can continue to use the same partition-based balancing algorithm but change the definition of balance to only partition if the node is not already height-balanced — every node that is not height-balanced is partitioned to become median-balanced.
This results in a height-balanced tree because every node that is already height-balanced is left as such, and every node that is not height-balanced becomes median-balanced.
</p>
<p>
The problem that remains is to determine, from subtree sizes alone, whether a node is already height-balanced.
Muusse{{ template "reference" "2017_muusse" }} solves this with a stricter definition of height-balance, in which both the minimum and the maximum path lengths of the two subtrees differ by no more than 1.
Then, by pretending that both subtrees are already strictly height-balanced, we can compare their minimum and maximum heights.
</p>
<p>
A binary tree is <em>perfect</em> if every node has either two subtrees or two empty subtrees.
The height of a perfect binary tree is {{ inline "docs/katex/math/inline/log2(s_plus_1)_minus_1.katex.html" }}.
This is intuitive: every level of the tree has twice as many nodes as the level above it, so the number of branches to follow from the root to the bottom is the number of times the size can be divided by 2 before reaching 1.
For example, a perfect tree of 7 nodes has height log2(8) − 1 = 2.
</p>
<figure>
{{ $figure := figure "perfect_trees" }}
{{ inline $figure.URL }}
<p class="caption">
<strong>Figure {{ $figure.Index }}</strong>:
Perfect binary trees of size {{ inline "docs/katex/math/inline/2^n-1.katex.html" }}.
</p>
</figure>
<p>
When a tree is strictly height-balanced, all the levels of the tree are full except perhaps the bottom level, where some nodes might have only 1 subtree.
Taking the <em>floor</em> of the logarithm effectively empties the partial bottom level of the smaller subtree, and taking the <em>ceil</em> completes the bottom level of the larger subtree, giving the minimum and maximum heights of the two subtrees.
We can then determine whether the node is height-balanced by comparing these heights.
</p>
<p>
{{ inline "docs/katex/math/display/balance_height.katex.html" }}
</p>
<p>
This function, as published, can be simplified with an identity of the discrete binary logarithm where {{ inline "docs/katex/math/inline/binary_log_identity.katex.html" }}.
This allows both sides of the inequality to be expressed in terms of the floor of the log.
The result is the same whether the last level is completed with <em>ceil</em> or emptied with <em>floor</em> and incremented.
</p>
<p>
{{ inline "docs/katex/math/display/balance_height_simplified.katex.html" }}
</p>
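<p>
As an illustrative sketch, this simplified check can be expressed with the discrete binary logarithm, where ⌊log2(z)⌋ is the position of the highest set bit of z.
The name <i>heightBalanced</i> and the exact inequality below, ⌊log2(x + 1)⌋ ≥ ⌊log2(y)⌋ for a smaller size x and a larger size y, are our reading of the simplification rather than a quotation of the published formula.
</p>
<pre>
<strong>heightBalanced</strong> (x Size, y Size) bool {
    if y < x {
        x, y = y, x       // let x be the size of the smaller subtree
    }
    if y == 0 {
        return true       // two empty subtrees are trivially balanced
    }
    return ⌊log2(x + 1)⌋ ≥ ⌊log2(y)⌋
}</pre>
<p>
Plugged into the balancing algorithm above, the <i>balanced</i> predicate would then become <i>heightBalanced(size(p.l), size(p.r))</i>.
</p>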