1 | | -(* |
2 | | - # General idea |
3 | | - |
4 | | - It is easiest to explain the general idea on an array of infinite size.
5 | | - Let's start with that. Each element in such an array constitutes a single-use |
6 | | - exchange slot. Enqueuer increments [tail] and treats prior value as index of |
7 | | - its slot. Same for dequeuer and [head]. This effectively creates pairs |
8 | | - (enqueuer, dequeuer) assigned to the same slot. Enqueuer leaves the value in |
9 | | - the slot, the dequeuer copies it out.
10 | | - |
11 | | - Enqueuer never fails. It always gets a brand-new slot and places the item in it.
12 | | - Dequeuer, on the other hand, may witness an empty slot. That's because [head]
13 | | - may overshoot [tail]. Remember, indices are incremented blindly. For now,
14 | | - assume the dequeuer simply spins on the empty slot until an item appears.
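| | -
| | - A minimal sketch of this idealized scheme (hypothetical code: it assumes an
| | - unbounded array plus the [head]/[tail]/[array] fields of the record defined
| | - further down; everything else in this comment refines it):
| | -
| | -     (* Claim a ticket with fetch_and_add, then use it as a slot index. *)
| | -     let push t item =
| | -       let i = Atomic.fetch_and_add t.tail 1 in
| | -       Atomic.set t.array.(i) (Some item)       (* brand-new slot: plain set is fine *)
| | -
| | -     let pop t =
| | -       let i = Atomic.fetch_and_add t.head 1 in
| | -       let rec wait () =
| | -         match Atomic.get t.array.(i) with
| | -         | Some item -> item                    (* the paired enqueuer delivered *)
| | -         | None -> Domain.cpu_relax (); wait () (* slot still empty: spin *)
| | -       in
| | -       wait ()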
| 1 | +include Lockfree.Relaxed_queue |
15 | 2 |
|
16 | | - That's it. There are a few things flowing from this construction:
17 | | - * Slots are atomic. This is where paired enqueuer and dequeuer communicate. |
18 | | - * [head] overshooting [tail] is a normal condition and that's good - we want |
19 | | - to keep operations on [head] and [tail] independent. |
20 | | -
|
21 | | - # Finite array |
22 | | -
|
23 | | - Now, to make it work in the real world, simply treat the finite array as
24 | | - circular, i.e. wrap around after reaching the end. Slots are now re-used, so
25 | | - we need to be more careful.
26 | | - |
27 | | - Firstly, if there are too many items, an enqueuer may witness a full slot. Let's
28 | | - assume the enqueuer simply spins on the full slot until some dequeuer comes along
29 | | - and takes the old value.
30 | | - |
31 | | - Secondly, in the case of overlap, there can be more than 2 threads (1x enqueuer,
32 | | - 1x dequeuer) assigned to a single slot (imagine 10 enqueuers spinning on an 8-slot
33 | | - array). In fact, it could be any number. Thus, all operations on a slot have to
34 | | - use CAS to ensure that no item is overwritten on store and no item is dequeued by
35 | | - two threads at once.
36 | | -
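| | - For instance (illustrative arithmetic only), with the hypothetical 8-slot array
| | - above, tickets 2 and 10 both map to slot 2, so two enqueuers can race on it:
| | -
| | -     let mask = 8 - 1                       (* size must be a power of two *)
| | -     let () = assert (2 land mask = 2)      (* ticket 2  -> slot 2 *)
| | -     let () = assert (10 land mask = 2)     (* ticket 10 -> slot 2 again: hence CAS *)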
|
37 | | - The above works okay in practice, and there is some relevant literature, e.g.
38 | | - (DOI: 10.1145/3437801.3441583) analyzed this particular design. There are also
39 | | - plenty of older papers looking at similar approaches
40 | | - (e.g. DOI: 10.1145/2851141.2851168).
41 | | -
|
42 | | - Note, this design may violate FIFO (on overlap). The risk can be minimized by
43 | | - ensuring size of array >> number of threads, but it never drops to zero.
44 | | - (github.com/rigtorp/MPMCQueue has a nice way of fixing this; we could add it.)
45 | | -
|
46 | | - # Blocking (non-lockfree paths on full, empty) |
47 | | -
|
48 | | - Up until now, [push] and [pop] were allowed to block indefinitely on an empty or
49 | | - full queue. Overall, what can be done in those states?
50 | | -
|
51 | | - 1. Busy wait until able to finish. |
52 | | - 2. Rollback own index with CAS (unassign itself from slot). |
53 | | - 3. Move the other index forward with CAS (assign itself to the same slot as the
54 | | -    opposite action).
55 | | - 4. Mark the slot as burned - dequeue only.
56 | | -
|
57 | | - Which one then? |
58 | | -
|
59 | | - Let's optimize for stability, i.e. some reasonable latency that won't get much
60 | | - worse under heavy load. Busy wait is great because it does not cause any contention
61 | | - in the hotspots ([head], [tail]). Thus, start with busy wait (1). If the queue is
62 | | - busy and moving fast, there is a fair chance that within, say, 30 spins, we'll
63 | | - manage to complete the action without having to add contention elsewhere.
64 | | - |
65 | | - Once N busy-loops happen and nothing changes, we probably want to return even if
66 | | - it costs us. (2) and (3) both allow that. (2) doesn't add contention to the other
67 | | - index like (3) does. Say there are a lot more dequeuers than enqueuers: if all
68 | | - dequeuers did (3), they would add a fair amount of contention to the [tail] index
69 | | - and slow the already-outnumbered enqueuers further. So, (2) > (3) for that reason.
70 | | -
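| | - A sketch of (2) for a dequeuer (hypothetical helper; [head_val] is assumed to be
| | - the ticket our [fetch_and_add] on [head] returned, and [ccas] is defined below):
| | -
| | -     (* Undo our own increment of [head]. This can only succeed while no
| | -        later dequeuer has moved [head] past our ticket. *)
| | -     let try_rollback t head_val =
| | -       ccas t.head (head_val + 1) head_val
| | -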
|
71 | | - However, with just (2), some dequeuers will struggle to return. If many dequeuers
72 | | - constantly try to pop an element and fail, they will form a chain.
73 | | -
|
74 | | - tl hd |
75 | | - | | |
76 | | - [.]-[A]-[B]-[C]-..-[X] |
77 | | -
|
78 | | - For A to rollback, B has to rollback first. For B to rollback, C has to rollback first.
79 | | -
|
80 | | - [A] is likely to experience a large latency spike. In such a case, it is easier
81 | | - for [A] to do (3) rather than hope that all other active dequeuers will unblock it
82 | | - at some point. Thus, it's also worthwhile to try (3) periodically.
83 | | -
|
84 | | - Thus, the current policy does (1) for a bit, then mixes (1) and (2) with a periodic (3).
85 | | -
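| | - Roughly, as a sketch (illustrative only; the real policy lives in [Not_lockfree]
| | - below and differs in details - e.g. it also rechecks the slot between attempts):
| | -
| | -     (* Hypothetical exit path for a dequeuer stuck on an empty slot,
| | -        entered after [spin_threshold] rounds of (1). *)
| | -     let rec exit_empty t head_val tries =
| | -       (* (2): undo our own increment; only the end of the chain can win. *)
| | -       if ccas t.head (head_val + 1) head_val then ()
| | -       (* (3), tried periodically (every 8th attempt here, an arbitrary constant):
| | -          claim the matching enqueue ticket ourselves, so that we hold both sides
| | -          of the slot and can leave it empty. Only the dequeuer at the front of
| | -          the chain (where [tail] = [head_val]) can win this. *)
| | -       else if tries mod 8 = 0 && ccas t.tail head_val (head_val + 1) then ()
| | -       else (Domain.cpu_relax (); exit_empty t head_val (tries + 1))
| | -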
|
86 | | - What about burned slots (4)? |
87 | | -
|
88 | | - It's present in the literature, but I'm weakly against it. If dequeuers are faster
89 | | - to remove items than enqueuers supply them, slots burned by dequeuers are going to
90 | | - make enqueuers do even more work.
91 | | -
|
92 | | - # Resizing |
93 | | -
|
94 | | - The queue does not support resizing, but it can be simulated by wrapping it in a |
95 | | - lockfree list. |
96 | | -*) |
97 | | - |
98 | | -type 'a t = {
99 | | -  array : 'a Option.t Atomic.t Array.t;  (* exchange slots; [None] means empty *)
100 | | -  head : int Atomic.t;  (* next ticket for dequeuers *)
101 | | -  tail : int Atomic.t;  (* next ticket for enqueuers *)
102 | | -  mask : int;  (* size - 1, for cheap modulo; size is a power of two *)
103 | | -}
104 | | - |
105 | | -let create ~size_exponent () : 'a t = |
106 | | - let size = 1 lsl size_exponent in |
107 | | - let array = Array.init size (fun _ -> Atomic.make None) in |
108 | | - let mask = size - 1 in |
109 | | - let head = Atomic.make 0 in |
110 | | - let tail = Atomic.make 0 in |
111 | | - { array; head; tail; mask } |
| 3 | +module Spin = struct |
| 4 | + let push = push |
| 5 | + let pop = pop |
| 6 | +end |
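| | +
| | +(* A hypothetical usage sketch of the spinning variant across two domains
| | +   (OCaml 5 [Domain]s; illustrative only, not part of the API):
| | +
| | +     let () =
| | +       let q = create ~size_exponent:3 () in
| | +       let consumer = Domain.spawn (fun () -> Spin.pop q) in
| | +       Spin.push q 42;
| | +       assert (Domain.join consumer = 42)
| | +*)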
112 | 7 |
|
113 | 8 | (* [ccas] A slightly nicer CAS: does a plain read first, so a CAS that would fail never takes the microarch lock. Use on indices. *)
114 | 9 | let ccas cell seen v = |
115 | 10 | if Atomic.get cell != seen then false else Atomic.compare_and_set cell seen v |
116 | 11 |
|
117 | | -module Spin = struct |
118 | | -  let push { array; tail; mask; _ } item =
119 | | -    let tail_val = Atomic.fetch_and_add tail 1 in  (* claim an enqueue ticket *)
120 | | -    let index = tail_val land mask in              (* wrap into the circular array *)
121 | | -    let cell = Array.get array index in
122 | | -    while not (ccas cell None (Some item)) do      (* spin until the slot is empty *)
123 | | -      Domain.cpu_relax ()
124 | | -    done
125 | | -
126 | | -  let pop { array; head; mask; _ } =
127 | | -    let head_val = Atomic.fetch_and_add head 1 in  (* claim a dequeue ticket *)
128 | | -    let index = head_val land mask in
129 | | -    let cell = Array.get array index in
130 | | -    let item = ref (Atomic.get cell) in
131 | | -    while Option.is_none !item || not (ccas cell !item None) do
132 | | -      Domain.cpu_relax ();                         (* spin until an item appears, *)
133 | | -      item := Atomic.get cell                      (* then race to take it *)
134 | | -    done;
135 | | -    Option.get !item
136 | | -end |
137 | | - |
138 | 12 | module Not_lockfree = struct |
139 | 13 | (* [spin_threshold] Number of times to spin on a slot before trying an exit strategy. *)
140 | 14 | let spin_threshold = 30 |
|