Implement optional flag for rotating index #37

Merged
merged 7 commits into from
Sep 8, 2024
27 changes: 26 additions & 1 deletion README.md
@@ -104,7 +104,7 @@ tests and further documentation are to follow when time allows.

[The full API documentation is kept up-to-date on GitHub.](https://nim-works.github.io/loony/loony.html)

[The API documentation for the Ward submodule is found here.](https://nim-works.github.io/loony/loony/ward.html)
[~~The API documentation for the Ward submodule is found here.~~](https://nim-works.github.io/loony/loony/ward.html) ~~*Wards are untested and are unlikely to remain in the library*~~

#### Memory Safety & Cache Coherence

@@ -114,6 +114,19 @@ committed on the push operation and read on the pop operation; this is a
higher-cost primitive. You can use `unsafePush` and `unsafePop` to manipulate
a `LoonyQueue` without regard to cache coherency for ultimate performance.

The `LoonyQueue` object itself is padded across cache lines and, by default, its
slots are read and written in a cyclic fashion across cache lines to reduce false
sharing.

```
Visual representation of rotating index

| 64 bytes | 64 bytes | 64 bytes |...
| 0------- | 1------- | 2------- |...
| -63------| -64------| -65------|...
|--127-----|--128-----|--129-----|...
```
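
To make the rotation concrete, here is a minimal sketch of the index mapping, assuming the default `loonySlotCount` of 1024 and the shift-and-mask expression used by the `prn` template in `loony/node.nim` from this PR. The proc name `rotate` and the hard-coded constants are illustrative only and are not part of the library.

```nim
const
  slotCount = 1024   # assumed: the default loonySlotCount
  lShiftBits = 6     # log2 of the stride of 64 used in spec.nim
  rShiftBits = 4     # log2(slotCount) - lShiftBits

proc rotate(idx: int): int =
  ## Map a logical slot index to its rotated position:
  ## the low bits pick the stride, the high bits pick the offset inside it.
  ((idx shl lShiftBits) and (slotCount - 1)) or (idx shr rShiftBits)

when isMainModule:
  # The mapping must be a permutation, otherwise two logical
  # indices would collide on the same physical slot.
  var seen: array[slotCount, bool]
  for i in 0 ..< slotCount:
    let j = rotate(i)
    doAssert not seen[j]
    seen[j] = true

  for i in [0, 1, 2, 15, 16, 17]:
    echo i, " -> ", rotate(i)
```

Under these assumptions, logical indices 0, 1 and 2 map to slots 0, 64 and 128, and index 16 wraps around to slot 1. The stride here is counted in slots rather than bytes, so the exact cache-line picture depends on the slot size on the target platform.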

### Debugging

Pass `--d:loonyDebug` in compilation or with a config nimscript to use debug
@@ -140,8 +153,20 @@ debugNodeCounter:
We recommend against changing these values unless you know what you are doing. The suggested max alignment is 16 to achieve drastically higher contention capacities. Compilation will fail if your alignment does not fit the slot count index.

`-d:loonyNodeAlignment=11` - Adjust node alignment to increase/decrease contention capacity

`-d:loonySlotCount=1024` - Adjust the number of slots in each node

`-d:loonyDebug=false` - Toggle debug counters and templates, see
[debugging](#debugging). False by default.

`-d:loonyRotate=true` - Toggle rotation of the slot index, so that
consecutive slots of the loony queue are accessed across cache line
bounds in a cyclic manner. True by default.

> While loonyRotate is enabled, the slot count must be a
> power of 2. Error messages will indicate whether this
> is a cause of compilation failure.
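
As a sketch of how these defines might be set from a config nimscript rather than on the command line, assuming a `config.nims` next to the project's main module (the values are examples, not recommendations):

```nim
# config.nims (assumed location: beside the main module)
switch("define", "loonySlotCount=512")     # stays a power of 2, so rotation still compiles
switch("define", "loonyNodeAlignment=11")  # 1 shl 11 = 2048, which exceeds the slot count
# To use a slot count that is not a power of 2, rotation has to be disabled:
# switch("define", "loonyRotate=false")
```

The same values can be passed directly to the compiler, for example `-d:loonySlotCount=512`.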

## What are Continuations?

If you've somehow missed the next big thing for Nim, see [CPS](https://github.com/nim-works/cps)
13 changes: 10 additions & 3 deletions loony.nim
@@ -25,9 +25,16 @@ type

  LoonyQueue*[T] = ref LoonyQueueImpl[T]
  LoonyQueueImpl*[T] = object
    head : Atomic[TagPtr] ## Whereby node contains the slots and idx
    tail : Atomic[TagPtr] ## is the uint16 index of the slot array
    currTail : Atomic[NodePtr] ## 8 bytes Current NodePtr
    head {.align: 128.}: Atomic[TagPtr] ## Whereby node contains the slots and idx
    tail {.align: 128.}: Atomic[TagPtr] ## is the uint16 index of the slot array
    currTail {.align: 128.}: Atomic[NodePtr] ## 8 bytes Current NodePtr
    # Align to 128 bytes to avoid false sharing, see:
    # https://stackoverflow.com/questions/72126606/should-the-cache-padding-size-of-x86-64-be-128-bytes
    # Plenty of architectural differences can impact whether
    # or not 128 bytes is superior alignment to 64 bytes, but
    # weighed against the memory cost that this change adds
    # to the loony queue object, 128 bytes is still
    # recommended.

## Result types for the private
## advHead and advTail functions
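
For reviewers who want to see the effect of the `{.align: 128.}` pragma on layout, here is a small sketch. It uses a hypothetical `Padded` object with plain `uint` fields instead of the `Atomic[TagPtr]` and `Atomic[NodePtr]` fields of `LoonyQueueImpl`, so it only illustrates the pragma and says nothing about the real type's size.

```nim
type
  Padded = object              # hypothetical stand-in for LoonyQueueImpl
    head {.align: 128.}: uint
    tail {.align: 128.}: uint
    currTail {.align: 128.}: uint

when isMainModule:
  # Each field should begin on its own 128-byte boundary, so that
  # producers and consumers do not share a cache line (or a pair
  # of adjacent lines) when touching head and tail.
  echo "tail offset:     ", offsetOf(Padded, tail)
  echo "currTail offset: ", offsetOf(Padded, currTail)
  echo "object size:     ", sizeof(Padded)
```

The trade-off is the one the comment in the diff describes: three small fields now occupy several hundred bytes, in exchange for keeping them apart.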
2 changes: 1 addition & 1 deletion loony.nimble
@@ -1,4 +1,4 @@
version = "0.3.0"
version = "0.3.1"
author = "cabboose"
description = "Fast mpmc queue with sympathetic memory behavior"
license = "MIT"
15 changes: 13 additions & 2 deletions loony/node.nim
@@ -69,6 +69,17 @@ else:
template incEnqPathCounter*(): untyped = discard
template incDeqPathCounter*(): untyped = discard

template prn*(idx: uint16): uint16 =
  ## prn = 'Pro re nata' - when required
  ## Provides the actual index, depending on
  ## whether we are rotating the index or not.
  when loonyRotate:
    # multiply by cacheLineSize, mod by loonySlotCount
    # then add idx*cacheLineSize/loonySlotCount
    (idx shl lShiftBits) and (loonySlotCount - 1) or (idx shr rShiftBits)
  else:
    idx

template toNodePtr*(pt: uint | ptr Node): NodePtr =
# Convert ptr Node into NodePtr uint
cast[NodePtr](pt)
@@ -105,7 +116,7 @@ proc fetchAddSlot*(t: var Node, idx: uint16, w: uint, moorder: MemoryOrder): uin
## Remembering that the pointer has 3 tail bits clear; these are
## reserved and increased atomically to indicate RESUME, READER, WRITER
## status.
t.slots[idx].fetchAdd(w, order = moorder)
t.slots[prn idx].fetchAdd(w, order = moorder)

proc compareAndSwapNext*(t: var Node, expect: var uint, swap: uint): bool =
t.next.compareExchange(expect, swap, moRelease, moRelaxed)
@@ -131,7 +142,7 @@ proc allocNode*[T](pel: T): ptr Node =
proc tryReclaim*(node: var Node; start: uint16) =
block done:
for i in start..<N:
template s: Atomic[uint] = node.slots[i]
template s: Atomic[uint] = node.slots[prn i]
if (s.load(order = moAcquire) and CONSUMED) != CONSUMED:
var prev = s.fetchAdd(RESUME, order = moRelaxed) and CONSUMED
if prev != CONSUMED:
25 changes: 22 additions & 3 deletions loony/spec.nim
@@ -1,17 +1,36 @@
import std/atomics
import std/[atomics, math, strformat]

const
loonyNodeAlignment {.intdefine.} = 11
loonySlotCount {.intdefine.} = 1024
loonyNodeAlignment* {.intdefine.} = 11
loonySlotCount* {.intdefine.} = 1024

loonyIsolated* {.booldefine.} = false ## Indicate that loony should
## assert that all references passing through the queue have a single
## owner. Note that in particular, child Continuations have cycles,
## which will trigger a failure of this assertion.

loonyRotate* {.booldefine.} = true ## Indicate that loony should rotate
## the slots in the queue to avoid contention on the same cache line.
## This is useful when the queue is shared between multiple threads.
## Note that this will only work if the number of slots is a power of 2.

when loonyRotate:
  # TODO Impl dynamic cache line size detection
  const
    cacheLineSize = 64
    lShiftBits* = int log2(float cacheLineSize)
    rShiftBits* = int(log2(float loonySlotCount)) - lShiftBits

static:
  doAssert (1 shl loonyNodeAlignment) > loonySlotCount,
    "Your LoonySlot count exceeds your alignment!"
  doAssert loonySlotCount > 1,
    "Your LoonySlot count must be greater than 1!"
  when loonyRotate:
    doAssert (loonySlotCount and (loonySlotCount - 1)) == 0,
      fmt"Your LoonySlot count of {loonySlotCount} is not a power of 2!" &
      " Either disable loonyRotate (-d:loonyRotate=false) or" &
      " change the slot count."
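
The power-of-two check and the shift derivation above can be tried in isolation. The following is a minimal sketch under the default values from this diff; `isPow2`, `slotCount`, `lShift` and `rShift` are illustrative names, not identifiers from loony. It compares the bit trick against `isPowerOfTwo` from `std/math`.

```nim
import std/math

proc isPow2(n: int): bool =
  ## Same bit trick as the doAssert in the diff: a power of two has
  ## exactly one bit set, so clearing the lowest set bit leaves zero.
  n > 0 and (n and (n - 1)) == 0

const
  cacheLineSize = 64                               # as hard-coded in spec.nim
  slotCount = 1024                                 # assumed: default loonySlotCount
  lShift = int(log2(float(cacheLineSize)))         # shift that multiplies by 64
  rShift = int(log2(float(slotCount))) - lShift    # remaining index bits

when isMainModule:
  for n in [1, 2, 3, 512, 1000, 1024]:
    doAssert isPow2(n) == isPowerOfTwo(n)
  echo "lShift=", lShift, " rShift=", rShift
```

The sketch also hints at why the power-of-two requirement exists: the `and (loonySlotCount - 1)` mask in `prn` only behaves like a modulo when the slot count has a single bit set.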

const
## Slot flag constants