From 9bc2c6cc2cd9c581dd35a9f748260887032c70dc Mon Sep 17 00:00:00 2001
From: Stefan Karpinski
Date: Tue, 27 Feb 2024 13:03:16 -0500
Subject: [PATCH 1/3] jl_rng_split: fix possible linear correlation between
 four tasks

In response to the demonstrated ease of constructing linearly correlated task
states, and thus linearly related task RNG outputs, this change departs from
using linear mixing for the compression function, which is simply a dot
product in those RNGs. I've shown that we can use any mixing function that's
doubly bijective. This change uses `(2c+1)(2w+1)>>1` to mix the LCG state and
the xoshiro register state and _then_ applies the PCG output function, which
is designed to mask linearity.
---
 src/task.c | 341 ++++++++++++++++++++++-------------------------
 1 file changed, 143 insertions(+), 198 deletions(-)

diff --git a/src/task.c b/src/task.c
index a1e8dabb89f9d..27bbdec27778f 100644
--- a/src/task.c
+++ b/src/task.c
@@ -856,202 +856,143 @@ guaranteed to avoid collisions between the RNG streams of all tasks.

The main RNG is the xoshiro256++ RNG whose state is stored in rngState[0..3].
There is also a small internal RNG used for task forking stored in rngState[4].
This state is used to iterate a linear congruential generator (LCG), which is then
-put through four different variations of the strongest PCG output function,
-referred to as PCG-RXS-M-XS-64 [1]. This output function is invertible: it maps
-a 64-bit state to 64-bit output. This is one of the reasons it's not recommended
-for general purpose RNGs unless space is at an absolute premium, but in our
-usage invertibility is actually a benefit (as is explained below) and adding as
-little additional memory overhead to each task object as possible is preferred.
+combined with xoshiro256's state and put through four different variations of
+the strongest PCG output function, referred to as PCG-RXS-M-XS-64 [1].
The goal of jl_rng_split is to perturb the state of each child task's RNG in -such a way each that for an entire tree of tasks spawned starting with a given -state in a root task, no two tasks have the same RNG state. Moreover, we want to -do this in a way that is deterministic and repeatable based on (1) the root -task's seed, (2) how many random numbers are generated, and (3) the task tree -structure. The RNG state of a parent task is allowed to affect the initial RNG -state of a child task, but the mere fact that a child was spawned should not -alter the RNG output of the parent. This second requirement rules out using the -main RNG to seed children: if we use the main RNG, we either advance it, which -affects the parent's RNG stream or, if we don't advance it, then every child -would have an identical RNG stream. Therefore some separate state must be -maintained and changed upon forking a child task while leaving the main RNG -state unchanged. - -The basic approach is that used by the DotMix [2] and SplitMix [3] RNG systems: -each task is uniquely identified by a sequence of "pedigree" numbers, indicating -where in the task tree it was spawned. This vector of pedigree coordinates is -then reduced to a single value by computing a dot product with a shared vector -of random weights. The weights are common but each pedigree of each task is -distinct, so the dot product of each task is unlikely to be the same. The DotMix -paper provides a proof that this dot product hash value (referred to as a -"compression function") is collision resistant in the sense that the pairwise -collision probability of two distinct tasks is 1/N where N is the number of -possible weight values. Both DotMix and SplitMix use a prime value of N because -the proof requires that the difference between two distinct pedigree coordinates -have a multiplicative inverse, which is guaranteed by N being prime since all -values are invertible then. 
We take a somewhat different approach: instead of -assigning n-ary pedigree coordinates, we assign binary tree coordinates to -tasks, which means that our pedigree vectors have only 0/1 and differences -between them can only be -1, 0 or 1. Since the only possible non-zero coordinate -differences are ±1 which are invertible regardless of the modulus, we can use a -modulus of 2^64, which is far easier and more efficient then using a prime -modulus. It also means that when accumulating the dot product incrementally, as -described in SplitMix, we don't need to multiply weights by anything, we simply -add the random weight for the current task tree depth to the parent's dot -product to derive the child's dot product. - -we instead limit pedigree coordinates to being binary, guaranteeing -invertibility regardless of modulus. When a task spawns a child, the parent and -child share the parent's previous pedigree prefix and the parent appends a zero -to its coordinates, which doesn't affect the task's dot product value, while the -child appends a one, which does produce a new dot product. In this manner a -binary pedigree vector uniquely identifies each task and since the coordinates -are binary, the difference between coordinates is always invertible: 1 and -1 -are their own multiplicative inverses regardless of the modulus. - -How does our assignment of pedigree coordinates to tasks differ from DotMix and -SplitMix? In DotMix and SplitMix, each task has a fixed pedigree vector that -never changes. The root tasks's pedigree is `()`, its first child's pedigree is -`(0,)`, its second child's pedigree is `(2,)` and so on. The length of a task's -pedigree tuple corresponds to how many ancestors tasks it has. Our approach -instead appends 0 to the parent's pedigree when it forks a child and appends 1 -to the child's pedigree at the same time. 
The root task starts with a pedigree -of `()` as before, but when it spawns a child, we update its pedigree to `(0,)` -and give its child a pedigree of `(1,)`. When the root task then spawns a second -child, we update its pedigree to `(0,0)` and give it's second child a pedigree -of `(0,1)`. If the first child spawns a grandchild, the child's pedigree is -changed from `(1,)` to `(1,0)` and the grandchild is assigned a pedigree of -`(1,1)`. In other words, DotMix and SplitMix build an n-ary tree where every -node is a task: parent nodes are higher up the tree and child tasks are children -in the pedigree tree. Our approach is to build a binary tree where only leaves -are tasks and each task spawn replaces a leaf in the tree with two leaves: the -parent moves to the left/zero leaf while the child is the right/one leaf. Since -the tree is binary, the pedigree coordinates are binary. - -It may seem odd for a task's pedigree coordinates to change, but note that we -only ever append zeros to a task's pedigree, which does not change its dot -product. So while the pedigree changes, the dot product is fixed. What purpose -does appending zeros like this serve if the task's dot product doesn't change? -Changing the pedigree length (which is also the binary tree depth) ensures that -the next child spawned by that task will have new and different dot product from -the previous child since it will have a different pseudo-random weight added to -the parent's dot product value. Whereas the pedigree length in DotMix and -SplitMix is unchanging and corresponds to how many ancestors a task has, in our -scheme the pedigree length corresponds to the number of ancestors *plus* -children a task has, which increases every time it spawns another child. - -We use the LCG in rngState[4] to generate pseudorandom weights for the dot -product. Each time a child is forked, we update the LCG in both parent and child -tasks. 
In the parent, that's all we have to do -- the main RNG state remains -unchanged. (Recall that spawning a child should *not* affect subsequent RNG -draws in the parent). The next time the parent forks a child, the dot product -weight used will be different, corresponding to being a level deeper in the -pedigree tree. In the child, we use the LCG state to generate four pseudorandom -64-bit weights (more below) and add each weight to one of the xoshiro256 state -registers, rngState[0..3]. If we assume the main RNG remains unused in all -tasks, then each register rngState[0..3] accumulates a different dot product -hash as additional child tasks are spawned. Each one is collision resistant with -a pairwise collision chance of only 1/2^64. Assuming that the four pseudorandom -64-bit weight streams are sufficiently independent, the pairwise collision -probability for distinct tasks is 1/2^256. If we somehow managed to spawn a -trillion tasks, the probability of a collision would be on the order of 1/10^54. -In other words, practically impossible. Put another way, this is the same as the -probability of two SHA256 hash values accidentally colliding, which we generally -consider so unlikely as not to be worth worrying about. - -What about the random "junk" that's in the xoshiro256 state registers from -normal use of the RNG? For a tree of tasks spawned with no intervening samples -taken from the main RNG, all tasks start with the same junk which doesn't affect -the chance of collision. The Dot/SplitMix papers even suggest adding a random -base value to the dot product, so we can consider whatever happens to be in the -xoshiro256 registers to be that. What if the main RNG gets used between task -forks? In that case, the initial state registers will be different. 
The DotMix -collision resistance proof doesn't apply without modification, but we can -generalize the setup by adding a different base constant to each compression -function and observe that we still have a 1/N chance of the weight value -matching that exact difference. This proves collision resistance even between -tasks whose dot product hashes are computed with arbitrary offsets. We can -conclude that this scheme provides collision resistance even in the face of -different starting states of the main RNG. Does this seem too good to be true? -Perhaps another way of thinking about it will help. Suppose we seeded each task -completely randomly. Then there would also be a 1/2^256 chance of collision, -just as the DotMix proof gives. Essentially what the proof is telling us is that -if the weights are chosen uniformly and uncorrelated with the rest of the -compression function, then the dot product construction is a good enough way to -pseudorandomly seed each task based on its parent's RNG state and where in the -task tree it lives. From that perspective, all we need to believe is that the -dot product construction is random enough (assuming the weights are), and it -becomes easier to believe that adding an arbitrary constant to each dot product -value doesn't make its randomness any worse. - -This leaves us with the question of how to generate four pseudorandom weights to -add to the rngState[0..3] registers at each depth of the task tree. The scheme -used here is that a single 64-bit LCG state is iterated in both parent and child -at each task fork, and four different variations of the PCG-RXS-M-XS-64 output -function are applied to that state to generate four different pseudorandom -weights. Another obvious way to generate four weights would be to iterate the -LCG four times per task split. There are two main reasons we've chosen to use -four output variants instead: - -1. 
Advancing four times per fork reduces the set of possible weights that each - register can be perturbed by from 2^64 to 2^60. Since collision resistance is - proportional to the number of possible weight values, that would reduce - collision resistance. While it would still be strong engough, why reduce it? - -2. It's easier to compute four PCG output variants in parallel. Iterating the - LCG is inherently sequential. PCG variants can be computed independently. All - four can even be computed at once with SIMD vector instructions. The C - compiler doesn't currently choose to do that transformation, but it could. - -A key question is whether the approach of using four variations of PCG-RXS-M-XS -is sufficiently random both within and between streams to provide the collision -resistance we expect. We obviously can't test that with 256 bits, but we have -tested it with a reduced state analogue using four PCG-RXS-M-XS-8 output -variations applied to a common 8-bit LCG. Test results do indicate sufficient -independence: a single register has collisions at 2^5 while four registers only -start having collisions at 2^20. This is actually better scaling of collision -resistance than we theoretically expect. In theory, with one byte of resistance -we have a 50% chance of some collision at 20 tasks, which matches what we see, -but four bytes should give a 50% chance of collision at 2^17 tasks and our -reduced size analogue construction remains collision free at 2^19 tasks. This -may be due to the next observation, which is that the way we generate -pseudorandom weights actually guarantees collision avoidance in many common -situations rather than merely providing collision resistance and thus is better -than true randomness. - -In the specific case where a parent task spawns a sequence of child tasks with -no intervening usage of its main RNG, the parent and child tasks are actually -_guaranteed_ to have different RNG states. 
This is true because the four PCG -streams each produce every possible 2^64 bit output exactly once in the full -2^64 period of the LCG generator. This is considered a weakness of PCG-RXS-M-XS -when used as a general purpose RNG, but is quite beneficial in this application. -Since each of up to 2^64 children will be perturbed by different weights, they -cannot have hash collisions. What about parent colliding with child? That can -only happen if all four main RNG registers are perturbed by exactly zero. This -seems unlikely, but could it occur? Consider the core of the output function: - - p ^= p >> ((p >> 59) + 5); - p *= m[i]; - p ^= p >> 43 - -It's easy to check that this maps zero to zero. An unchanged parent RNG can only -happen if all four `p` values are zero at the end of this, which implies that -they were all zero at the beginning. However, that is impossible since the four -`p` values differ from `x` by different additive constants, so they cannot all -be zero. Stated more generally, this non-collision property: assuming the main -RNG isn't used between task forks, sibling and parent tasks cannot have RNG -collisions. If the task tree structure is more deeply nested or if there are -intervening uses of the main RNG, we're back to relying on "merely" 256 bits of -collision resistance, but it's nice to know that in what is likely the most -common case, RNG collisions are actually impossible. This fact may also explain -better-than-theoretical collision resistance observed in our experiment with a -reduced size analogue of our hashing system. +such a way that for an entire tree of tasks spawned starting with a given root +task state, no two tasks have the same RNG state. Moreover, we want to do this +in a way that is deterministic and repeatable based on (1) the root task's seed, +(2) how many random numbers are generated, and (3) the task tree structure. 
The +RNG state of a parent task is allowed to affect the initial RNG state of a child +task, but the mere fact that a child was spawned should not alter the RNG output +of the parent. This second requirement rules out using the main RNG to seed +children: if we use the main RNG, we either advance it, which affects the +parent's RNG stream or, if we don't advance it, then every child would have an +identical RNG stream. Therefore some separate state must be maintained and +changed upon forking a child task while leaving the main RNG state unchanged. + +The basic approach is a generalization and simplification of that used in the +DotMix [2] and SplitMix [3] RNG systems: each task is uniquely identified by a +sequence of "pedigree" numbers, indicating where in the task tree it was +spawned. This vector of pedigree coordinates is then reduced to a single value +by computing a "dot product" with a shared vector of random weights. I write +"dot product" in quotes because what we use is not an actual dot product. The +linear dot product construction used in both DotMix and SplitMix was found by +@foobar_iv2 [4] to allow easy construction of linear relationships between the +main RNG states of tasks, which was in turn reflected in observable linear +relationships between the outputs of their RNGs. This relationship was between a +minimum of four tasks, so doesn't constitute a collision, per se, but is clearly +undesirable and highlights a hazard of the plain dot product construction. + +As in DotMix and SplitMix, each task is assigned unique task "pedigree" +coordiates. Our pedigree construction is a bit different and uses only binary +coordinates rather than arbitrary integers. Each pedigree is an infinite +sequence of ones and zeros with only finitely many ones. Each task has a "fork +index": the root task has index 0; the fork index of a task the jth child task +of a parent task with fork index i is i+j. 
The root task's coordinates are all
+zeros; each child task's coordinates are the same as its parent's except at
+its fork index, where the parent has a zero while the child has a one. A
+task's coordinates after its fork index are all zeros. The coordinates of a
+task's ancestors are all prefixes of its own coordinates, padded with zeros.
+
+Also as in DotMix and SplitMix, we generate a sequence of pseudorandom weights
+to combine with the coordinates of each task. This sequence is common across all
+tasks, and different mix values for each task derive from their coordinates
+being different. In DotMix and SplitMix, this is a literal dot product: the
+pseudorandom weights are multiplied by corresponding task coordinate and added
+up. While this does provably make collisions as unlikely as randomly assigned
+task seeds, this linear construction can be used to create linearly correlated
+states between states. However, it turns out that the compression construction
+need not be linear, commutative, associative, etc. which allows us to avoid any
+linear or other obvious correlations between related sets of tasks.
+
+To generalize SplitMix's optimized dot product construction, we similarly
+compute each task's compression function value incrementally by combining the
+parent's compression value with the pseudorandom weight corresponding to the
+child's fork index. Formally, if the parent's compression value is c then we can
+compute the child's compression value as c′ = f(c, wᵢ) where w is the vector of
+pseudorandom weights. What is f? It can be any function that is bijective in
+each argument for all values of the other argument:
+
+ * For all c: w ↦ f(c, w) is bijective
+ * For all w: c ↦ f(c, w) is bijective
+
+The proof that these requirements are sufficient to ensure collision resistance
+is in the linked discussion [4]. DotMix/SplitMix are a special case where f is
+just addition. Instead we use a much less simple mixing function:
+
+ 1.
We use (2c+1)(2w+1)÷2 % 2^64 to mix the bits of c and w + 2. We then apply the PCG-RXS-M-XS-64 output function + +The first step thoroughly mixes the bits of the previous compression value and +the pseudorandom weight value using multiplication, which is non-commutative +with xoshiro's operations (xor, shift, rotate). This mixing function is a +bijection on each argument witnessed by these inverses: + + * c′ ↦ (2c′+1)(2w+1)⁻¹÷2 % 2^64 + * w′ ↦ (2c+1)⁻¹(2w′+1)÷2 % 2^64 + +The second PCG output step is a bijection and designed to be significantly +non-linear -- non-linear enough to mask the linearity of the LCG that drives the +PCG-RXS-M-XS-64 RNG and allows it to pass statistical RNG test suites despite +having the same size state and output. In particular, since this mixing function +is highly non-associative and non-linear, we (hopefully) don't have any +discernable relationship between these values: + + * c₀₀ = c + * c₁₀ = f(c, wᵢ) + * c₀₁ = f(c, wⱼ) + * c₁₁ = f(f(c, wᵢ), wⱼ) + +When f is simply `+` then these have a very obvious relationship: + + c₀₀ + c₁₁ == c₁₀ + c₀₁ + +This relationship holds regardless of what wᵢ and wⱼ are and is precisely what +allows easy creation of correlated tasks with the DotMix/SplitMix construction +that we previously used. Expressing any relationship between these values with +our mixing function would require inverting the PCG output function (doable but +non-trivial), knowing the weights wᵢ and wⱼ, and then applying the inversion +functions for those weights appropriately. Since the weights are pseudo-randomly +generated and not directly observable, this is infeasible. + +We maintain an LCG in rngState[4] to generate pseudorandom weights. An LCG by +itself is a very bad RNG, but we combine this one with xoshiro256 state +registers in a non-trivial way and then apply the PCG-RXS-M-XS-64 output +function to that. 
Even if the xoshiro256 states are all zeros, which they should
+never be, the output would be the same as PCG-RXS-M-XS-64, which is a solid
+statistical RNG.
+
+Each time a child is forked, we update the LCG in both parent and child tasks,
+corresponding to increasing the fork index. In the parent, that's all we have to
+do -- the main RNG state remains unchanged. Recall that spawning a child should
+*not* affect subsequent RNG draws in the parent. The next time the parent forks
+a child, the mixing weight used will be different. In the child, we use the LCG
+state to perturb the child's main RNG state registers, rngState[0..3].
+
+Since we want these registers to behave independently, we use four different
+variations on f to mix the LCG state with each of the four main RNG registers.
+Each variation first xors the LCG state with a different random constant before
+combining that value with the old register state via multiplication; the
+PCG-RXS-M-XS-64 output function is then applied to that mixed state, with a
+different multiplier constant for each variation / register index. Xor is used
+in the first step since we multiply the result with the state immediately after
+and multiplication distributes over `+` and commutes with `*`, which makes both
+options suspect; multiplication doesn't distribute over or commute with xor. We
+also use a different odd multiplier in PCG-RXS-M-XS-64 for each RNG register.
+These three sources of variation (different xor constants, different xoshiro256
+state, different PCG multipliers) are sufficient for each of the four outputs to
+behave statistically independently.
[1]: https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf [2]: http://supertech.csail.mit.edu/papers/dprng.pdf [3]: https://gee.cs.oswego.edu/dl/papers/oopsla14.pdf + +[4]: +https://discourse.julialang.org/t/linear-relationship-between-xoshiro-tasks/110454 */ void jl_rng_split(uint64_t dst[JL_RNG_SIZE], uint64_t src[JL_RNG_SIZE]) JL_NOTSAFEPOINT { @@ -1060,26 +1001,30 @@ void jl_rng_split(uint64_t dst[JL_RNG_SIZE], uint64_t src[JL_RNG_SIZE]) JL_NOTSA src[4] = dst[4] = x * 0xd1342543de82ef95 + 1; // high spectrum multiplier from https://arxiv.org/abs/2001.05304 + // random xor constants static const uint64_t a[4] = { - 0xe5f8fa077b92a8a8, // random additive offsets... - 0x7a0cd918958c124d, - 0x86222f7d388588d4, - 0xd30cbd35f2b64f52 + 0x214c146c88e47cb7, + 0xa66d8cc21285aafa, + 0x68c7ef2d7b1a54d4, + 0xb053a7d7aa238c61 }; + // random odd multipliers static const uint64_t m[4] = { 0xaef17502108ef2d9, // standard PCG multiplier - 0xf34026eeb86766af, // random odd multipliers... + 0xf34026eeb86766af, 0x38fd70ad58dd9fbb, 0x6677f9b93ab0c04d }; // PCG-RXS-M-XS-64 output with four variants for (int i = 0; i < 4; i++) { - uint64_t p = x + a[i]; - p ^= p >> ((p >> 59) + 5); - p *= m[i]; - p ^= p >> 43; - dst[i] = src[i] + p; // SplitMix dot product + uint64_t c = src[i]; + uint64_t w = x ^ a[i]; + c += w*(2*c + 1); // c = (2c+1)(2w+1)÷2 % 2^64 (double bijection) + c ^= c >> ((c >> 59) + 5); + c *= m[i]; + c ^= c >> 43; + dst[i] = c; } } From 926e4fba14f94df17dbd3c829a738450c8798824 Mon Sep 17 00:00:00 2001 From: Stefan Karpinski Date: Tue, 27 Feb 2024 13:16:39 -0500 Subject: [PATCH 2/3] Update src/task.c --- src/task.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/task.c b/src/task.c index 27bbdec27778f..0e0bc50554877 100644 --- a/src/task.c +++ b/src/task.c @@ -903,7 +903,7 @@ being different. In DotMix and SplitMix, this is a literal dot product: the pseudorandom weights are multiplied by corresponding task coordinate and added up. 
While this does provably make collisions as unlikely as randomly assigned task seeds, this linear construction can be used to create linearly correlated -states between states. However, it turns out that the compression construction +states between tasks. However, it turns out that the compression construction need not be linear, commutative, associative, etc. which allows us to avoid any linear or other obvious correlations between related sets of tasks. From 57b42e4f4ebe93cf080ec90e623d719dd3173549 Mon Sep 17 00:00:00 2001 From: Stefan Karpinski Date: Tue, 27 Feb 2024 13:51:16 -0500 Subject: [PATCH 3/3] Apply suggestions from code review --- src/task.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/task.c b/src/task.c index 0e0bc50554877..7c8770530c326 100644 --- a/src/task.c +++ b/src/task.c @@ -886,7 +886,7 @@ minimum of four tasks, so doesn't constitute a collision, per se, but is clearly undesirable and highlights a hazard of the plain dot product construction. As in DotMix and SplitMix, each task is assigned unique task "pedigree" -coordiates. Our pedigree construction is a bit different and uses only binary +coordinates. Our pedigree construction is a bit different and uses only binary coordinates rather than arbitrary integers. Each pedigree is an infinite sequence of ones and zeros with only finitely many ones. Each task has a "fork index": the root task has index 0; the fork index of a task the jth child task @@ -938,7 +938,7 @@ non-linear -- non-linear enough to mask the linearity of the LCG that drives the PCG-RXS-M-XS-64 RNG and allows it to pass statistical RNG test suites despite having the same size state and output. In particular, since this mixing function is highly non-associative and non-linear, we (hopefully) don't have any -discernable relationship between these values: +discernible relationship between these values: * c₀₀ = c * c₁₀ = f(c, wᵢ)