This change uses the ruler function to modify the TinyBLAS GEMM function on CPU so that roundoff error accumulation is always 10x less bad, without sacrificing any performance. This matters for larger models such as Command-R+, which has dimensions as large as 26000 elements. For example:

    average error bits
    0b000000000000000000000000xxxxxxxx naive
    0b0000000000000000000000000000xxxx ruler
    0b00000000000000000000000000000xxx kahan
                      └──────┬───────┘
                      original fidelity

    worst case error bits
    0b00000000xxxxxxxxxxxxxxxxxxxxxxxx naive
    0b00000000000000xxxxxxxxxxxxxxxxxx ruler
    0b0000000000000000xxxxxxxxxxxxxxxx kahan
                      └──────┬───────┘
                         bf16 & f16

The new implementation uses a non-recursive divide-and-conquer technique for reducing dot products. It's not as good as Kahan summation, which we previously made available behind the --precise flag, but it seems to limit error growth nearly as well. This means that when you use BF16 and F16 weights, llamafile will preserve the original fidelity. While the previous average-case error of 233 ULP may not seem like a big deal, all it should take is a single worst-case error flipping a single concept in the LLM's brain to sow confusion. So having a better guarantee here matters.

    #include <stddef.h>

    // CHUNK is the kernel's block size; its exact value is an
    // implementation detail (any small power of two works here).
    #define CHUNK 8

    // index of the highest set bit, e.g. bsr(1) == 0, bsr(8) == 3
    static inline int bsr(size_t x) {
        return 63 - __builtin_clzll(x);
    }

    float fsumf_ruler(const float *p, size_t n) {
        size_t i, sp = 0;
        int rule, step = 2;
        float stack[bsr(n / CHUNK + 1) + 1];  // logarithmic scratch space
        for (i = 0; i + CHUNK * 4 <= n; i += CHUNK * 4, step += 2) {
            // sum the next block naively
            float sum = 0;
            for (size_t j = 0; j < CHUNK * 4; ++j)
                sum += p[i + j];
            // the ruler function of the step counter says how many
            // completed partial sums are ready to merge off the stack
            for (rule = bsr(step & -step); --rule;)
                sum += stack[--sp];
            stack[sp++] = sum;
        }
        // drain the stack, then add any leftover tail elements
        float res = 0;
        while (sp)
            res += stack[--sp];
        while (i < n)
            res += p[i++];
        return res;
    }

The reference implementation above is what I call ruler summation. It's a very fast way to sum a sequence of floating point numbers without accumulating too many errors, and it offers superior performance under both IEEE and fast math. The only weird thing about this algorithm is that it requires a variable-length array, but since that array only takes logarithmic space, you should be able to run just about any LLM with a stack size of 64kb, although we'll be increasing it to 128kb in llamafile just to be safe. See also https://oeis.org/A001511
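To make that OEIS reference concrete: bsr(step & -step) in the merge loop is the ruler function of the even step counter, i.e. A001511 evaluated at step/2. A tiny driver (purely illustrative, assuming the bsr() helper above) prints the pattern:

    #include <stdio.h>

    int main(void) {
        // prints 1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 5 — the ruler sequence,
        // which is how many partial sums fsumf_ruler merges at each block
        for (int step = 2; step <= 32; step += 2)
            printf("%d ", bsr(step & -step));
        printf("\n");
        return 0;
    }

The intuition is the same as pairwise summation: because the merge schedule follows the ruler pattern, partial sums are mostly combined with partners of similar magnitude, so error grows roughly logarithmically in n instead of linearly.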
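For comparison, here's textbook Kahan (compensated) summation, the stronger technique behind the --precise flag. This is a generic sketch, not the actual TinyBLAS code:

    #include <stddef.h>

    // Kahan summation: carry the low-order bits lost at each addition in a
    // separate compensation term. Textbook version, not llamafile's kernel.
    float fsumf_kahan(const float *p, size_t n) {
        float sum = 0, err = 0;
        for (size_t i = 0; i < n; ++i) {
            float y = p[i] - err;  // re-inject previously lost low bits
            float t = sum + y;     // big + small loses low bits of y
            err = (t - sum) - y;   // algebraically, what was just lost
            sum = t;
        }
        return sum;
    }

Note that Kahan spends about four float ops per element and its compensation can be optimized away under -ffast-math, which is part of why ruler summation is attractive: it behaves the same under both.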
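Finally, a quick way to sanity-check the error claims yourself (a hypothetical harness, not part of this commit, assuming the fsumf_ruler and fsumf_kahan definitions above): sum a large random vector with each method and compare against a double-precision reference.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 26000;  // on the order of a Command-R+ dimension
        float *p = malloc(n * sizeof(float));
        double ref = 0;    // double accumulator: effectively exact here
        for (size_t i = 0; i < n; ++i) {
            p[i] = (float)rand() / RAND_MAX - 0.5f;
            ref += p[i];
        }
        float naive = 0;
        for (size_t i = 0; i < n; ++i)
            naive += p[i];
        printf("naive error: %g\n", fabs(naive - ref));
        printf("ruler error: %g\n", fabs(fsumf_ruler(p, n) - ref));
        printf("kahan error: %g\n", fabs(fsumf_kahan(p, n) - ref));
        free(p);
        return 0;
    }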
Showing 8 changed files with 203 additions and 237 deletions.