numbits+base128 8-byte full-precision numbers #349

timbray · 2024-08-18T22:41:32Z

For background, see #336

The effect is ability to match full range of float64 and more compact representation, with significant speedup.

Signed-off-by: Tim Bray <tbray@textuality.com>

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

Benchmark suite	Current: `b3b6934`	Previous: `78e2ec8`	Ratio
`BenchmarkCityLots`	`11146` ns/op 1785 B/op 104 allocs/op	`5592` ns/op 773 B/op 31 allocs/op	`1.99`
`BenchmarkCityLots - ns/op`	`11146` ns/op	`5592` ns/op	`1.99`
`BenchmarkCityLots - B/op`	`1785` B/op	`773` B/op	`2.31`
`BenchmarkCityLots - allocs/op`	`104` allocs/op	`31` allocs/op	`3.35`
`Benchmark_JsonFlattner_Evaluate_ContextFields`	`986.9` ns/op 72 B/op 8 allocs/op	`726.2` ns/op 56 B/op 4 allocs/op	`1.36`
`Benchmark_JsonFlattner_Evaluate_ContextFields - ns/op`	`986.9` ns/op	`726.2` ns/op	`1.36`
`Benchmark_JsonFlattner_Evaluate_ContextFields - B/op`	`72` B/op	`56` B/op	`1.29`
`Benchmark_JsonFlattner_Evaluate_ContextFields - allocs/op`	`8` allocs/op	`4` allocs/op	`2`

This comment was automatically generated by workflow using github-action-benchmark.

timbray · 2024-08-18T22:46:32Z

BTW there's something wrong with the benchmarks - my local benchmarks which measure matches/second rather than ns/op have slightly improved in this merge. Need to investigate.

timbray · 2024-08-19T03:29:38Z

numbits.go

+
+// float64 are stored as (sign | exponent | mantissa)
+// with 1 bit sign, 11 bits exponent, 52 bits mantissa
+const ()


Oops, editing error here didn't notice the empty const()

codecov-commenter · 2024-08-19T04:46:11Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 97.29730% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.47%. Comparing base (db62f53) to head (b3b6934).
Report is 2 commits behind head on main.

Files	Patch %	Lines
value_matcher.go	86.66%	2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
- Coverage   96.59%   96.47%   -0.13%     
==========================================
  Files          19       20       +1     
  Lines        1940     1899      -41     
==========================================
- Hits         1874     1832      -42     
- Misses         37       39       +2     
+ Partials       29       28       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Tim Bray <tbray@textuality.com>

timbray · 2024-08-20T00:17:00Z

Thanks again @arnehormann, the variable-length numbits work perfectly. The numbits type is still long of course, but its toUTF8() method does the shortening. I've verified that for the “common” numbers, it produces much shorter slices.

We currently don't have a good benchmark to demonstrate a performance improvement, because all of our number benchmarks are exploring corner cases with weird huge/tiny/high-precision numbers, not just ordinary numbers like the kind that would actually be used. For the output of rand.Float64(), the average length is ~9.7 but those are definitely not everyday numbers.

But I'm confident that in practice this will give higher performance.

ah-quant · 2024-08-20T09:48:22Z

numbers.go

-// numbers, and cryptographic keys/signatures. For these, treatment as strings seems to produce
-// satisfactory results for equality testing.
+// in text data or the internal floating-point bits. Therefore, we map floating-point numbers
+// (which is what JSON numbers basically are) to comparable strings which preserve the ordering


drop first "ordering"

Yep.

Also, thanks a lot for the review. I lost reviewers when all my PRs were digging around inside the NFA implementation, which is really very difficult to review.

ah-quant · 2024-08-20T09:50:27Z

case_folding.go

@@ -3,475 +3,475 @@ package quamina
 // Code generated by code_gen/code_gen - DO NOT EDIT.
 // built from the "C" records in CaseFolding.txt in the Unicode character database
 var caseFoldingPairs = map[rune]rune{


This change should not be part of the PR, it's unrelated (right?)

This is a symptom of a git mistake. The lint checker changed in GitHub but I hadn't updated my local copy, so all of a sudden there were these CI failures. So I did PR #348 that addressed just the lint problems. But I neglected that I also had those in my working copy where I was working with numbits. So they showed as a diff in this PR, even though they are already in the origin repo. I tried to fix with git pull but no luck. So you are correct that this is unrelated but it isn't really in the PR. If you have advice how to manipulate with git to avoid this, that would be good.

You could do a soft reset, add everything you need selectively and commit then. followed by a force push to overwrite the branch. and do all that on a new local branch to make sure you can still revert back to it if you stumble somewhere.
But there's a simpler, hacky approach:

git fetch origin/main --prune
git checkout -b tmp origin/main
git checkout numbits --
git add
git commit
git push -f origin tmp:numbits

That's probably nicer to navigate than with revert etc - directly pick the files you want in a new branch instead of wrangling commits.
requires a bit of cleanup because your local numbits branch will be different to the server one and you should get rid of the tmp branch. But I don't want to add the commands to do destructive operations on your local repo.

If you add stuff and you have multiple changes in your repo, add files individually. add -p can also be used to only add regions in a file. And then, take care to not add -a to commit - that will drag everything else back in.

Since I don't understand what's really going on in that sequence, unless you think it's really damaging, I think I'd prefer to live with the noisy PR. I'm a git idiot.

ah-quant · 2024-08-20T09:51:41Z

code_gen/build_casefolding_table.go

@@ -43,21 +43,21 @@ func main() {
 	}
 	resp, err := http.Get(CaseFoldingURL)
 	if err != nil {
-		fatalF("Can't fetch CaseFolding.txt: " + err.Error())
+		fatal("Can't fetch CaseFolding.txt: " + err.Error())


... also unrelated. Which of course depends on how you want to handle it, but they seem to be added to the commit by accident.
But starting here, I won't remark on that again :-)
Also, I won't review tests to reduce the volume a bit.

Exactly. Added to the commit by mistake. Not sure how to remove.

ah-quant · 2024-08-20T10:02:45Z

flattener.go

@@ -58,5 +58,5 @@ type Field struct {
 	Path       []byte
 	Val        []byte
 	ArrayTrail []ArrayPos
-	IsQNumber  bool
+	isNumber   bool


No change needed, just an api-design question:
Why the change to unexported? It doesn't have to be exported, but since the struct and all other fields and the previous version are all exported, it stands out.
If quamina is used as a library, it could be nice to now and it's probably more relevant for clients than e.g. ArrayTrail.

Oh, just a mistake, should be exported.

ah-quant · 2024-08-20T10:03:29Z

numbers.go

+// (which is what JSON numbers basically are) to comparable strings which preserve the ordering
+// numbers' ordering. Versions of Quamina up to 1.3 used a home-grown format which used 14 hex digits
+// to represent a subset of numbers. This has now been replaced by Arne Hormann's "numbits"
+// construct, see numbits.go. It uses 10 base128 bytes to represent the entire range of float64 numbers.


change "10" to "up to 10"?

ah-quant · 2024-08-20T10:04:08Z

numbers.go

-	for i, utf8Byte := range buf {
-		if i == 0 {
-			continue
+/*


I want to keep this around because it's very useful for debugging and I will definitely need it when I'm implementing the range construct. I guess I should un-comment and use in a test so that I don't get lint complaints about unused function. [Edit] Actually I uncommented it and it's already used in a test implicitly by a %s in fmt.Printf().

ah-quant · 2024-08-20T10:15:03Z

numbits.go

+// toUTF8 turns a numbits into 10 bytes of UTF-8 encoded via Base-256
+// code copied with thanks from a sample by Axel Wagner
+func (nb numbits) toUTF8() []byte {
+	var b [10]byte


If performance critical, this could be unrolled as shown in #334 (comment) - both versions do the same big endian base128 encoding.

be128 := []byte { byte(num >> 57) & 0x7f, // 57 bits remaining byte(num >> 50) & 0x7f, // 50 bits remaining byte(num >> 43) & 0x7f, // 43 bits remaining byte(num >> 36) & 0x7f, // 36 bits remaining byte(num >> 29) & 0x7f, // 29 bits remaining byte(num >> 22) & 0x7f, // 22 bits remaining byte(num >> 15) & 0x7f, // 15 bits remaining byte(num >> 8) & 0x7f, // 8 bits remaining byte(num >> 1) & 0x7f, // 1 bit remaining byte(num & 1) << 6, // last bit (high bit of lower 7 bits) }

and "math/bits" can be used to slice it: (bits.TrailingZeros64(uint64(nb)) + 6) / 7 can determine the end without looping.

I prefer the loop construct, I think it's roughly as readable and takes less screen space. Or am I missing something?

As for TrailingZeros64, I think there's a problem, which is the case when the numbits reduces to just [0] (the smallest possible number I guess). Because removing all the trailing zeroes would reduce this to a n empty slice and the automaton needs at least one byte to match on. That's why the current loop has k>0 not >=0. But, I should comment that because it's non-obvious.

numbit 0 is NaN, cannot happen in JSON ☺️

Heh, OK. But I better say that in a comment…

also, base 256 would be a plain byte. it's big endian base 128 (only the lower 7 bits are used for all bytes, byte order is big endian).

Oh you're right. Off by 1.

Anyhow, I tried out bits.TrailingZeros64 but the code was a bit awkward and the +6 / 7 nonintuitive, but I did go back and optimize a bit; first count back after 0-byte encodings, and leave everything set up for a separate loop that allocates just the right size and copies them in.

it still says Base-256 in the comment

I also made a mistake, there. the +6/7 is only needed if we go from left to right.
It should have been be128[:len(be128)-bits.TrailingZeros64(uint64(nb))/7]

Signed-off-by: Tim Bray <tbray@textuality.com>

timbray · 2024-08-26T23:40:42Z

Wow, I really need to update my local copy of the linter. Surprisingly, I think that lint failure is an actual bug; I will try to prove that with a unit test.

arnehormann · 2024-08-27T04:26:18Z

... and just fyi (expectation management): this variable length scheme will only compress positive integers and power-of-two fractions. It won't do much for negative values because they had to be negated (xored with one bits) in the process to make them lexically sortable.
Still worth it, it doesn't cost much time and no additional bytes at all.

timbray · 2024-08-27T17:15:12Z

Haha, you're unduly pessimistic. numbits are integer-friendly. For the ints 0 to 100,000 inclusive, here are the counts of shortened-numbits lengths:

None higher than 5. All <5 until you get over 1K. But you are correct, negative numbers all seem to be length 10

This is actually very good because in real life, many (most?) of the numbers that get matched are integers, not too large.

arnehormann · 2024-08-27T18:17:34Z

as I said: only positive integers and power of two fractions. but positive values can be as little as 2 bytes 😁
edit... right, it can even be 1 byte 😃

Signed-off-by: Tim Bray <tbray@textuality.com>

timbray · 2024-08-27T21:39:14Z

BTW on that last commit I neglected to add the most recent version of numbits, which has an optimized shortening that causes a tiny performance improvement. A little short of sleep these days…

numbits+base128 8-byte full-precision numbers

ea4edc5

Signed-off-by: Tim Bray <tbray@textuality.com>

github-actions bot reviewed Aug 18, 2024

View reviewed changes

timbray commented Aug 19, 2024

View reviewed changes

timbray added 2 commits August 19, 2024 09:54

fix up comments and README

587fcf3

Signed-off-by: Tim Bray <tbray@textuality.com>

make UTF-8 version of numbits variable-length

5f1d460

Signed-off-by: Tim Bray <tbray@textuality.com>

ah-quant reviewed Aug 20, 2024

View reviewed changes

Address feedback from Arne, rewrite varwidth numbits

8098767

Signed-off-by: Tim Bray <tbray@textuality.com>

fix lint, use latest numbits.go

b3b6934

Signed-off-by: Tim Bray <tbray@textuality.com>

timbray merged commit cd8d31a into main Aug 28, 2024
7 checks passed

timbray deleted the numbits branch August 28, 2024 21:20

timbray mentioned this pull request Sep 10, 2024

Proposal for better number representation #334

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

numbits+base128 8-byte full-precision numbers #349

numbits+base128 8-byte full-precision numbers #349

timbray commented Aug 18, 2024

github-actions bot left a comment •

edited

Loading

timbray commented Aug 18, 2024

timbray Aug 19, 2024 •

edited

Loading

codecov-commenter commented Aug 19, 2024 •

edited

Loading

timbray commented Aug 20, 2024 •

edited

Loading

ah-quant Aug 20, 2024

timbray Aug 20, 2024

ah-quant Aug 20, 2024

timbray Aug 20, 2024

arnehormann Aug 20, 2024

timbray Aug 20, 2024

ah-quant Aug 20, 2024

timbray Aug 20, 2024

ah-quant Aug 20, 2024

timbray Aug 20, 2024

ah-quant Aug 20, 2024

timbray Aug 20, 2024

ah-quant Aug 20, 2024

timbray Aug 20, 2024 •

edited

Loading

ah-quant Aug 20, 2024 •

edited

Loading

timbray Aug 20, 2024

arnehormann Aug 20, 2024

timbray Aug 20, 2024

arnehormann Aug 20, 2024

timbray Aug 20, 2024

timbray Aug 26, 2024

arnehormann Aug 27, 2024

arnehormann Aug 27, 2024 •

edited

Loading

timbray commented Aug 26, 2024

arnehormann commented Aug 27, 2024

timbray commented Aug 27, 2024

arnehormann commented Aug 27, 2024 •

edited

Loading

timbray commented Aug 27, 2024

numbits+base128 8-byte full-precision numbers #349

numbits+base128 8-byte full-precision numbers #349

Conversation

timbray commented Aug 18, 2024

github-actions bot left a comment • edited Loading

Choose a reason for hiding this comment

⚠️ Performance Alert ⚠️

timbray commented Aug 18, 2024

timbray Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

codecov-commenter commented Aug 19, 2024 • edited Loading

Codecov Report

timbray commented Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timbray Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

ah-quant Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnehormann Aug 27, 2024 • edited Loading

Choose a reason for hiding this comment

timbray commented Aug 26, 2024

arnehormann commented Aug 27, 2024

timbray commented Aug 27, 2024

arnehormann commented Aug 27, 2024 • edited Loading

timbray commented Aug 27, 2024

github-actions bot left a comment •

edited

Loading

timbray Aug 19, 2024 •

edited

Loading

codecov-commenter commented Aug 19, 2024 •

edited

Loading

timbray commented Aug 20, 2024 •

edited

Loading

timbray Aug 20, 2024 •

edited

Loading

ah-quant Aug 20, 2024 •

edited

Loading

arnehormann Aug 27, 2024 •

edited

Loading

arnehormann commented Aug 27, 2024 •

edited

Loading