Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numbits+base128 8-byte full-precision numbers #349

Merged
merged 5 commits into from
Aug 28, 2024
Merged

numbits+base128 8-byte full-precision numbers #349

merged 5 commits into from
Aug 28, 2024

Conversation

timbray
Copy link
Owner

@timbray timbray commented Aug 18, 2024

For background, see #336

The effect is ability to match full range of float64 and more compact representation, with significant speedup.

Signed-off-by: Tim Bray <tbray@textuality.com>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

Benchmark suite Current: b3b6934 Previous: 78e2ec8 Ratio
BenchmarkCityLots 11146 ns/op 1785 B/op 104 allocs/op 5592 ns/op 773 B/op 31 allocs/op 1.99
BenchmarkCityLots - ns/op 11146 ns/op 5592 ns/op 1.99
BenchmarkCityLots - B/op 1785 B/op 773 B/op 2.31
BenchmarkCityLots - allocs/op 104 allocs/op 31 allocs/op 3.35
Benchmark_JsonFlattner_Evaluate_ContextFields 986.9 ns/op 72 B/op 8 allocs/op 726.2 ns/op 56 B/op 4 allocs/op 1.36
Benchmark_JsonFlattner_Evaluate_ContextFields - ns/op 986.9 ns/op 726.2 ns/op 1.36
Benchmark_JsonFlattner_Evaluate_ContextFields - B/op 72 B/op 56 B/op 1.29
Benchmark_JsonFlattner_Evaluate_ContextFields - allocs/op 8 allocs/op 4 allocs/op 2

This comment was automatically generated by workflow using github-action-benchmark.

@timbray
Copy link
Owner Author

timbray commented Aug 18, 2024

BTW there's something wrong with the benchmarks - my local benchmarks which measure matches/second rather than ns/op have slightly improved in this merge. Need to investigate.

numbits.go Outdated

// float64 are stored as (sign | exponent | mantissa)
// with 1 bit sign, 11 bits exponent, 52 bits mantissa
const ()
Copy link
Owner Author

@timbray timbray Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, editing error here didn't notice the empty const()

@codecov-commenter
Copy link

codecov-commenter commented Aug 19, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 97.29730% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.47%. Comparing base (db62f53) to head (b3b6934).
Report is 2 commits behind head on main.

Files Patch % Lines
value_matcher.go 86.66% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
- Coverage   96.59%   96.47%   -0.13%     
==========================================
  Files          19       20       +1     
  Lines        1940     1899      -41     
==========================================
- Hits         1874     1832      -42     
- Misses         37       39       +2     
+ Partials       29       28       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Tim Bray <tbray@textuality.com>
Signed-off-by: Tim Bray <tbray@textuality.com>
@timbray
Copy link
Owner Author

timbray commented Aug 20, 2024

Thanks again @arnehormann, the variable-length numbits work perfectly. The numbits type is still long of course, but its toUTF8() method does the shortening. I've verified that for the “common” numbers, it produces much shorter slices.

We currently don't have a good benchmark to demonstrate a performance improvement, because all of our number benchmarks are exploring corner cases with weird huge/tiny/high-precision numbers, not just ordinary numbers like the kind that would actually be used. For the output of rand.Float64(), the average length is ~9.7 but those are definitely not everyday numbers.

But I'm confident that in practice this will give higher performance.

numbers.go Outdated
// numbers, and cryptographic keys/signatures. For these, treatment as strings seems to produce
// satisfactory results for equality testing.
// in text data or the internal floating-point bits. Therefore, we map floating-point numbers
// (which is what JSON numbers basically are) to comparable strings which preserve the ordering

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop first "ordering"

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Also, thanks a lot for the review. I lost reviewers when all my PRs were digging around inside the NFA implementation, which is really very difficult to review.

@@ -3,475 +3,475 @@ package quamina
// Code generated by code_gen/code_gen - DO NOT EDIT.
// built from the "C" records in CaseFolding.txt in the Unicode character database
var caseFoldingPairs = map[rune]rune{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should not be part of the PR, it's unrelated (right?)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a symptom of a git mistake. The lint checker changed in GitHub but I hadn't updated my local copy, so all of a sudden there were these CI failures. So I did PR #348 that addressed just the lint problems. But I neglected that I also had those in my working copy where I was working with numbits. So they showed as a diff in this PR, even though they are already in the origin repo. I tried to fix with git pull but no luck. So you are correct that this is unrelated but it isn't really in the PR. If you have advice how to manipulate with git to avoid this, that would be good.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do a soft reset, add everything you need selectively and commit then. followed by a force push to overwrite the branch. and do all that on a new local branch to make sure you can still revert back to it if you stumble somewhere.
But there's a simpler, hacky approach:

git fetch origin/main --prune
git checkout -b tmp origin/main
git checkout numbits --
git add
git commit
git push -f origin tmp:numbits

That's probably nicer to navigate than with revert etc - directly pick the files you want in a new branch instead of wrangling commits.
requires a bit of cleanup because your local numbits branch will be different to the server one and you should get rid of the tmp branch. But I don't want to add the commands to do destructive operations on your local repo.

If you add stuff and you have multiple changes in your repo, add files individually. add -p can also be used to only add regions in a file. And then, take care to not add -a to commit - that will drag everything else back in.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I don't understand what's really going on in that sequence, unless you think it's really damaging, I think I'd prefer to live with the noisy PR. I'm a git idiot.

@@ -43,21 +43,21 @@ func main() {
}
resp, err := http.Get(CaseFoldingURL)
if err != nil {
fatalF("Can't fetch CaseFolding.txt: " + err.Error())
fatal("Can't fetch CaseFolding.txt: " + err.Error())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... also unrelated. Which of course depends on how you want to handle it, but they seem to be added to the commit by accident.
But starting here, I won't remark on that again :-)
Also, I won't review tests to reduce the volume a bit.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. Added to the commit by mistake. Not sure how to remove.

flattener.go Outdated
@@ -58,5 +58,5 @@ type Field struct {
Path []byte
Val []byte
ArrayTrail []ArrayPos
IsQNumber bool
isNumber bool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed, just an api-design question:
Why the change to unexported? It doesn't have to be exported, but since the struct and all other fields and the previous version are all exported, it stands out.
If quamina is used as a library, it could be nice to now and it's probably more relevant for clients than e.g. ArrayTrail.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just a mistake, should be exported.

numbers.go Outdated
// (which is what JSON numbers basically are) to comparable strings which preserve the ordering
// numbers' ordering. Versions of Quamina up to 1.3 used a home-grown format which used 14 hex digits
// to represent a subset of numbers. This has now been replaced by Arne Hormann's "numbits"
// construct, see numbits.go. It uses 10 base128 bytes to represent the entire range of float64 numbers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change "10" to "up to 10"?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

numbers.go Outdated
for i, utf8Byte := range buf {
if i == 0 {
continue
/*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop?

Copy link
Owner Author

@timbray timbray Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to keep this around because it's very useful for debugging and I will definitely need it when I'm implementing the range construct. I guess I should un-comment and use in a test so that I don't get lint complaints about unused function. [Edit] Actually I uncommented it and it's already used in a test implicitly by a %s in fmt.Printf().

numbits.go Outdated
// toUTF8 turns a numbits into 10 bytes of UTF-8 encoded via Base-256
// code copied with thanks from a sample by Axel Wagner
func (nb numbits) toUTF8() []byte {
var b [10]byte
Copy link

@ah-quant ah-quant Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If performance critical, this could be unrolled as shown in #334 (comment) - both versions do the same big endian base128 encoding.

be128 := []byte {
  byte(num >> 57) & 0x7f, // 57 bits remaining
  byte(num >> 50) & 0x7f, // 50 bits remaining
  byte(num >> 43) & 0x7f, // 43 bits remaining
  byte(num >> 36) & 0x7f, // 36 bits remaining
  byte(num >> 29) & 0x7f, // 29 bits remaining
  byte(num >> 22) & 0x7f, // 22 bits remaining
  byte(num >> 15) & 0x7f, // 15 bits remaining
  byte(num >> 8) & 0x7f, //  8 bits remaining
  byte(num >> 1) & 0x7f, //  1 bit  remaining
  byte(num & 1) << 6, // last bit (high bit of lower 7 bits)
}

and "math/bits" can be used to slice it: (bits.TrailingZeros64(uint64(nb)) + 6) / 7 can determine the end without looping.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the loop construct, I think it's roughly as readable and takes less screen space. Or am I missing something?

As for TrailingZeros64, I think there's a problem, which is the case when the numbits reduces to just [0] (the smallest possible number I guess). Because removing all the trailing zeroes would reduce this to a n empty slice and the automaton needs at least one byte to match on. That's why the current loop has k>0 not >=0. But, I should comment that because it's non-obvious.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numbit 0 is NaN, cannot happen in JSON ☺️

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, OK. But I better say that in a comment…

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, base 256 would be a plain byte. it's big endian base 128 (only the lower 7 bits are used for all bytes, byte order is big endian).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're right. Off by 1.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyhow, I tried out bits.TrailingZeros64 but the code was a bit awkward and the +6 / 7 nonintuitive, but I did go back and optimize a bit; first count back after 0-byte encodings, and leave everything set up for a separate loop that allocates just the right size and copies them in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still says Base-256 in the comment

Copy link
Collaborator

@arnehormann arnehormann Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also made a mistake, there. the +6/7 is only needed if we go from left to right.
It should have been be128[:len(be128)-bits.TrailingZeros64(uint64(nb))/7]

Signed-off-by: Tim Bray <tbray@textuality.com>
@timbray
Copy link
Owner Author

timbray commented Aug 26, 2024

Wow, I really need to update my local copy of the linter. Surprisingly, I think that lint failure is an actual bug; I will try to prove that with a unit test.

@arnehormann
Copy link
Collaborator

... and just fyi (expectation management): this variable length scheme will only compress positive integers and power-of-two fractions. It won't do much for negative values because they had to be negated (xored with one bits) in the process to make them lexically sortable.
Still worth it, it doesn't cost much time and no additional bytes at all.

@timbray
Copy link
Owner Author

timbray commented Aug 27, 2024

Haha, you're unduly pessimistic. numbits are integer-friendly. For the ints 0 to 100,000 inclusive, here are the counts of shortened-numbits lengths:

1: 1
2: 1
3: 115
4: 7590
5: 92294

None higher than 5. All <5 until you get over 1K. But you are correct, negative numbers all seem to be length 10

This is actually very good because in real life, many (most?) of the numbers that get matched are integers, not too large.

@arnehormann
Copy link
Collaborator

arnehormann commented Aug 27, 2024

as I said: only positive integers and power of two fractions. but positive values can be as little as 2 bytes 😁
edit... right, it can even be 1 byte 😃

Signed-off-by: Tim Bray <tbray@textuality.com>
@timbray
Copy link
Owner Author

timbray commented Aug 27, 2024

BTW on that last commit I neglected to add the most recent version of numbits, which has an optimized shortening that causes a tiny performance improvement. A little short of sleep these days…

@timbray timbray merged commit cd8d31a into main Aug 28, 2024
7 checks passed
@timbray timbray deleted the numbits branch August 28, 2024 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants