-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
api: add support for a new compact float64 representation #336
Conversation
... resolves #334 |
ugh. Test errors due to Go 1.19 used although go.mod specifies 1.20. I need at least 1.20 for this code. Will try to sort it out tomorrow |
Thanks for this! I'm in the middle of implementing |
One note to myself or whoever… we are going to need a decent benchmark to exhibit the runtime performance impact of moving from Q numbers to numbits. Right now our test that makes heaviest use of numeric matching is |
ISTM the only reason to use Go 1.20 here is that you use |
@Merovius good point. I'll change it. |
20e9f7c
to
1ceb8b6
Compare
Okay. Rebased on main, copied compare code from cmp in Go 1.20, added a new NumbitsFromBinaryString func to sidestep problems with slice to array conversions not available in Go 1.19 and replaced the tests Compare invocations with those of the copied code. Parts of that might have be done more elegantly, but this seems to work. |
Also @timbray: this is not integrated anywhere, yet. If you still require the "invalid UTF-8 bytes" property, the Numbits can be "compressed" after calling Normalize(). Oh, also... I'm calling it "Numbits". The pattern is obvious enough (at least in hindsight), it probably existed before. Still, I did not know it and did not search a lot. I just hope the name is not too far off... if required, we can rename it later to align with the rest of the world. |
877be83
to
6a81b27
Compare
@timbray I made the required changes. Can you trigger the CI again? I didn't do that myself for GitHub actions yet, but this looks like the relevant docs: https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs |
8d1175f
to
87f4576
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #336 +/- ##
==========================================
+ Coverage 96.59% 96.64% +0.04%
==========================================
Files 19 20 +1
Lines 1940 1967 +27
==========================================
+ Hits 1874 1901 +27
Misses 37 37
Partials 29 29 ☔ View full report in Codecov by Sentry. |
Closes: 334 Numbits can be constructed from float64 and can be losslessly converted back to float64. The current fast branchless conversion is possible due to a nerd-snipe of @Merovius. He also threw it a godbolt and gave it some scrutiny. Thanks, Axel!
So what I think I should do is clone this branch and use that as a workspace to see if I can (a) integrate numbits with the finite-automaton code and (b) evaluate any performance differences. The Go benchmarking tool claims to track memory allocations, which is difficult; I wonder how accurate that is? I have not used that, will investigate. The result of replacing 14-byte with 8-byte quantities should be un-subtle. @arnehormann in your view is this now stable, or is there anything else you want to do before I grab it? |
@timbray I am very very happy with this now and love the api. One thing that could be added is the proposal of the encoding Axel described, but that could also live in another file. And there might always be more tests - but the existing ones should cover pretty much everything including exotic edge cases. And memory tracked by benchmarks is very usable and accurate. It might help to wed number decoding very closely to this code. That can and should also happen externally, I chose an api that pretty much has no error cases and is a poster child of "make bad states not representable". But due to the no-error approach, it expects a float64 instead of a []byte of decimal utf8 number string currently. Or the binary version (post conversion) as [8]byte or string. |
OK. The bad news is that, while the API is nice, Quamina needs very little of it. My thinking now is that when flattenJSON encounters a number, we will still use |
That's exactly right. And if -0 is possible after parsing numbers, also Normalize(). And you have to store the Bytes() result in a variable before you can slice it with [:] (has to be addressable). If used in more places, there can also be a convenience func combining it. But it all inlines well and the assembly looks good. And I made it to be usable with more than JSON. |
It turns out that JSON allows -0 but Quamina patterns don't, because the Go JSON tokenizer doesn't. Is that a bug I think? Hm, I guess we need a policy decision. If I make a pattern {"x": [ -0 ]} Should that match this event? {"x": 0} or vice versa? The answer is not obvious to me. In Go (and most programming languages I think?) -0.0==0.0. But presumably if someone creates such a pattern, they care about the sign. |
Concerning comparison (which is of higher interest to the matcher I think), both should be equal - at least according to https://en.m.wikipedia.org/wiki/Signed_zero ... concerning representation, they are distinct. But think of your blog post that started me on this - 1e1 vs 10 vs 10.0 etc are the same number. |
OK, so my conclusion is that, since Go's built-in json tokenizer apparently doesn't support -0 and JSON itself doesn't support NaN, I never need to call |
sounds good. As long as it's for both paths, patterns and events. |
Hi Tim, I just checked - that assumption is flawed, Normalize is still required (well, at least the -0 handling part of it is): https://go.dev/play/p/q0SdRoPJUMp Only the compilers automatically converts -0 to 0; the runtime behavior is different. |
(should be getting back to this next week) We'd need a unit test to show the potential problem with -0, because at the moment I don't think one can get through either Go's JSON reader or Quamina's custom parser. |
OK, I have cloned Arne's branch and will replace qNumber with numbits and do some benchmarking.
|
Now I'm leaning against using numbits. First of all, as I think about redesigning But there's another problem which is tiny and unlikely to happen that I don't know how to fix. Suppose I have the rule {"x": ["foo"] } which is intended to match only JSON data that has a top-level {"x":-77936130622565622861113617120445728215253349187385567291125040555718211817982934064273708603570865020116281354366693784516070188526811668807680} Because its numbits representation is (Don’t ask me how I found that number, it’s too painful. And I’ve never seen a binary search iterate 600 times before.] The reason why? Because we use 0xF5, which can never appear in UTF-8, as an end-of-data signal. We add it to the automata for all the matching values and also to each value when we match it. It simplifies a lot of code because you never have to check whether you're at the end of the input data. But that byte value can of course appear in a Numbits. The annoying thing is that this will almost certainly never happen, the Numbits that could possibly match strings are only these very huge negative values, this one is approximately -7.793613e+142. But it would be very hard to explain if a user managed to encounter it. My mind isn’t 100% made up. If I could work around the second problem here I'd be more motivated to go after the redesign-smallTable problem. I will think some more. |
This could be fixed with the strategy outlined by Axel in #334 (comment) and following. |
OK, that's sensible, I'll have a look. Staying inside UTF-8 bounds is a massive simplification. So we'd get arbitrary precision and simultaneously reduce numeric fields from 14 bytes per numeric field to 10. Sorry, you mentioned #334 twice, did you mean to mention something else the second time? |
No, I did not. Sorry. I initially considered linking his follow-up comment, but base128 is the best choice here and probably not overly difficult to write - or copy and adapt from Axels playground link. |
Ugh. That was wrong. The second link should have been https://pkg.go.dev/encoding/ascii85@go1.23.0 - but I'd still prefer base128. |
I agree. I just finished implementing and unit-testing Numbits.toUTF8(), stealing from Axel's code, and it seems flawless and fast. I think this is now going to be easy. Wow, that base128 text is ugly, but no human should ever see it. |
Sometimes Numbits.Float64() can return NaN func TestBrokenFLoat(t *testing.T) {
var nb Numbits = 18445811164593300620
f := nb.Float64()
if math.IsNaN(f) {
t.Error("NaN")
}
} Is this a problem? This Numbits was created with |
I don't think it's a problem. NaN and infinity are representable and can be created this way, but will never occur with valid json inputs and patterns. |
That's my belief too, so I won't let it worry me. |
FYI: I have finished up replacing Quamina's 14-hex-byte Qnumber construct with 10-byte numbits+base128. I still have a bunch of work to do on the README and docs before I can PR, but here are some numbers. I created a new benchmark, Using the Go benchmarking, here are before and after results.
After:
So, that's pretty good! |
(Although TBH I don't like that each step is allocating this much memory, must investigate…) |
Thanks for this idea, which turns out to have been very fruitful. |
Numbits can be constructed from float64 and can be losslessly converted back to float64.
Doing so required upgrading go.mod to 1.20.
The code is not used yet and still has to be integrated. I can try to work on that, but it will probably be faster if somebody else (Tim 😇 ) does it.
The current fast branchless conversion is possible due to a nerd-snipe of @Merovius. He also threw it a godbolt and gave it some scrutiny. Thanks, Axel!