
chore: flattenJson: efficient strings, arrays and objects skip. #119

Merged · 9 commits merged into timbray:main from yosiat:skipper · Oct 5, 2022

Conversation

@yosiat (Collaborator) commented Oct 4, 2022

Implement a way to efficiently skip values: we just scan the buffer until we find the end of the string/array/object, without inspecting the inner values.
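
For illustration, here is a minimal standalone sketch of the string case (hypothetical code with made-up names, not the actual diff in this PR): the scanner just hunts for the closing quote, hopping over escape sequences instead of decoding them. The same shape works for objects and arrays, with a nesting counter instead of an escape check.

```go
package main

import (
	"errors"
	"fmt"
)

// skipString scans forward for the closing quote of a JSON string without
// decoding anything. i points at the first byte after the opening '"'.
// It returns the index just past the closing quote.
func skipString(buf []byte, i int) (int, error) {
	for i < len(buf) {
		switch buf[i] {
		case '\\':
			i += 2 // hop over the backslash and the byte it escapes
		case '"':
			return i + 1, nil
		default:
			i++
		}
	}
	return 0, errors.New("unterminated string")
}

func main() {
	buf := []byte(`he said \"skip me\"", "next": 1}`)
	end, err := skipString(buf, 0)
	fmt.Println(string(buf[end:]), err) // => , "next": 1} <nil>
}
```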

```
benchmark                                                old ns/op     new ns/op     delta
Benchmark_JsonFlattener_ContextFields-10                 17312         9053          -47.71%
Benchmark_JsonFlattener_MiddleNestedField-10             17426         10713         -38.52%
Benchmark_JsonFlattener_LastField-10                     17060         9553          -44.00%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         17645         9495          -46.19%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     17935         11152         -37.82%
Benchmark_JsonFlattner_Evaluate_LastField-10             17475         9882          -43.45%

benchmark                                                old allocs     new allocs     delta
Benchmark_JsonFlattener_ContextFields-10                 23             3              -86.96%
Benchmark_JsonFlattener_MiddleNestedField-10             24             4              -83.33%
Benchmark_JsonFlattener_LastField-10                     22             2              -90.91%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         33             13             -60.61%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     33             13             -60.61%
Benchmark_JsonFlattner_Evaluate_LastField-10             29             9              -68.97%

benchmark                                                old bytes     new bytes     delta
Benchmark_JsonFlattener_ContextFields-10                 560           48            -91.43%
Benchmark_JsonFlattener_MiddleNestedField-10             576           64            -88.89%
Benchmark_JsonFlattener_LastField-10                     544           32            -94.12%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         992           480           -51.61%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     1000          488           -51.20%
Benchmark_JsonFlattner_Evaluate_LastField-10             824           312           -62.14%
```

See discussion here: #113 (comment)

@yosiat yosiat marked this pull request as ready for review October 4, 2022 15:00
@@ -0,0 +1,156 @@
package quamina
Collaborator Author

I added the benchmarks I am presenting here; I think it would be good to commit them (if you agree, I'll move status.json to the testdata directory).

What do you think?

Owner

+1 yes please.

Collaborator Author (@yosiat, Oct 4, 2022)

Cool! I'll make a proper commit for that.

Do we want to run the benchmarks in CI? I'm not sure about it, since I don't think GitHub Actions will give us good results, and it will take some time to run the benchmarks.

Owner

See issue #30 - one problem is that GitHub runs actions single-threaded on a pathetic tiny instance. So we probably can't routinely run them all. This bothers me a bit because I want to make sure that nothing gets checked in that accidentally causes a noticeable slowdown.

@timbray (Owner) left a comment

Nice! My only worry is about skipBlock but maybe I'm missing something obvious?

I notice in the coverage report that the coverage went down a bit: https://app.codecov.io/github/timbray/quamina/commit/9b8031b158a40424d90574f38d4993e140c67b48

I'm not a fanatic about coverage percent but when it goes down that always makes me nervous. I looked at the actual coverage red/green and didn't see anything uncovered that looked very worrying, but please have a glance at the coverage and see if there's anything that surprises you.

@@ -475,6 +496,60 @@ func (fj *flattenJSON) readLiteral(literal []byte) ([]byte, error) {
return literal, nil
}

func (fj *flattenJSON) skipBlock(openSymbol byte, closeSymbol byte) error {
Owner

This func makes me nervous. Normally when we're going through the buffer we use fj.step() and fj.ch(); is there a reason why this is different?

Also, nesting: suppose you have an array (arguments [, ]) and one of the elements is an object, possibly deeply nested? And how about an object ({, }) where a member value is an array?

Collaborator Author

> Also, nesting: suppose you have an array (arguments [, ]) and one of the elements is an object, possibly deeply nested? And how about an object ({, }) where a member value is an array?

Nesting is covered here with the levels; you can see that we are skipping most of status.json, which contains nested objects with arrays.

I'll add specific tests for that, to validate the skipping of objects & arrays.
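
For reference, a minimal sketch of the level idea (hypothetical code, not the exact skipBlock in this diff). String handling is elided here for brevity; a real skipper must also step over strings so that brackets inside them don't disturb the count.

```go
package main

import (
	"errors"
	"fmt"
)

// skipBlock skips one object or array: nested open symbols bump a counter,
// and the scan stops when the matching close symbol brings it back to zero.
// It is called with i pointing just past the opening symbol and returns the
// index just past the matching close.
func skipBlock(buf []byte, i int, openSymbol, closeSymbol byte) (int, error) {
	level := 1
	for ; i < len(buf); i++ {
		switch buf[i] {
		case openSymbol:
			level++
		case closeSymbol:
			level--
			if level == 0 {
				return i + 1, nil
			}
		}
	}
	return 0, errors.New("unterminated block")
}

func main() {
	payload := []byte(`[1, [2, [3, 4]], 5], "after": true}`)
	end, err := skipBlock(payload, 1, '[', ']') // skip the nested array value
	fmt.Println(string(payload[end:]), err)     // => , "after": true} <nil>
}
```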

> Normally when we're going through the buffer we use fj.step() and fj.ch(); is there a reason why this is different?

This is actually an interesting case! I initially wrote the code without step and ch, and when I switched to those methods the time taken (not the allocations) increased. I'm talking about something like:

```
for {
	ch := fj.ch()

	// process ch

	if err := fj.step(); err != nil {
		return err
	}
}
```

So I tried to investigate this. It didn't make any sense, since these methods are inlined and cheap, so I did manual inlining and got no improvement.

What I finally found is that iterating over fj.event with fj.eventIndex is slower than:

```
data := fj.event[fj.eventIndex:]
i := 0
// iterate with i until len(data)
```

I am not sure why iterating from the middle is slower than iterating from the start. I was thinking about checking it with Compiler Explorer to see what the difference is. I'll try that, but I am not an assembly expert.
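
To make the two styles easy to compare in isolation, here is a hypothetical micro-benchmark sketch (not the gist linked later in this thread, and not code from this PR): one loop advances an index stored as a field on a deliberately heap-allocated struct, the other rebases into a local slice and indexes from zero. No claim is made about the numbers; run it with go test -bench=. and compare.

```go
package perf

import "testing"

// cursor mimics a flattener that keeps its buffer and position as fields.
type cursor struct {
	event      []byte
	eventIndex int
}

//go:noinline
func newCursor(event []byte) *cursor {
	// noinline so the returned pointer escapes and the cursor lives on the
	// heap, like a long-lived flattener would.
	return &cursor{event: event}
}

var doc = func() []byte {
	b := make([]byte, 0, 1<<15)
	for len(b) < 1<<15 {
		b = append(b, `{"k":"value","n":[1,2,3]},`...)
	}
	return b
}()

var sink int // keeps the compiler from optimizing the loops away

func BenchmarkStructIndex(b *testing.B) {
	for n := 0; n < b.N; n++ {
		c := newCursor(doc)
		quotes := 0
		for c.eventIndex < len(c.event) {
			if c.event[c.eventIndex] == '"' {
				quotes++
			}
			c.eventIndex++
		}
		sink = quotes
	}
}

func BenchmarkLocalSlice(b *testing.B) {
	for n := 0; n < b.N; n++ {
		c := newCursor(doc)
		data := c.event[c.eventIndex:] // rebase: local slice header, local index
		quotes := 0
		for i := 0; i < len(data); i++ {
			if data[i] == '"' {
				quotes++
			}
		}
		sink = quotes
	}
}
```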

Owner

Huh... agree, makes no sense to me. data is now a new []byte against the same backing memory; maybe Go has some magic about indexing into []byte when the index is small? Anyhow, if this is repeatable, maybe we should have a fancier version of step() and ch() that rebases the slice and then all the parts of flattenJSON get the benefit?

Also, unless the improvement is dramatic I'm a little reluctant to have one piece of the big complicated function that arbitrarily has its own process for traversing the buffer.

Collaborator Author

> Anyhow, if this is repeatable, maybe we should have a fancier version of step() and ch() that rebases the slice and then all the parts of flattenJSON get the benefit?

I was thinking about adding an iterator over the event that hands you the current character; it might do what you are saying, but I didn't want to introduce a big change like that in this PR.

> Also, unless the improvement is dramatic I'm a little reluctant to have one piece of the big complicated function that arbitrarily has its own process for traversing the buffer.

The improvement was dramatic. I'll post the benchmark differences tomorrow and let's make a decision based on that.

Owner

> The improvement was dramatic. I'll post the benchmark differences tomorrow and let's make a decision based on that.

Wow, given that ch() and step() are inlined, that is really weird (also they didn’t show up in profiling output). Anyhow, there's a lesson hiding in here somewhere and once we figure it out we should give all of flattener the benefit. I agree, probably another PR for that.

Collaborator Author

Update on benchmarks: I rewrote skipStringValue and skipBlock with step & ch:

```
benchmark                                                old ns/op     new ns/op     delta
Benchmark_JsonFlattener_ContextFields-10                 9062          12211         +34.75%
Benchmark_JsonFlattener_MiddleNestedField-10             10730         13619         +26.92%
Benchmark_JsonFlattener_LastField-10                     9570          12598         +31.64%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         9487          12607         +32.89%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     11122         14021         +26.07%
Benchmark_JsonFlattner_Evaluate_LastField-10             9873          12893         +30.59%

benchmark                                                old allocs     new allocs     delta
Benchmark_JsonFlattener_ContextFields-10                 3              3              +0.00%
Benchmark_JsonFlattener_MiddleNestedField-10             4              4              +0.00%
Benchmark_JsonFlattener_LastField-10                     2              2              +0.00%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         13             13             +0.00%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     13             13             +0.00%
Benchmark_JsonFlattner_Evaluate_LastField-10             9              9              +0.00%

benchmark                                                old bytes     new bytes     delta
Benchmark_JsonFlattener_ContextFields-10                 48            48            +0.00%
Benchmark_JsonFlattener_MiddleNestedField-10             64            64            +0.00%
Benchmark_JsonFlattener_LastField-10                     32            32            +0.00%
Benchmark_JsonFlattner_Evaluate_ContextFields-10         480           480           +0.00%
Benchmark_JsonFlattner_Evaluate_MiddleNestedField-10     488           488           +0.00%
Benchmark_JsonFlattner_Evaluate_LastField-10             312           312           +0.00%
```

I tried to write a representative benchmark - https://gist.github.com/yosiat/930426f4baea92b5f188203c6acb716b - and it doesn't make sense to me. I checked with https://godbolt.org/ and it looks like the instructions for step are the same as those for the inline code, in the relevant parts 🤔

for i < len(data) {
	c := data[i]

	if c == '\\' && i+1 < len(data) && data[i+1] == '"' {
Owner

I'd put a short comment here explaining why you can skip all the other \\ escapes, because the only thing you care about is an unescaped "

Collaborator Author

Will add!

@yosiat (Collaborator Author) commented Oct 4, 2022

Thanks for the speedy review!

> I'm not a fanatic about coverage percent but when it goes down that always makes me nervous. I looked at the actual coverage red/green and didn't see anything uncovered that looked very worrying, but please have a glance at the coverage and see if there's anything that surprises you.

This is actually interesting! I checked the coverage, and when I ran only the TestFJ tests the coverage actually increased from ~24% to ~30%.

I'll give it a look; it's important that we have (almost) perfect coverage for JSON parsing.

I'll push commits to address the changes you are raising, and before merging I'll do some rebasing with squashes (so it follows the commit conventions and is consolidated together).

@embano1 (Collaborator) commented Oct 5, 2022

Regarding benchmarks: yeah it's tricky, but we could take a look at benchmark-action/github-action-benchmark and see if it provides value for us?

@yosiat (Collaborator Author) commented Oct 5, 2022

> Regarding benchmarks: yeah it's tricky, but we could take a look at benchmark-action/github-action-benchmark and see if it provides value for us?

This looks really awesome!! It will be interesting to use it here and get alerts on PRs when we have a regression. I'll do it in a PR immediately after this one is merged, and let's see what we think about the results.

By the way, I think it would be interesting to have the CityLots benchmark as a Go benchmark. I tried doing so, but the results included reading the city-lots file for some reason; I'll take another look.

@codecov-commenter commented

Codecov Report

Base: 94.48% // Head: 94.08% // Decreases project coverage by -0.40% ⚠️

Coverage data is based on head (f6b7e9c) compared to base (4b37ecc).
Patch coverage: 82.81% of modified lines in pull request are covered.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   94.48%   94.08%   -0.41%     
==========================================
  Files          16       16              
  Lines        1849     1909      +60     
==========================================
+ Hits         1747     1796      +49     
- Misses         71       78       +7     
- Partials       31       35       +4     
```

Impacted file: flatten_json.go, coverage 94.09% <82.81%> (-1.36%) ⬇️

☔ View full report at Codecov.

@yosiat (Collaborator Author) commented Oct 5, 2022

@timbray the PR is ready for another round of review.

Regarding code coverage, the comment says there is a decrease, but when I open https://app.codecov.io/gh/timbray/quamina/pull/119 there is an increase of 0.17% and for flatten_json.go the increase is 0.42%.

About the commits: once you approve, I'll merge, but before merging I'll squash the commits (manually, with an interactive rebase) and do a force-push.

@yosiat yosiat requested a review from timbray October 5, 2022 10:58
@timbray (Owner) commented Oct 5, 2022

When I look at the codecov report, it says the coverage is up: https://app.codecov.io/github/timbray/quamina/commit/76952717c23cb3f1cc04d1f9f9a61dd23876d876

OK, will review today.

@@ -475,6 +489,65 @@ func (fj *flattenJSON) readLiteral(literal []byte) ([]byte, error) {
return literal, nil
}

func (fj *flattenJSON) skipBlock(openSymbol byte, closeSymbol byte) error {
Owner

So skipBlock sort of works, but I think its behavior is a little troubling. Try this unit test:

func TestSkipBlock(t *testing.T) {
	blocks := []string{
		`[ { foo ] ]`,
		`{ "foo": }, "bar": 1 }"`,
	}

	for _, block := range blocks {
		f := newJSONFlattener()
		fj := f.(*flattenJSON)
		fj.event = []byte(block)
		fj.eventIndex = 0
		var err error
		if block[0] == '[' {
			err = fj.skipBlock('[', ']')
		} else if block[0] == '{' {
			err = fj.skipBlock('{', '}')
		}
		if err == nil {
			fmt.Printf("for %s block=%s\n", block, string(fj.event[fj.eventIndex:]))
		} else {
			fmt.Printf("for %s err=%s\n", block, err.Error())
		}
	}
}

Is this a problem?

Collaborator Author

skipBlock comes with a trade-off: we won't validate the JSON during parsing. So if we get a block (which we are not interested in for matching) that contains invalid data, we won't return an error.

To me, it makes sense; it would only be problematic if it were a block that we are interested in and it's not valid.

Collaborator Author

Another perspective is the early skipping that we want to implement next.
Consider an event like { "context": ..., "payload": ... } where we have patterns only on context fields; in that case, once we fetch them all, we won't read payload at all.

In that case, payload can be invalid JSON, for example: { "context": { "id": 1 }, "payload": { "a": }.

Owner

OK, I'll be honest, the notion that we would fail to detect broken JSON was sort of shocking to me, and remember that, because of Hyrum's Law, this becomes part of our contract with the world. So:

We need a change to README.md describing this behavior. I just checked and the discussion of error-checking in MatchesForJSONEvent() is already weak, so I'll give myself an action item to take care of that.

I'll approve the PR but I'd like to request that you do a little investigation and figure out why ch() and step() seem to be slowing things down, i.e. why the code in skipBlock looks different from the rest of flatten_json.go. OK? If you can't I will, but at this moment you have all this swapped into your head.

Collaborator Author

I did an investigation. It would be nice if you could enable the "Discussions" feature so we can talk about it over there.

I ran some benchmarks while playing around, and published the results and the code here: https://github.com/yosiat/quotes-count

The reason that step & ch are slower is essentially that they operate on heap-allocated state, while the other example keeps its state on the stack. For obvious reasons we can't just move flattenJSON onto the stack.

The interesting things I found are:

  • _QuotesCount_Heap_For_EventIndex does for eventIndex < len(event) and increments eventIndex; on M1 it shows a performance benefit, while on Intel it doesn't.
  • _QuotesCount_Heap_Next: instead of a for loop without a condition, I added a next method which just checks the eventIndex; on M1 it's faster, but on Intel it's slower, so we can't use this trick.

I checked how other JSON parsers manage their state, and found that it splits:

  • Most of them do the same as flattenJSON: they reference their state through heap pointers.
  • Others have each method accept the rest of the payload to process and return an offset, so, for example, reading an array would be offset, value, err := fj.readArray(fj.event[i:]) and then, upon success, i += offset; that way both the index and the event stay on the stack (see the sketch below).

I can do a POC of the second option to see how much it improves performance, but it will have worse DX and be more error-prone, so I am not sure the trade-off (performance vs DX) will be worth it.
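
For concreteness, here is a rough sketch of what that second, offset-returning style could look like (hypothetical names and signature, string handling omitted for brevity; not quamina's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// readArray takes the remaining bytes, skips one array, and reports how far
// it advanced; the caller keeps the running index as a plain local variable.
func readArray(rest []byte) (int, error) {
	if len(rest) == 0 || rest[0] != '[' {
		return 0, errors.New("not an array")
	}
	level := 0
	for i, c := range rest {
		switch c {
		case '[':
			level++
		case ']':
			level--
			if level == 0 {
				return i + 1, nil
			}
		}
	}
	return 0, errors.New("unterminated array")
}

func main() {
	event := []byte(`{"a": [[1, 2], 3], "b": 4}`)
	i := 6 // index of the '[' that starts the value of "a"
	offset, err := readArray(event[i:])
	if err == nil {
		i += offset // both i and the slice header stay in this stack frame
	}
	fmt.Println(i, string(event[i:])) // => 17 , "b": 4}
}
```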

What do you think?

Owner

OK, Discussions are now enabled and I launched one. I have some questions about your most recent comment - want to replicate this over in the discussion, and I'll follow up there?

Collaborator Author

Cool! Replicated to #122, with some context added for new folks.

@timbray timbray merged commit 08c5f54 into timbray:main Oct 5, 2022
@yosiat yosiat deleted the skipper branch October 6, 2022 21:25
@yosiat yosiat mentioned this pull request Oct 11, 2022
yosiat added commits to yosiat/quamina referencing this pull request (Oct 12-30, 2022)