Make the flattener smarter about skipping fields #113
Replying to @timbray's last comment on the former issue:

I am not sure how we can implement this part. Have you looked at my proposed paths-tree data structure? Maintaining it will be fairly simple, as I have already done that (haven't considered deletions yet). It will be handed to the Flattener, which can use this information to understand which fields to pluck (it's agnostic to JSON and can be used for other formats as well). I conducted some benchmarks; in my case the events look like:

{
  "context": { /* small list of properties, <10 */ },
  "payload": { /* large object, nested one here */ }
}

and I am doing pattern matches against
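A minimal sketch of what such a paths tree could look like (the names and the dotted-path format here are illustrative assumptions, not the actual code from my branch):

```go
package main

import (
	"fmt"
	"strings"
)

// pathNode is one segment in the tree of paths that patterns care about.
// A flattener can consult this tree while traversing an event: if the
// current member name is not a child of the current node, the whole
// corresponding subtree of the event can be skipped.
type pathNode struct {
	children map[string]*pathNode
	leaf     bool // true if some pattern matches on this exact path
}

func newPathNode() *pathNode {
	return &pathNode{children: map[string]*pathNode{}}
}

// add registers a dotted path such as "context.name".
func (n *pathNode) add(path string) {
	cur := n
	for _, seg := range strings.Split(path, ".") {
		child, ok := cur.children[seg]
		if !ok {
			child = newPathNode()
			cur.children[seg] = child
		}
		cur = child
	}
	cur.leaf = true
}

// wanted reports whether a segment is worth descending into,
// returning the subtree to use while traversing that member.
func (n *pathNode) wanted(seg string) (*pathNode, bool) {
	child, ok := n.children[seg]
	return child, ok
}

func main() {
	root := newPathNode()
	root.add("context.name")
	if _, ok := root.wanted("payload"); !ok {
		fmt.Println("skip payload") // nothing under "payload" is matched on
	}
}
```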
Here's a question: is it true that this optimization only works when the fields you want to match are concentrated in the leading bytes of the event data? Maybe do a benchmark to see if jx is still helpful when the matching fields are randomly distributed, or in the trailing bytes? If so, then the answer to the question is No, and that is super interesting, and the jx style should probably replace the current

If yes, then we need to figure out what data needs to be passed to the jx-style scanner to get the optimization. My first suggestion was to identify some field at which the scanner can safely stop, but I'm sure there are other good ideas. Quamina has a
Oops, looks like we were writing comments in parallel… I don't quite understand the paths-tree idea; in your example, are you restricting the flattening to the
Anyhow, looking forward to seeing benchmark results.
Just to make sure I understand this properly. Anyhow, what you are suggesting is where I am going in the benchmarks I wrote above -
In the benchmarks I posted in the original post, this is what I did; but now I am running benchmarks to understand the performance when I am checking against 4 cases:
I'll post the benchmarks soon (today/tomorrow); I just want to find a public data set that I can use instead of my internal data.
Ran the benchmarks for the 3 cases above with public data (which I found in the simdjson repository). Benchmarks are here: yosiat@0dcf35f
To run, check out this branch locally and run -

If you want me to conduct more benchmarks it will be easy now with the helpers I made, just let me know.

Context fields

Fields:
Flattening only:
Matching:

Middle nested field

Field:
Flattening only:
Matching:

Last field

Field:
Flattening only:
Matching:
I have done some basic investigation of the existing flattener given those benchmarks, and got some interesting findings:
I've got to go sleep, but I am a bit optimistic that I can get some huge wins by reducing allocations and increasing the performance of the flattener by actually skipping values instead of reading them. Will update, probably tomorrow or over the weekend, with some results.
Yeah. The core idea of flattenJSON is that the whole event is a []byte and we want Field structures that also have []byte for the field value, so it tries to just make a sub-slice []byte for the values, which should only require allocating a slice structure. But we need the field values to be pure UTF-8 so we can run a byte-level state machine, so for any field that contains JSON \-escaping, we have to allocate a buffer and de-escape the JSON bytes into it. That simdjson sample data has lots of \-escapes in most fields, so a lot of allocating.

The other thing flattenJSON does is enrich the

You might find it helpful to read https://www.tbray.org/ongoing/When/202x/2022/06/10/Quamina-Optimizing to understand what flattenJSON does. Sorry, it's kind of long and has a lot of philosophical stuff at the front, but it does eventually get into flattenJSON.
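To illustrate the sub-slice-versus-copy tradeoff, here is a simplified sketch (not the actual flattenJSON code; the function name and indices are made up, and the de-escaper handles only two escapes for brevity):

```go
package main

import (
	"bytes"
	"fmt"
)

// fieldValue returns the bytes of a JSON string value whose contents sit at
// event[start:end] (exclusive of the surrounding quotes). If the value has no
// backslash escapes, a sub-slice of the event is returned: no copying, just a
// new slice header. Otherwise we must allocate a buffer and de-escape into it
// so later byte-level matching sees pure UTF-8.
func fieldValue(event []byte, start, end int) []byte {
	raw := event[start:end]
	if !bytes.ContainsRune(raw, '\\') {
		return raw // zero-copy sub-slice of the event
	}
	// Simplified de-escape: handles only \" and \\ for illustration.
	out := make([]byte, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if raw[i] == '\\' && i+1 < len(raw) {
			i++ // drop the backslash, keep the escaped byte
		}
		out = append(out, raw[i])
	}
	return out
}

func main() {
	event := []byte(`{"a":"plain","b":"say \"hi\""}`)
	fmt.Printf("%s\n", fieldValue(event, 6, 11))  // prints: plain (sub-slice)
	fmt.Printf("%s\n", fieldValue(event, 18, 28)) // prints: say "hi" (allocated copy)
}
```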
My problem is I don't know how to bypass a string in JSON without looking at each byte to find the terminating quote.

Now I'm super curious how jx manages to allocate so little and go so fast. Thanks for this work, it's valuable.
This makes sense. I think we should concentrate on not doing this de-escaping unless the field is required, which is the second point.
I read this article (and the other Quamina articles) and it was an interesting read! I was thinking about the
You can't skip a JSON string (or any other value) without scanning each byte; the idea is not to perform any

The whole skipping logic is in this code: https://github.com/go-faster/jx/blob/main/dec_skip.go - more specifically, for strings they have

I think we need the same kind of methods in our flattener as well, so we have two modes (and we already have
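A rough sketch of the kind of skip method I mean, in the spirit of jx's dec_skip.go (illustrative, not jx's or Quamina's actual code): it still looks at every byte, but it never allocates and never de-escapes.

```go
package main

import (
	"errors"
	"fmt"
)

// skipString assumes event[i] is the opening '"' of a JSON string and returns
// the index just past the closing '"'. A backslash simply means "the next
// byte cannot be the terminator", which also safely covers \uXXXX since the
// four hex digits can never be '"'.
func skipString(event []byte, i int) (int, error) {
	i++ // step past the opening quote
	for i < len(event) {
		switch event[i] {
		case '\\':
			i += 2 // skip the backslash and the escaped byte
		case '"':
			return i + 1, nil
		default:
			i++
		}
	}
	return 0, errors.New("unterminated string")
}

func main() {
	event := []byte(`{"a":"sk\"ip me","b":1}`)
	end, err := skipString(event, 5) // skip the value of "a"
	if err != nil {
		panic(err)
	}
	fmt.Println(end) // index of the ',' after the skipped value
}
```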
What I am thinking right now is more a question for you, Tim: do we want to invest more time in the existing flattener, trying to mimic the Jx logic with better skipping and more efficient traversal, or should we just use Jx? I understand there is a dependency question here. If the dependency (and its dependencies) are the issue, maybe we can find a Jx replacement that doesn't have dependencies? If you say you want to invest in the existing flattener and not use Jx, I'll totally accept it and will invest my time (I have a good vacation coming up soon ;) ) in improving the existing flattener.
That's a really good idea, look at
After that, if jx is still a lot faster, we should face the question of whether to take a dependency on jx or maybe just steal some of its code (assuming its license allows that). I really do think that Quamina will have a better future without dependencies. |
Sounds acceptable! I'll do a POC with the existing
Hey! So I implemented quick-to-code skipping of string values (yosiat@bd0c943) -
And we are way better in terms of allocations! I am concentrating on
Now I am pretty optimistic that if:
We will be way faster and we won't need any external dependencies.

{
"context": {
"name": ["yosi"]
},
"payload": {
"name": ["attias"]
}
}

Our
What do you think, @timbray? Would love to hear your ideas here!
Excellent! Though I will always be interested in CityLots as well - but you already knew that. I'll have a close look at
I'm glad to hear that, but not surprised. The existing flatten code is heavily profiled and I would be surprised if the alternative was much faster.
Ah, this is very interesting. You notice that

At the time, our conclusion was that "segments appearing in multiple paths" were rare enough that we get better performance on average with the current approach. Now, it's possible that our range of test data was weird, because it was mostly AWS control-plane stuff, and if you look at that you find very few places where you have a replicated segment as part of a field that you would want to match on.

By the way, congratulations: you have now become one of the few people in the world who is an expert on flattening hierarchical data structures.
Hmm, now I'm thinking about this. When the events are large, the benefit of being able to skip large parts of them is very important, and might outweigh the benefit of segment caching. An idea that I'm going to save here to think about, not sure if it's good: how about special treatment for top-level fields? I.e., if the Pattern is like

{ "context": { "name": [ "yogi" ] } }

then we could remember that the only top-level field worth looking at is context, and cheaply skip all the others?
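A sketch of what remembering the top-level fields could look like (hypothetical names; patterns shown as plain maps purely for illustration):

```go
package main

import "fmt"

// topLevelWanted is the set of top-level member names that any pattern
// mentions. For a pattern like {"context": {"name": ["yogi"]}} the set is
// just {"context"}, so every other top-level member of an event can be
// skipped wholesale without flattening it.
type topLevelWanted map[string]bool

// wantedTopLevel collects the top-level keys from already-parsed patterns,
// represented here as map[string]any for simplicity.
func wantedTopLevel(patterns []map[string]any) topLevelWanted {
	w := topLevelWanted{}
	for _, p := range patterns {
		for k := range p {
			w[k] = true
		}
	}
	return w
}

func main() {
	pattern := map[string]any{"context": map[string]any{"name": []any{"yogi"}}}
	w := wantedTopLevel([]map[string]any{pattern})
	for _, field := range []string{"context", "payload", "metadata"} {
		if w[field] {
			fmt.Println("flatten:", field)
		} else {
			fmt.Println("skip:", field)
		}
	}
}
```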
I think that we won't see much difference in CityLots, because we compared against Jx and it wasn't faster there, since that benchmark does pattern matching against the whole payload. Once we get closer to a PR, I'll run CityLots to make sure I didn't introduce any regression in that case as well.
I think that I actually solved that before with Jx and the "Paths Index"; in this code I never constructed the full path at flattening time, since I traversed the JSON graph together with my "Paths Index" graph. The open question about my "Paths Index" is memory usage with a large number of patterns (that query different paths - the worst case - but this needs to be checked).
Thanks a lot :) I am doing this work because I think Quamina can replace something I purpose-built, and our payloads have the same structure as the benchmarks:
So, thinking about my (near-future) use case for Quamina, I'll have two cases:
So I'll need both cases to be fast.
OK, sounds like you have a clear path forward. Looking forward to seeing a PR. One remark (maybe I made it before?): seems reasonable to add some sort of config API to Flattener, telling it explicitly what can be skipped.
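Such a config API could be sketched like this (a purely hypothetical shape, not an actual Quamina interface; the functional-option style is just one plausible choice for a Go library):

```go
package main

import "fmt"

// FlattenerConfig is a hypothetical configuration telling a flattener which
// paths patterns can match on, so that everything else is skippable.
type FlattenerConfig struct {
	// WantedPaths are dotted paths such as "context.name"; an empty slice
	// means "flatten everything" (today's behavior).
	WantedPaths []string
}

// Option is the usual functional-option shape for configuring a flattener.
type Option func(*FlattenerConfig)

// WithWantedPaths tells the flattener it may skip any field not on these paths.
func WithWantedPaths(paths ...string) Option {
	return func(c *FlattenerConfig) {
		c.WantedPaths = append(c.WantedPaths, paths...)
	}
}

// NewConfig applies the options to a fresh config.
func NewConfig(opts ...Option) *FlattenerConfig {
	c := &FlattenerConfig{}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	cfg := NewConfig(WithWantedPaths("context.name", "context.id"))
	fmt.Println(cfg.WantedPaths)
}
```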
So I think I will make two PRs, to make it easier to review:
About the first PR, |
We experimented with various directory layouts and, on the recommendation of @embano1, ended up with the current flat structure, which is thought to be OK for libraries. There is the factor that Flattener is an interface and it is quite likely that in the future we will have multiple flatteners for different media types (Avro, protobufs, etc.), so maybe that's a reason? Micha, got an opinion? Personally I don't dislike big files too much, but I'm not religious on the subject.
Oh, a question: What's the value of a separate implementation for readNumber and readLiteral and so on? They never do allocation, I think? |
Good comments. In general, I would not break up/design (package) structure by size but rather by behavior. When we started there was only the JSON flattener and it worked well here. I guess, now that we're thinking about different implementations and a future-proof package layout, it makes sense to break up the flat hierarchy a bit, e.g. introducing a

We could do a refactor PR and see if the following is intuitive for users and does not introduce circular deps:

|- <quamina root>
   |- flatten
      |- json
      |- jsonskip
      |- protobuf

Not sure if it makes sense to group the
Went over all the value reading (together with arrays & objects), and the only skipping we need is for strings, so I'll push a PR only for that.

Regarding the multiple files/folders, I'll retract that for now, since I am not sure there is a real need for it, and I agree with @embano1's suggestion to have separate directories in the future, when we have multiple flatteners.
Made some interesting progress! Implemented functionality that allows us to skip blocks (arrays & objects); it has no effect on allocations, but it has a dramatic effect on runtime.
I'll work on making this into a pull request (tests, coverage, refactors) and will post it soon (today/tomorrow). Update: posted a PR: #119
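The block-skipping idea boils down to depth counting while stepping over strings; a simplified sketch (not the code from the PR, names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// skipValue assumes event[i] is the opening '{' or '[' of an object or array
// and returns the index just past its closing brace/bracket. It tracks
// nesting depth and steps over whole strings so that braces inside string
// values don't confuse the count. No allocation, no de-escaping.
func skipValue(event []byte, i int) (int, error) {
	depth := 0
	for i < len(event) {
		switch event[i] {
		case '{', '[':
			depth++
			i++
		case '}', ']':
			depth--
			i++
			if depth == 0 {
				return i, nil
			}
		case '"':
			// step over the whole string, honoring backslash escapes
			i++
			for i < len(event) && event[i] != '"' {
				if event[i] == '\\' {
					i++
				}
				i++
			}
			i++ // past the closing quote
		default:
			i++
		}
	}
	return 0, errors.New("unterminated value")
}

func main() {
	event := []byte(`{"payload":{"a":["}", {"b":1}]},"x":2}`)
	end, err := skipValue(event, 11) // skip the value of "payload"
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", event[end:]) // the rest of the event after the skip
}
```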
Improved in c117dab
Continuing the discussion from this issue: #108
Currently the flattener is fast and performs well in cases where the searched fields (from the patterns) cover most of the event payload; in cases where payloads are large and the searched fields are only a small subset of them, the flattener can be improved.
Proposed Solution and POC results
My proposed solution consists of maintaining an index of the paths in use (as a tree data structure) and a POC flattener written using Jx.
The source resides here - https://github.com/yosiat/quamina/tree/flat-hack - inside "paths.go" and "jx.go"; it passes most of the existing flattener tests, except for the error handling.
Benchmark results:
source: #108 (comment)
Those benchmarks can't be shared currently, since they use internal data, but they can be adjusted and shared if needed.
BigShellStyle benchmark (in benchmarks_test.go) -

And in the CityLots benchmarks, where we are pattern matching against the whole event data, we see there is no difference -
Open Questions & Notes