Object Insertion Rework #4830
Conversation
I'm thinking it may make sense to introduce a new type:

```go
type ObjectKeysIterator interface {
	Next() (*Term, bool)
	Value() *Term
	Reset()
}
```

This would still allow incremental iteration across the keys of an object, in a predictable order (assuming the implementation allows for that!), and weakens/modifies the contract on the implementation of the key storage from "must be indexable with an integer" to "Object must support providing an iterator over its keys". Since these are going to be stateful iterators, the concrete types can (and should!) keep state information internally, and NOT expose it to callers. For example, under the current implementation the state might be an integer index, but in the future the state might be a pointer to the next linked-list element. 😄 My plan with this new type would be to revert my changes using …
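As a concrete illustration, here is a minimal, hypothetical slice-backed implementation of the proposed interface. It uses a stand-in `Term` type rather than OPA's real `ast.Term`, and the `sliceKeysIterator` name is invented for this sketch; the point is that the integer index lives entirely inside the concrete type, exactly as described above:

```go
package main

import "fmt"

// Term is a stand-in for ast.Term from the OPA codebase.
type Term struct{ Value string }

// ObjectKeysIterator is the interface proposed above.
type ObjectKeysIterator interface {
	Next() (*Term, bool)
	Value() *Term
	Reset()
}

// sliceKeysIterator is a hypothetical slice-backed implementation.
// Its state (an integer index) stays internal, invisible to callers;
// a linked-list-backed version could instead hold a node pointer.
type sliceKeysIterator struct {
	keys []*Term
	idx  int
	cur  *Term
}

func (it *sliceKeysIterator) Next() (*Term, bool) {
	if it.idx >= len(it.keys) {
		it.cur = nil
		return nil, false
	}
	it.cur = it.keys[it.idx]
	it.idx++
	return it.cur, true
}

// Value returns the key most recently yielded by Next.
func (it *sliceKeysIterator) Value() *Term { return it.cur }

// Reset rewinds the iterator to the first key.
func (it *sliceKeysIterator) Reset() {
	it.idx = 0
	it.cur = nil
}

func main() {
	it := &sliceKeysIterator{keys: []*Term{{"a"}, {"b"}, {"c"}}}
	for k, ok := it.Next(); ok; k, ok = it.Next() {
		fmt.Println(k.Value) // prints a, b, c in order
	}
}
```

A linked-list or tree-backed object would satisfy the same interface without ever exposing integer indexing to the evaluator.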
So I was just beginning to think about what we'd need to do to keep the evaluator intact without …
(Force-pushed from 31d20bd to a48ad07.)
So, after some thought, I've realized I have two implementation paths I can go down for the linked-list data structure changes, each of which has tradeoffs. @tsandall, @srenatus, if you two have any guidance here, let me know, because I'm leaning towards Option 2, as that one keeps memory usage net-neutral, is only a little more work to do, and hopefully won't kill performance for most users.

Option 1 - Naive linked-list

This approach replaces the slice of …

Pro:

Con:

Option 2 - Linked-list for keys, optimized objectElem structure for values

This approach will split the storage of keys and values. Keys will be stored hanging off the linked-list of keys, and values will be stored exactly as before in the values map. The …

Pro:

Neutral:

Con:
Well, I tried an initial stab at Option 1, the "naive linked-list" approach, and it... did more harm than good. 😬 It makes sense, now that I think about it. 🤔 Locally, these are the results I'm getting (compare with earlier results, which took 15s total):

My intuition suggests that insertion sort, as well as linear iteration across the keys in several spots, is likely causing a good chunk of the slowdown, and a tree-based data structure might be worth trying out instead, since that would greatly improve lookup times where it matters. I'm going to do some more digging to see exactly what the culprit is before investing a lot of dev time into anything, though.
After some more thought (and a lot of research into tree data structures), a possibility that occurred to me while explaining this issue to @charlesdaniels was: instead of making exotic data structure changes... what if we changed when we do the sorting of the keys?

Option 3 - Lazy sorting for keys

This approach will modify how and when keys are sorted for …

Pro:

Con:

History Note

Relevant PRs:

It looks like this approach would directly revert historical changes around when/where key sorting happens, which is not ideal, and suggests that I will need to do a good bit of investigating before actually implementing this approach fully. 🤔
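A minimal sketch of the lazy-sorting idea, using a toy object with string keys rather than the real `ast.Object` (all names here, like `lazyObject` and `sortNeeded`, are illustrative, not OPA's actual fields): inserts only append and set a dirty flag, and sorting is deferred until the keys are actually read.

```go
package main

import (
	"fmt"
	"sort"
)

// lazyObject is a toy model of Option 3 (illustrative only; the real
// type is ast.Object). Keys are appended on Insert and only sorted
// when a canonical ordering is actually needed.
type lazyObject struct {
	keys       []string
	values     map[string]int
	sortNeeded bool
}

func newLazyObject() *lazyObject {
	return &lazyObject{values: map[string]int{}}
}

// Insert is amortized O(1): just an append plus a map write, with the
// dirty flag set. No per-insertion sort.
func (o *lazyObject) Insert(k string, v int) {
	if _, ok := o.values[k]; !ok {
		o.keys = append(o.keys, k)
		o.sortNeeded = true
	}
	o.values[k] = v
}

// keysSorted sorts lazily: the first read after a run of inserts pays
// O(n log n); later reads are O(1) until the next new key arrives.
func (o *lazyObject) keysSorted() []string {
	if o.sortNeeded {
		sort.Strings(o.keys)
		o.sortNeeded = false
	}
	return o.keys
}

func main() {
	o := newLazyObject()
	for _, k := range []string{"c", "a", "b"} {
		o.Insert(k, len(k))
	}
	fmt.Println(o.keysSorted()) // sorted only now, on first read
}
```

This moves the sorting cost from construction time to the first read, which is exactly the tradeoff discussed below.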
(Force-pushed from 5bd4249 to 17ee2fb.)
I've investigated making an …

The main advantage of doing the key sorting lazily is that it will provide a massive win on asymptotics for insertions, and for most other situations will not make the asymptotics dramatically worse (I have a table of asymptotics I worked out for the basic operations and common amortized use cases, available on request). The first "read" of the keys after a sequence of insertions will be expensive (O(n log n) to O(n²)), but all successive reads will be O(1). This change also opens the door to more performance tunables in the future, such as using an adaptive sorting algorithm (e.g. Timsort) that could make the occasional re-sorting of keys much faster and more memory-efficient than the default Go sort (which is just Quicksort under the hood).
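One cheap tunable in this direction is an O(n) sortedness check before re-sorting, so that runs of inserts that happened to arrive in order skip the O(n log n) sort entirely; an adaptive algorithm like Timsort would get the same benefit implicitly. A hedged sketch (the `resortIfNeeded` helper is invented here, not OPA's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// resortIfNeeded sketches the "check before sorting" idea: an O(n)
// sortedness test avoids the O(n log n) sort when the keys are
// already in order. (Illustrative only.)
func resortIfNeeded(keys []string) {
	if sort.StringsAreSorted(keys) {
		return // already canonical: nothing to do
	}
	sort.Strings(keys)
}

func main() {
	keys := []string{"b", "a", "c"}
	resortIfNeeded(keys)
	fmt.Println(keys) // [a b c]
}
```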
(Force-pushed from 17ee2fb to 574f623.)
(Force-pushed from 574f623 to ef2ed51.)
(Force-pushed from 82cb44c to 463dc22.)
New Benchmark Results!

String/MarshalJSON benchmark

The new tl;dr: This benchmark shows a 40-60% degradation in …

This PR:

Insert/Get benchmark

The tl;dr: This benchmark shows about a 1-5% performance difference for the new PR at smaller Object sizes (<5,000 keys), and around a 10% performance degradation for the PR branch at 5,000 keys (I suspect this could still be mostly within measurement error at these key counts). However! Once we get into the 100k+ key range, things get very interesting. I had to manually tweak the benchmark size for the … The PR branch, however, continues to scale almost linearly in performance up through 500k keys. 😃

This PR (500k keys):

Comments:
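For readers who want to reproduce the rough shape of these measurements, here is a self-contained sketch of an insert-heavy construction benchmark built on Go's `testing.Benchmark` helper. It constructs a plain map-plus-key-slice stand-in rather than a real `ast.Object` (the `benchmarkInsert` name is invented for this sketch), so treat the numbers as shape, not ground truth:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkInsert times construction of an n-key object, loosely in
// the spirit of an "all inserts, no reads" construction benchmark.
// It uses a plain map plus key slice, not the real ast.Object.
func benchmarkInsert(n int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			keys := make([]string, 0, n)
			values := make(map[string]int, n)
			for j := 0; j < n; j++ {
				k := fmt.Sprintf("k%d", j)
				keys = append(keys, k)
				values[k] = j
			}
			_ = keys
		}
	})
}

func main() {
	// Scale n upward to probe how construction cost grows with size.
	for _, n := range []int{1_000, 10_000} {
		fmt.Printf("n=%d: %s\n", n, benchmarkInsert(n))
	}
}
```

Running the same harness against both branches at increasing key counts is how the linear-vs-superlinear scaling difference described above would show up.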
(Force-pushed from 463dc22 to cbefe72.)
@srenatus When you're satisfied with the new benchmark tests and other code aspects you commented on, I think we're ready for merge! 😄
Final round of nitpicks -- LGTM otherwise. Feel free to merge (ideally with some of the nitpicks addressed 😉)
(Force-pushed from cbefe72 to daa2f65.)
This commit delays the sorting of keys until just-before-use. This is a net win on asymptotics as Objects get larger, even with Quicksort as the sorting algorithm. This commit also adjusts the evaluator to use the new ObjectKeysIterator interface, instead of the raw keys array. Fixes open-policy-agent#4625. Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
(Force-pushed from daa2f65 to 092150b.)
Thanks!
This commit introduces lazy key slice sorting for the Set type, similar to what was done for Object types in #4830. After this change, sorting of the Set type's key slice will be delayed until just-before-use, identically to how lazy key slice sorting is done for the Object type. This will move the sorting overhead from construction-time for Sets over to evaluation-time, allowing much more efficient construction and use of enormous (500k+ item) Sets. This appears to be a performance-neutral change overall, while dramatically improving performance for the "large set" edge case. Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
Planned Changes

This PR changes the sorting behavior of the `ast.Object` type, from sorting on each insertion to sorting only when a canonical key ordering is required and the keys are not already in sorted order. This dramatically changes the asymptotics of several operations for `ast.Object`, and requires that all future users of `obj.keys` use `obj.keysSorted()` instead. Externally (at the language level) no key ordering is guaranteed, but internally we need a canonical key ordering, which sorted keys provide.

Historical Context: Originally, an object's keys were assumed to be in unsorted order, and required sorting at each use site that needed a canonical key order. In the changes of #3823, we enforced keys always being in sorted order by sorting them at insertion time. This provided implementation-complexity and performance benefits across the compiler, while correcting a few subtle bugs that were affecting downstream users, like Gatekeeper.

Proposed Behavior: The key difference between this PR's behavior and prior behaviors for `ast.Object` is that the keys are expected to be in sorted order by default now, but might temporarily be unsorted. We lazily correct the sorting at usage sites where an `ast.Object` might produce keys for external consumption. This should keep the correctness and implementation-complexity benefits from #3823, while allowing improved asymptotic behavior overall, especially for insertions.

~~This PR will swap out the current implementation of `ast.Object` to use a linked-list, or other data structure, for storing keys. This is expected to dramatically improve insertion performance as objects are constructed, since it will cut out the existing realloc overheads of the current slice-based implementation. To preserve ordering requirements, I intend to implement Insertion Sorting during `Object.Insert` operations, since this can be done online at little additional cost over simply appending to the linked list.~~ (See comment thread below for how this naive approach didn't work out too well. 😅)

Task List:
- [ ] … (`BenchmarkObjectConstruction` is perfect for the job, and already exists!)
- [ ] Check `Object.Elem` uses in the OPA codebase to see what breaks.
- [ ] … `Object.Elem` anymore.
- [ ] (`Object.Iter` didn't work out; may need to construct some kind of iterator for keys to keep the original continuation-passing callback style.) The `ObjectKeysIterator` approach does seem to work!
- [ ] … `term.go` … `Object.Insert`, or underlying helper functions.
- [ ] … `Object.Insert` …
- [ ] `BenchmarkObjectConstruction` may need minor revisions, but covers the "all inserts, no reads" case nicely.
- [ ] `BenchmarkObjectStringInterfaces` benchmark. (Covers the "many Insert() + Get()" case.)
- [ ] `BenchmarkObjectCreationAndLookup` benchmark. (Forces key sorting overheads.)

Fixes #4625.