
Object Insertion Rework #4830

Conversation

@philipaconrad (Contributor) commented Jun 29, 2022

Planned Changes

This PR changes the sorting behavior of the ast.Object type from sorting on each insertion to sorting only when a canonical key ordering is required and the keys are not already sorted. This dramatically changes the asymptotics of several ast.Object operations, and requires that all future users of obj.keys use obj.keysSorted() instead.

Externally (at the language level) no key ordering is guaranteed, but internally we need a canonical key ordering, which sorted keys provide.

Historical Context: Originally, an object's keys were assumed to be unsorted, and required sorting at each use site that needed a canonical key order. In the changes of #3823, we enforced that keys are always in sorted order by sorting them at insertion time; this reduced implementation complexity and improved performance across the compiler, while correcting a few subtle bugs that were affecting downstream users, like Gatekeeper.

Proposed Behavior: The key difference between this PR's behavior and prior behaviors for ast.Object is that keys are still expected to be in sorted order by default, but may temporarily fall out of sorted order. We lazily restore the sorting at usage sites where an ast.Object might produce keys for external consumption. This should keep the correctness and implementation-complexity benefits from #3823, while allowing improved asymptotic behavior overall, especially for insertions.

This PR will swap out the current implementation of ast.Object to use a linked list or another data structure for storing keys. This is expected to dramatically improve insertion performance as objects are constructed, since it cuts out the realloc overheads of the current slice-based implementation.

To preserve ordering requirements, I intend to implement insertion sorting during Object.Insert operations, since this can be done online at little additional cost over simply appending to the linked list. (See the comment thread below for how this naive approach didn't work out too well. 😅)
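For illustration, the online insertion idea amounts to roughly the following sketch (slice-based here for clarity; insertSorted is a hypothetical helper, while Compare and *Term are the ast package's own, and the standard library's sort package is assumed). The O(N) tail shift per insert is what ultimately sank this approach:

func insertSorted(keys []*Term, k *Term) []*Term {
	// Binary search for the insertion slot: O(log N).
	i := sort.Search(len(keys), func(j int) bool {
		return Compare(keys[j], k) >= 0
	})
	// Shift the tail right to make room: O(N) in the worst case.
	keys = append(keys, nil)
	copy(keys[i+1:], keys[i:])
	keys[i] = k
	return keys
}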

Task List

  • Find or build a performance benchmark. (BenchmarkObjectConstruction is perfect for the job, and already exists!)
  • Remove all Object.Elem uses in the OPA codebase to see what breaks.
    • Patch evaluator's Object biunification to not need Object.Elem anymore.
      • ❌ Naive implementation with Object.Iter didn't work out; we may need to construct some kind of iterator for keys to keep the original continuation-passing callback style.
      • ✔️ The ObjectKeysIterator approach does seem to work!
  • Rework key storage for Objects.
    • Change implementation and propagate changes to all relevant sites in term.go
    • InsertionSort implementation in Object.Insert, or underlying helper functions.
      • ❌ This wound up being a terrible idea, since insertion sort costs O(N) per insert in the worst case. (Reverted these changes.)
    • ✔️ Made key sorting lazy.
      • Object construction benchmark now shows nearly linear scaling for Object.Insert
  • Performance/memory evaluation.
    • New representative benchmark tests
      • Ensures that we see the true costs around lazy key sorting, and can see improvements over time.
      • Existing BenchmarkObjectConstruction may need minor revisions, but covers the "all inserts, no reads" case nicely.
      • ✔️ BenchmarkObjectStringInterfaces benchmark. (Covers the "many Insert() + Get()" case)
      • ✔️ BenchmarkObjectCreationAndLookup benchmark. (Forces key sorting overheads.)
  • Document the key sortedness change for library users of OPA?

Fixes #4625.

@philipaconrad (Contributor, Author) commented Jun 30, 2022

I'm thinking it may make sense to introduce a new ObjectKeysIterator interface in ast/term.go, which would look roughly like the following:

type ObjectKeysIterator interface {
	Next() (*Term, bool)
	Value() *Term
	Reset()
}

This would still allow incremental iteration across the keys of an object in a predictable order (assuming the implementation allows for that!), and it weakens the contract on key storage from "must be indexable with an integer" to "the Object must provide an iterator over its keys".

Since these are going to be stateful iterators, the concrete types can (and should!) keep state information internally, and NOT expose that to the callers. For example, under the current implementation, the state might be an integer index, but in the future, the state might be a pointer to the next linked list element. 😄
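As a rough sketch (hypothetical type, not a final implementation), a slice-backed iterator satisfying this interface might look like the following, with the integer index kept private exactly as described above:

type sliceKeysIterator struct {
	keys []*Term
	pos  int
}

func (it *sliceKeysIterator) Next() (*Term, bool) {
	if it.pos >= len(it.keys) {
		return nil, false
	}
	k := it.keys[it.pos]
	it.pos++
	return k, true
}

// Value returns the current key; only valid after a successful Next().
func (it *sliceKeysIterator) Value() *Term {
	return it.keys[it.pos-1]
}

func (it *sliceKeysIterator) Reset() {
	it.pos = 0
}

Callers would then iterate with the usual pattern: for k, ok := it.Next(); ok; k, ok = it.Next() { ... }.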

My plan with this new type would be to revert my changes using Object.Iter (which are currently causing the continuation-passing-style callbacks to explode in the evaluator), and instead plumb in iterators, which can be passed down the callback chain.

@srenatus (Contributor) commented:

So I was just beginning to think about what we'd need to do to keep the evaluator intact without Elem(). A few thoughts in, I've come to realise that it's exactly what you've proposed there with the ObjectKeysIterator 👍

@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from 31d20bd to a48ad07 Compare June 30, 2022 15:36
@philipaconrad (Contributor, Author) commented Jun 30, 2022

So, after some thought, I've realized I have two implementation paths I can go down for the linked-list data structure changes, each with tradeoffs. @tsandall, @srenatus, if you two have any guidance here, let me know. I'm leaning towards Option 2, as it keeps memory usage net-neutral, is only a little more work, and hopefully won't hurt performance for most users.

Option 1 - Naive linked-list

This approach replaces the slice of *objectElem pointers for keys with a linked-list of *objectElem pointers. The only affected sites will be the ones that were using slice index lookups, which will have to be rewritten to walk the linked list instead.

Pro:

  • Simple to implement.
  • Minimal changes to existing logic.

Con:

  • Will burn an extra pointer's worth of storage for every key in the Object. (On a 64-bit machine, this is N * 8 bytes, where N is the number of keys.)
    • Note: on a 1M-key document, this adds ~7.6 MiB of overhead (1,000,000 × 8 B ≈ 7.63 MiB), just for linked-list pointers.

Option 2 - Linked-list for keys, optimized objectElem structure for values

This approach will split the storage of keys and values. Keys will be stored hanging off the new linked list of keys, and values will be stored exactly as before in the values map. The key pointer field can then be removed from the objectElem struct, as long as all references to keys are rewritten to pull from the new key data structure (the affected sites are listed under "Con" below; illustrative struct layouts follow the pro/con lists).

Pro:

  • Net-neutral memory consumption. (Same total number of pointers involved, just in different spots.)

Neutral:

  • Separation of key storage from value storage.

Con:

  • Slightly more complex to implement.
  • Will affect Compare, MarshalJSON, and String by changing O(1) key lookups into O(1 + c) lookups, where c is the cost of hashing each key to look up its value in the hash table.
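For illustration only (field names hypothetical; the real objectElem in term.go differs), the two layouts might look like:

// Option 1: thread a next pointer through the existing element struct.
type objectElemV1 struct {
	key   *Term
	value *Term
	next  *objectElemV1 // extra 8 bytes per key on 64-bit
}

// Option 2: a separate key list; the element keeps only the value,
// and keys reach their values via a hash-table lookup.
type keyNode struct {
	key  *Term
	next *keyNode
}

type objectElemV2 struct {
	value *Term // key pointer removed
}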

@philipaconrad (Contributor, Author) commented Jul 1, 2022

Well, I tried an initial stab at Option 1, the "naive linked-list" approach, and it... did more harm than good. 😬 It makes sense now that I think about it. 🤔 Locally, these are the results I'm getting (compare with the earlier results, which took ~15s total):

$ go test -benchmem -run=^$ -bench ^BenchmarkObjectConstruction$ github.com/open-policy-agent/opa/ast
goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectConstruction/shuffled_keys/5-8       	  602432	      2026 ns/op	     840 B/op	      33 allocs/op
BenchmarkObjectConstruction/shuffled_keys/50-8      	   30066	     40295 ns/op	    8590 B/op	     308 allocs/op
BenchmarkObjectConstruction/shuffled_keys/500-8     	     490	   2430653 ns/op	  105402 B/op	    3823 allocs/op
BenchmarkObjectConstruction/shuffled_keys/5000-8    	       4	 289052964 ns/op	  981300 B/op	   39915 allocs/op
BenchmarkObjectConstruction/shuffled_keys/50000-8   	       1	97118023271 ns/op	 9311032 B/op	  401590 allocs/op
BenchmarkObjectConstruction/increasing_keys/5-8     	  477085	      2407 ns/op	     840 B/op	      33 allocs/op
BenchmarkObjectConstruction/increasing_keys/50-8    	   18134	     71795 ns/op	    8592 B/op	     308 allocs/op
BenchmarkObjectConstruction/increasing_keys/500-8   	     249	   4671049 ns/op	  105344 B/op	    3822 allocs/op
BenchmarkObjectConstruction/increasing_keys/5000-8  	       3	 476289577 ns/op	  978408 B/op	   39903 allocs/op
BenchmarkObjectConstruction/increasing_keys/50000-8 	       1	51654662472 ns/op	 9309064 B/op	  401714 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	164.661s

My intuition suggests that insertion sort, along with linear iteration across the keys in several spots, is likely causing a good chunk of the slowdown, and a tree-based data structure might be worth trying instead, since that would greatly improve lookup times where it matters.

I'm going to do some more digging to see exactly what the culprit is before investing a lot of dev time into anything though.

@philipaconrad (Contributor, Author) commented:

After some more thought (and a lot of research into tree data structures), a possibility that occurred to me while explaining this issue to @charlesdaniels was: instead of making exotic data structure changes, what if we changed when we do the sorting of the keys?

Option 3 - Lazy sorting for keys

This approach will modify how and when keys are sorted for Objects, without switching out the slice-based key storage data structure. Each Insert directly appends the new key to the end of the keys slice and increments a counter stored in the object struct. Once this counter reaches a certain threshold, or an operation requiring the keys in sorted order occurs, we sort the keys slice in place using an off-the-shelf sorting algorithm, such as mergesort or quicksort. (A minimal sketch follows the pro/con lists below.)

Pro:

  • This changes inserts from O(log N) + O(c * N) cost (where c is the probability of a realloc event) to amortized O(1).
  • Sorting is only done when needed, instead of after every insertion.

Con:

  • First access of the keys after any modifications will cost O(N log N) to O(N^2).
    • Successive accesses will be O(1), as before.
  • Memory cost of sorting could range from O(1) up to O(N), depending on the sorting algorithm chosen.
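A minimal sketch of the idea (hypothetical names, omitting the threshold counter for brevity; Compare and *Term are the ast package's own, and the standard library's sort package is assumed):

type lazyKeys struct {
	keys   []*Term
	sorted bool
}

func (l *lazyKeys) insert(k *Term) {
	l.keys = append(l.keys, k) // amortized O(1); no per-insert sorting
	l.sorted = false
}

func (l *lazyKeys) sortedKeys() []*Term {
	if !l.sorted {
		// Pay the O(N log N) sort only when canonical order is needed.
		sort.Slice(l.keys, func(i, j int) bool {
			return Compare(l.keys[i], l.keys[j]) < 0
		})
		l.sorted = true
	}
	return l.keys
}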

History Note

Relevant PRs:

It looks like this approach would directly revert historical changes around when/where key sorting happens, which is not ideal, and it suggests I'll need to do a good bit of investigating before fully implementing this approach. 🤔

@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from 5bd4249 to 17ee2fb Compare July 3, 2022 20:37
@philipaconrad (Contributor, Author) commented:

I've investigated making an Object's key sorting lazier over the weekend, and here are the benchmark results for Object construction:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkObjectConstruction$ github.com/open-policy-agent/opa/ast

goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectConstruction/shuffled_keys/5-8       	  676078	      1947 ns/op	     896 B/op	      32 allocs/op
BenchmarkObjectConstruction/shuffled_keys/50-8      	   72123	     16896 ns/op	    8821 B/op	     265 allocs/op
BenchmarkObjectConstruction/shuffled_keys/500-8     	    4444	    348357 ns/op	  105513 B/op	    3332 allocs/op
BenchmarkObjectConstruction/shuffled_keys/5000-8    	     402	   2597975 ns/op	 1029380 B/op	   34930 allocs/op
BenchmarkObjectConstruction/shuffled_keys/50000-8   	      46	  26187178 ns/op	10479400 B/op	  351705 allocs/op
BenchmarkObjectConstruction/increasing_keys/5-8     	  601208	      1667 ns/op	     896 B/op	      32 allocs/op
BenchmarkObjectConstruction/increasing_keys/50-8    	   63993	     17383 ns/op	    8822 B/op	     265 allocs/op
BenchmarkObjectConstruction/increasing_keys/500-8   	    6051	    201965 ns/op	  105536 B/op	    3332 allocs/op
BenchmarkObjectConstruction/increasing_keys/5000-8  	     590	   2018079 ns/op	 1028313 B/op	   34930 allocs/op
BenchmarkObjectConstruction/increasing_keys/50000-8 	      46	  26067284 ns/op	10459946 B/op	  351709 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	14.005s

The main advantage of doing the key sorting lazily is that it provides a massive asymptotic win for insertions, while for most other situations it does not make the asymptotics dramatically worse (I have a table of asymptotics I worked out for the basic operations and common amortized use cases, available on request). The first "read" of the keys after a sequence of insertions will be expensive (O(N log N) to O(N^2)), but all successive reads will be O(1).

This change also opens the door to more performance tunables in the future, such as using an adaptive sorting algorithm (e.g., Timsort) that could make the occasional re-sorting of keys much faster and more memory-efficient than Go's default sort (an introsort-style quicksort under the hood).

@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from 17ee2fb to 574f623 Compare July 5, 2022 16:43
@philipaconrad philipaconrad requested a review from srenatus July 5, 2022 16:51
@philipaconrad philipaconrad marked this pull request as ready for review July 5, 2022 22:27
@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from 574f623 to ef2ed51 Compare July 6, 2022 17:21
@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch 2 times, most recently from 82cb44c to 463dc22 Compare July 7, 2022 21:23
@philipaconrad (Contributor, Author) commented:

New Benchmark Results!

String/MarshalJSON benchmark

The new BenchmarkObjectStringInterfaces benchmark test is constructed to measure end-caller costs of stringifying a freshly constructed, never-sorted ast.Object.
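Roughly the shape of such a benchmark (illustrative sketch only, not the exact test added in this PR):

package ast_test

import (
	"testing"

	"github.com/open-policy-agent/opa/ast"
)

func BenchmarkObjectStringSketch(b *testing.B) {
	for n := 0; n < b.N; n++ {
		b.StopTimer()
		obj := ast.NewObject() // fresh, never-sorted object each iteration
		for i := 0; i < 500; i++ {
			obj.Insert(ast.IntNumberTerm(i), ast.IntNumberTerm(i))
		}
		b.StartTimer()
		_ = obj.String() // the first read pays any lazy key-sorting cost
	}
}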

tl;dr: This benchmark shows a 40-60% degradation in String() performance on the PR branch for Objects under 50,000 keys. At larger sizes, other overheads appear to dominate.

main branch:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkObjectStringInterfaces$ github.com/open-policy-agent/opa/ast

goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectStringInterfaces/5/String()-8         	 2017899	       536.1 ns/op	     208 B/op	      19 allocs/op
BenchmarkObjectStringInterfaces/5/json.Marshal-8     	   66445	     17148 ns/op	    9351 B/op	     154 allocs/op
BenchmarkObjectStringInterfaces/50/String()-8        	  240212	      4604 ns/op	    1824 B/op	     157 allocs/op
BenchmarkObjectStringInterfaces/50/json.Marshal-8    	    7279	    167472 ns/op	   93431 B/op	    1504 allocs/op
BenchmarkObjectStringInterfaces/500/String()-8       	   22614	     58250 ns/op	   35712 B/op	    1514 allocs/op
BenchmarkObjectStringInterfaces/500/json.Marshal-8   	     727	   1670425 ns/op	  938362 B/op	   15007 allocs/op
BenchmarkObjectStringInterfaces/5000/String()-8      	    1869	    659141 ns/op	  369045 B/op	   11022 allocs/op
BenchmarkObjectStringInterfaces/5000/json.Marshal-8  	      63	  19498918 ns/op	 9758609 B/op	  150071 allocs/op
BenchmarkObjectStringInterfaces/50000/String()-8     	     100	  13171458 ns/op	 4930664 B/op	  101032 allocs/op
BenchmarkObjectStringInterfaces/50000/json.Marshal-8 	       6	 197457908 ns/op	100538726 B/op	 1500113 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	17.483s

This PR:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkObjectString$ github.com/open-policy-agent/opa/ast

goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectString/5/String()-8         	 1654652	       845.0 ns/op	     232 B/op	      20 allocs/op
BenchmarkObjectString/5/json.Marshal-8     	   64318	     19067 ns/op	    9351 B/op	     154 allocs/op
BenchmarkObjectString/50/String()-8        	  139419	      8393 ns/op	    1848 B/op	     158 allocs/op
BenchmarkObjectString/50/json.Marshal-8    	    6308	    170868 ns/op	   93427 B/op	    1504 allocs/op
BenchmarkObjectString/500/String()-8       	    8152	    132313 ns/op	   35736 B/op	    1515 allocs/op
BenchmarkObjectString/500/json.Marshal-8   	     735	   1763056 ns/op	  939259 B/op	   15007 allocs/op
BenchmarkObjectString/5000/String()-8      	     591	   1921544 ns/op	  369068 B/op	   11023 allocs/op
BenchmarkObjectString/5000/json.Marshal-8  	      60	  19344535 ns/op	 9815460 B/op	  150075 allocs/op
BenchmarkObjectString/50000/String()-8     	      30	  37746086 ns/op	 4930686 B/op	  101033 allocs/op
BenchmarkObjectString/50000/json.Marshal-8 	       6	 197132471 ns/op	104035845 B/op	 1500136 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	14.362s

Insert/Get benchmark

The BenchmarkObjectCreationAndLookup benchmark constructs progressively larger ast.Objects incrementally using Insert(), and then accesses all of the keys sequentially using Get().

tl;dr: This benchmark shows about a 1-5% performance difference for the PR branch at smaller Object sizes (<5,000 keys), and around a 10% degradation at 5,000 keys (I suspect this is still mostly within measurement error at these key counts). However! Once we get into the 100k+ key range, things get very interesting.

I had to manually tweak the benchmark size for the main branch, because insertion times blow up superlinearly in the 120k-140k key range. 😦 For the sake of demonstration, the 138k-key case is shown below for main. At 140k keys, the benchmark test runs only once in the default time allotted by go test on main. 😬

The PR branch, however, continues to scale almost linearly up through 500k keys. 😃

main branch (blows up near 138k keys):

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkObjectConstructionAndLookup$ github.com/open-policy-agent/opa/ast

goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectConstructionAndLookup/5-8 	53736560	        22.33 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectConstructionAndLookup/50-8         	41868780	        32.24 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectConstructionAndLookup/500-8        	41441917	        29.44 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectConstructionAndLookup/5000-8       	44001571	        27.80 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectConstructionAndLookup/50000-8      	30690480	        34.23 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectConstructionAndLookup/138000-8     	    6853	    176935 ns/op	    4890 B/op	     161 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	45.395s

This PR (500k keys):

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkObjectCreationAndLookup$ github.com/open-policy-agent/opa/ast

goos: linux
goarch: amd64
pkg: github.com/open-policy-agent/opa/ast
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkObjectCreationAndLookup/5-8 	50913422	        23.40 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectCreationAndLookup/50-8         	41369619	        29.27 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectCreationAndLookup/500-8        	40482777	        34.52 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectCreationAndLookup/5000-8       	39638308	        32.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectCreationAndLookup/50000-8      	30604346	        35.46 ns/op	       0 B/op	       0 allocs/op
BenchmarkObjectCreationAndLookup/500000-8     	22416054	        44.85 ns/op	       5 B/op	       0 allocs/op
PASS
ok  	github.com/open-policy-agent/opa/ast	14.265s

Comments:

  • Profiling confirms that the main branch blows up because the 138k+ key range is where full slice realloc/memmove operations start happening on every insert.
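Rough arithmetic under the 8-byte-pointer assumption from earlier: a memmove over ~138,000 key pointers touches on the order of 138,000 × 8 B ≈ 1.1 MB per insert, so paying that on every insert makes the total bytes moved scale roughly quadratically with key count.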

@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from 463dc22 to cbefe72 Compare July 7, 2022 22:23
@philipaconrad philipaconrad requested a review from srenatus July 7, 2022 22:26
@philipaconrad (Contributor, Author) commented:

@srenatus When you're satisfied with the new benchmark tests and other code aspects you commented on, I think we're ready for merge! 😄

@srenatus (Contributor) previously approved these changes Jul 8, 2022, leaving a comment:

Final round of nitpicks -- LGTM otherwise. Feel free to merge (ideally with some of the nitpicks addressed 😉)

@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from cbefe72 to daa2f65 Compare July 8, 2022 14:37
This commit delays the sorting of keys until just before use. This
is a net win on asymptotics as Objects get larger, even with quicksort
as the sorting algorithm.

This commit also adjusts the evaluator to use the new ObjectKeysIterator
interface, instead of the raw keys array.

Fixes open-policy-agent#4625.

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
@philipaconrad philipaconrad force-pushed the issue-4625-object-insertion-performance branch from daa2f65 to 092150b Compare July 8, 2022 14:37
@srenatus (Contributor) left a comment:

Thanks!

@philipaconrad philipaconrad merged commit d2ca577 into open-policy-agent:main Jul 8, 2022
philipaconrad added a commit that referenced this pull request Aug 15, 2022
This commit introduces lazy key slice sorting for the Set type, similar to what was done for Object types in #4830. After this change, sorting of the Set type's key slice is delayed until just before use, identically to how lazy key slice sorting is done for the Object type.

This moves the sorting overhead from construction time for Sets over to evaluation time, allowing much more efficient construction and use of enormous (500k+ item) Sets. This appears to be a performance-neutral change overall, while dramatically improving performance for the "large set" edge case.

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
@philipaconrad philipaconrad deleted the issue-4625-object-insertion-performance branch September 14, 2022 20:26