-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disk-based Storage #4014
Comments
Hi, If you need help for something may be i can help.
If you want i can copy past code (or make a pr) if it could be useful. It's not a big part of the problem, i think. For 2 first point (refactoring) is may be little hard for me. For optimisation stream-to-disk, i have take a look but it's seem's there lot of code involve and it's also little hard. But if i can help you on some task why not. |
Hi @floriangasc - I agree that the first two points are probably the hardest. I think we need to get those sorted out first since they're the highest risk. I'm hoping that we'll work on this in the next 2-3 months. |
Hi @tsandall I will try to use in production mirroring open policy agent with disk store enabled. Here some feed back. I don't know if you already know, but if it could be help:
Some useful link i have found during my research on badger. The disk store ram consumption issue is not coupled to number of data loaded, it's more about a minimum ram need by badger to work (according to data loaded, and data access and data updated) Finally after study opa github issue about this kind of problem, i have choose to develop a plugin, that use opa downloader for download bundle localy. But bundle are not activated. It's just localy present in sidecar. It's activated on demand, according input, for a defined duration (like 5 or 10mn according production feedback). Now opa consume 512 mo de ram in the worst case, policy are evaluated in 10 micro second in worst case. Bundle are keep up-to-date due to long polling. And ram consumption not depend on all user, group and so on present in database, but only those connected at the specific service. So plugin is the most promising solution, I have abandoned to use disk store, because it's cause to ram consumption. If you want any more information, let me know. Thanks for all your work, opa has very good code (easy to read and learn and update) and it's solve a lot of problems in clean way for us. |
@floriangasc sorry for the delayed reply--this got buried in updates over the holidays. Is it possible to share the code you used to stress test the disk store in OPA? I didn't observe the behaviour you're describing when I tested badger almost a year ago but perhaps I missed something. EDIT: Also, I'd just like to say thank you for the detailed writeup of what you observed! |
With this change, the disk backend (badger) becomes available for use with the OPA runtime properly: It can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled, - any data or policies stored with OPA will persist over restarts - per-query metrics related to disk usage are reported - Prometheus metrics per storage operation are exported The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or desaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage by yourself. See also open-policy-agent#4014. Future improvements around bundle loading are planned. Some notes on details: storage/disk: impose same locking regime used with inmem With this setup, we'll ensure: - there is only one open write txn at a time - there are any number of open read txns at a time - writes are blocked when reads are inflight - during a commit (and triggers being run), no read txns can be created This is to ensure the same atomic policy update semantics when using 'disk" as we have with "inmem". We're basically opting out of badger's currency control and transactionality guarantees. This is because we cannot piggy back on that to ensure the atomic update we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferrable since it mirrors inmem. Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, which is the prescribed solution to txns growing too big, also means that reads can see half of the "logical" transaction having been committed, while the rest is still getting processed. Another approach would have been using `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations. We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or the WriteBatch API. storage/disk: remove commit-and-renew-txn code for txn-too-big errors This would break transactional guarantees we care about: while there can be only one write transaction at a time, read transactions may happen while a write txn is underway -- with this commit-and-reset logic, those would read partial data. Now, the error will be returned to the caller. The maximum txn size depends on the size of memtables, and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data. storage/disk: implement noop MakeDir The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before. Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument addition to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step, now it's entangled with the patch calculation. storage/disk: check ctx.Err() in List/Get operations This won't abort reading a single key, but it will abort iterations. storage/disk: support patterns in partitions There is a potential clash here: "*", the path wildcard, is a valid path section. However, it only affects the case when a user would want to have a partition at /foo/*/bar and would really mean "*", and not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /fo/xyz/bar. storage/disk: keep per-txn-type histograms of stats This is done by reading off the metrics on commit, and shovelling their numbers into the prometheus collector. NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that. storage/disk: opt out of badger's conflict detection With only one write transaction in flight at any time, the situation that badger guards against cannot happen: A transaction has written to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
With this change, the disk backend (badger) becomes available for use with the OPA runtime properly: It can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled, - any data or policies stored with OPA will persist over restarts - per-query metrics related to disk usage are reported - Prometheus metrics per storage operation are exported The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or desaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage by yourself. See also #4014. Future improvements around bundle loading are planned. Some notes on details: storage/disk: impose same locking regime used with inmem With this setup, we'll ensure: - there is only one open write txn at a time - there are any number of open read txns at a time - writes are blocked when reads are inflight - during a commit (and triggers being run), no read txns can be created This is to ensure the same atomic policy update semantics when using 'disk" as we have with "inmem". We're basically opting out of badger's currency control and transactionality guarantees. This is because we cannot piggy back on that to ensure the atomic update we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferrable since it mirrors inmem. Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, which is the prescribed solution to txns growing too big, also means that reads can see half of the "logical" transaction having been committed, while the rest is still getting processed. Another approach would have been using `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations. We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or the WriteBatch API. storage/disk: remove commit-and-renew-txn code for txn-too-big errors This would break transactional guarantees we care about: while there can be only one write transaction at a time, read transactions may happen while a write txn is underway -- with this commit-and-reset logic, those would read partial data. Now, the error will be returned to the caller. The maximum txn size depends on the size of memtables, and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data. storage/disk: implement noop MakeDir The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before. Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument addition to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step, now it's entangled with the patch calculation. storage/disk: check ctx.Err() in List/Get operations This won't abort reading a single key, but it will abort iterations. storage/disk: support patterns in partitions There is a potential clash here: "*", the path wildcard, is a valid path section. However, it only affects the case when a user would want to have a partition at /foo/*/bar and would really mean "*", and not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /fo/xyz/bar. storage/disk: keep per-txn-type histograms of stats This is done by reading off the metrics on commit, and shovelling their numbers into the prometheus collector. NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that. storage/disk: opt out of badger's conflict detection With only one write transaction in flight at any time, the situation that badger guards against cannot happen: A transaction has written to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Signed-off-by: Damien Burks <damien@damienjburks.com> removing space from testcase Signed-off-by: Damien Burks <damien@damienjburks.com> website: make local dev and PR preview not build everything (open-policy-agent#4474) With this change, the work done for local development, and the per-PR netlify preview changes: It will no longer include the website stuff for versions other than edge, the current working tree. We thus save us the time, and the flakiness, involved with - checking if github has release binaries for all the versions - checking out their sources - fetching the release binaries to pre-hydrate old versions' live-blocks. The previously-used, documented make target should still be intact. Fixes open-policy-agent#4379 to some extent, I hope. * docs/website: remove "latest" binary from opa versions cache Having a stale binary here -- one called "latest" but not actually being "latest" -- causes issues like this: when building the website content for the (real) latest version, the script would take the old (previous-latest) binary, and fail because that binary didn't know the latest future keywords. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix download link redirect (open-policy-agent#4482) Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail to retrieve the "latest" opa binary. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481) https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/website: fix /docs redirect (open-policy-agent#4483) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix template to produce proper redirects (open-policy-agent#4484) Also redirect /docs -> /docs/edge in preview Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459) Docs for open-policy-agent/opa-envoy-plugin#323 Signed-off-by: Elliot Maincourt <e.maincourt@gmail.com> website/live-blocks: update caniuse/browserlist Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website/live-blocks: don't call the github api to determine release asset urls Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: don't run generate twice (open-policy-agent#4488) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> ast: Fix print call rewriting in else rules The compiler was accidentally checking/rewriting print calls in the body of else rules before the implicit args of the else rule were processed. As a result, the compiler was generating false-positive errors for refs to undeclared args. The root of the problem was the usage of WalkBodies on the top-level rule (which implicitly walks the bodies of else rules under the top-level rule). With this change, the compiler will call WalkBodies on the head and body of each rule rather than the entire rule itself (which includes the else chain). Fixes open-policy-agent#4489 Signed-off-by: Torin Sandall <torinsandall@gmail.com> Update maintainer status, add Will Beason (open-policy-agent#4465) Signed-off-by: Max Smythe <smythe@google.com> Co-authored-by: Will Beason <willbeason@google.com> Make maintainers list alphabetical (open-policy-agent#4491) Signed-off-by: Max Smythe <smythe@google.com> plugins/logs: Fix broken retry logic We were incorrectly resetting the retry counter after every error condition instead of using the incremented value. As a result, retry delay would always be 0s. This meant that if OPA encountered an error while uploading decision logs it would immediately retry instead of doing an exponential backoff. Fixes: open-policy-agent#4486 Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com> logging: mask authorization header value in debug logs (open-policy-agent#4496) Fixes open-policy-agent#4495 Signed-off-by: Anders Eknert <anders@eknert.com> Fixes open-policy-agent#4376 (open-policy-agent#4494) Do note though that this does not change how multiple with statements are grouped. Although I agree with that, it's IMHO a separate feature request, while the spacing issue is a bug. Signed-off-by: Anders Eknert <anders@eknert.com> runtime: improve log output for binary response (open-policy-agent#4498) This change omits the response body when using compression on metrics endpoint and when pprof is enabled. Signed-off-by: Christian Altamirano <christian.altamirano.ayala@gmail.com> build: bump golang: 1.17 -> 1.18 No change to go.mod's `go` stanza, so no changes in code compatibility. However, it's used for building our docker images and release binaries, and for fuzz testing in our nightly workflow. Some test-related changes with the dns lookup built-in function's error handling; and the hardcoded signature. Running go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000 makes me believe that for whatever reason the signature changed, it's at least stable. topdown/http_test: Test-only change to accomodate this change in Go (https://go.dev/doc/go1.18): Certificate.Verify now uses platform APIs to verify certificate validity on macOS and iOS when it is called with a nil VerifyOpts.Roots or when using the root pool returned from SystemCertPool. We're keeping the old message for go <= 1.17; in a silly-simple way. Also: * ci: build and test two old golang version on macos|linux We'll drop golang 1.15, keep one unsupported version (1.16). Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> ci: remove go-fuzz, use native go 1.18 fuzzer Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501) update bob_token and alice_token in markdown (open-policy-agent#4504) Signed-off-by: yongen.pan <yongen.pan@daocloud.io> topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503) What we previously did turned into a race condition with multiple concurrent calls to /v1/compile. With a change introduced with 0.38.0 (the `every` keyword), the `nil` Terms of an `ast.Expr` node was surfaced: previously, it would go unnoticed, but could potentially have yielded bad results. The effect of this change is proven using a new e2e test that would fail on the code we had previous. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> uploading latest changes Signed-off-by: Damien Burks <damien@damienjburks.com> fixing test cases Signed-off-by: Damien Burks <damien@damienjburks.com> sdk: avoid using different state (open-policy-agent#4505) I noticed that when operating on opa.state, locking is usually done to avoid a race, whereas here opa.state is used directly. By comparing the previous changes, I found that open-policy-agent#4240 changed the previous behaviour. This change adjusts that: we ensure that we work on the state as read via the mutex-protected `s := *opa.state`. Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io> build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467) https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0 https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1 https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512) This is a left-over from the previous bump, 0856120. https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513) When running on GHA, we've found this test to often fail on macos-latest: It would not functionally be wrong, but it also wasn't able to finish in the alloted time. Now, the maximum delta has been increased a lot (10ms to 500ms). It's much, but it's still good enough to ensure that the context passed to Stop() is the one that matters for shutdown. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs: Add clarifications for docs describing bundle signing and delta features This commit adds some clarification to the bundle doc regarding the format of bundle signatures and manifest behavior for delta bundles. Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com> runtime: change processed file event log level to info (open-policy-agent#4514) "Processed file watch event." will now be logged at level "info" (was "warn"). Signed-off-by: Anders Eknert <anders@eknert.com> runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502) * runtime/logging: only suppress payloads for handlers that compress responses To get compressed responses, two things need to be true: 1. The client must accept compressed responses 2. The handler must reply with a compressed response For general API requests, (1) holds most of the time. (2) is only true for the metrics endpoint at them moment, since the 3rd party library we use for serving the prometheus endpoint will do compression. * runtime/logging: remove dead code The http.Hijack stuff was related to a watch feature removed in open-policy-agent@186ef99 dropInputParam was only used by its tests. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> storage: code cosmetics Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> storage: allow implementations to override MakeDir Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> runtime+storage: integrate disk storage With this change, the disk backend (badger) becomes available for use with the OPA runtime properly: It can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled, - any data or policies stored with OPA will persist over restarts - per-query metrics related to disk usage are reported - Prometheus metrics per storage operation are exported The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or desaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage by yourself. See also open-policy-agent#4014. Future improvements around bundle loading are planned. Some notes on details: storage/disk: impose same locking regime used with inmem With this setup, we'll ensure: - there is only one open write txn at a time - there are any number of open read txns at a time - writes are blocked when reads are inflight - during a commit (and triggers being run), no read txns can be created This is to ensure the same atomic policy update semantics when using 'disk" as we have with "inmem". We're basically opting out of badger's currency control and transactionality guarantees. This is because we cannot piggy back on that to ensure the atomic update we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferrable since it mirrors inmem. Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, which is the prescribed solution to txns growing too big, also means that reads can see half of the "logical" transaction having been committed, while the rest is still getting processed. Another approach would have been using `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations. We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or the WriteBatch API. storage/disk: remove commit-and-renew-txn code for txn-too-big errors This would break transactional guarantees we care about: while there can be only one write transaction at a time, read transactions may happen while a write txn is underway -- with this commit-and-reset logic, those would read partial data. Now, the error will be returned to the caller. The maximum txn size depends on the size of memtables, and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data. storage/disk: implement noop MakeDir The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before. Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument addition to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step, now it's entangled with the patch calculation. storage/disk: check ctx.Err() in List/Get operations This won't abort reading a single key, but it will abort iterations. storage/disk: support patterns in partitions There is a potential clash here: "*", the path wildcard, is a valid path section. However, it only affects the case when a user would want to have a partition at /foo/*/bar and would really mean "*", and not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /fo/xyz/bar. storage/disk: keep per-txn-type histograms of stats This is done by reading off the metrics on commit, and shovelling their numbers into the prometheus collector. NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that. storage/disk: opt out of badger's conflict detection With only one write transaction in flight at any time, the situation that badger guards against cannot happen: A transaction has written to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/management-bundles: remove quotations with info boxes (open-policy-agent#4517) Small follow-up to open-policy-agent#4515, using info boxes instead of generic markup. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix community site (open-policy-agent#4522) * remove hugo templating This page doesn't go through hugo at all, so these things would end up on the website. A quick fix is reverting the bits that added the templating. The downside is that local dev and PR previews will link to /docs/latest/ecosystem, which doesn't exist there. * link Documentation and Download to /docs For /docs/ (not /docs/latest) we have a redirect in local dev and PR preview mode: /docs -> /docs/edge so this will be a little less broken: only ecosystem is a 404 link on the preview. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> integrations: Add Emissary-Ingress (open-policy-agent#4523) This PR adds the Emissary ingress to the Ecosystem page. The blog section contains link to an article which guides on Emissary ingress and OPA integrations. Signed-off-by: Tayyab J <tayyab.jamadar@styra.com> storage/disk: wildcard partition validation, docs caveat (open-policy-agent#4519) A bunch of smaller follow-up tasks to open-policy-agent#4381. * storage/disk_test: check invalid patches with wildcard partition, too * docs/disk: add caveat re: bundles loaded into memory * storage/disk: auto-manage /system partitions If these are found in the user-provided partitions, we'll error out. * storage/disk: pretty-print partitions with "*" instead of %2A * storage/disk: respect wildcard-replacement in partition validation It is now allowed to replace a partition like /foo/bar by /foo/* also if multiple wildcards are used. Caveats: You cannot add a wildcard partition like /*/*, since it would overlap the managed "/system/*" partition. When attempting to go back from /foo/* to /foo/bar, an error is raised _unconditionally_ -- we could check the existing data, but currently don't. * storage/disk: check prefix when adding wildcard partitions The previously done check would have falsely returned that there is no problem when adding a wildcard partition: lookup of "/foo/*" with '*' not interpreted as a wildcard, but as a string, would yield a not-found, even if there was any data under /foo/. Now, we'll check the prefix-until-wildcard. It's more cautious than theoretically necessary, but safe. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website/style: fix code block bottom margin (open-policy-agent#4526) The text following regular code blocks was too close. Now, the bottom margin is the same was used with live-blocks, 1.5rem (24px). Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs: Add CloudFormation Hook tutorial (open-policy-agent#4525) Obviously missing is deployment of OPA to AWS. This will come in the next iteration. Signed-off-by: Anders Eknert <anders@eknert.com> Prepare v0.39.0 release (open-policy-agent#4524) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> Prepare v0.40.0 development (open-policy-agent#4527) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix styra support link Signed-off-by: Torin Sandall <torinsandall@gmail.com> adopters: Add Wealthsimple (open-policy-agent#4530) Signed-off-by: hannahyeates <hannahgrey06@gmail.com> ci: fix rego check (open-policy-agent#4532) * build/policies: format using 0.39.0 * workflow/pull-request: use edge opa for rego PR checks Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: remove right margin on sidebar (open-policy-agent#4529) * remove right margin on sidebar * replace the removed margin with padding Signed-off-by: orweis <orweis@gmail.com> website: Remove unused variables to avoid error in strict mode(open-policy-agent#4534) To Fix: ``` 2 errors occurred: policy.rego:13: rego_compile_error: assigned var header unused policy.rego:13: rego_compile_error: assigned var signature unused ``` Signed-off-by: panpan0000 <panpan0000@msn.com> website: show yellow banner for old version (open-policy-agent#4533) Hopefully this is gets more attention than the yellow "older" button in the version picker. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> fixing disk diagnostic test case
@srenatus just to double check... the last task on the list above ("investigate whether server integration tests can be run with disk-based storage to show conformance") is complete, right? Perhaps we should close this issue and file a new one to track the optimizations/improvements for bundle loading. |
Yes, a select few have been changed to run with both inmem and disk storage; notably those checking the Agreed, I'll close this one in favour of an issue focusing on bundles. |
Hi. Sorry for the delay, I have forgotten to respond. I have use istio mirroring capability.
ps:
|
Signed-off-by: Damien Burks <damien@damienjburks.com> removing space from testcase Signed-off-by: Damien Burks <damien@damienjburks.com> website: make local dev and PR preview not build everything (open-policy-agent#4474) With this change, the work done for local development, and the per-PR netlify preview changes: It will no longer include the website stuff for versions other than edge, the current working tree. We thus save us the time, and the flakiness, involved with - checking if github has release binaries for all the versions - checking out their sources - fetching the release binaries to pre-hydrate old versions' live-blocks. The previously-used, documented make target should still be intact. Fixes open-policy-agent#4379 to some extent, I hope. * docs/website: remove "latest" binary from opa versions cache Having a stale binary here -- one called "latest" but not actually being "latest" -- causes issues like this: when building the website content for the (real) latest version, the script would take the old (previous-latest) binary, and fail because that binary didn't know the latest future keywords. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix download link redirect (open-policy-agent#4482) Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail to retrieve the "latest" opa binary. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481) https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/website: fix /docs redirect (open-policy-agent#4483) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix template to produce proper redirects (open-policy-agent#4484) Also redirect /docs -> /docs/edge in preview Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459) Docs for open-policy-agent/opa-envoy-plugin#323 Signed-off-by: Elliot Maincourt <e.maincourt@gmail.com> website/live-blocks: update caniuse/browserlist Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website/live-blocks: don't call the github api to determine release asset urls Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: don't run generate twice (open-policy-agent#4488) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> ast: Fix print call rewriting in else rules The compiler was accidentally checking/rewriting print calls in the body of else rules before the implicit args of the else rule were processed. As a result, the compiler was generating false-positive errors for refs to undeclared args. The root of the problem was the usage of WalkBodies on the top-level rule (which implicitly walks the bodies of else rules under the top-level rule). With this change, the compiler will call WalkBodies on the head and body of each rule rather than the entire rule itself (which includes the else chain). Fixes open-policy-agent#4489 Signed-off-by: Torin Sandall <torinsandall@gmail.com> Update maintainer status, add Will Beason (open-policy-agent#4465) Signed-off-by: Max Smythe <smythe@google.com> Co-authored-by: Will Beason <willbeason@google.com> Make maintainers list alphabetical (open-policy-agent#4491) Signed-off-by: Max Smythe <smythe@google.com> plugins/logs: Fix broken retry logic We were incorrectly resetting the retry counter after every error condition instead of using the incremented value. As a result, retry delay would always be 0s. This meant that if OPA encountered an error while uploading decision logs it would immediately retry instead of doing an exponential backoff. Fixes: open-policy-agent#4486 Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com> logging: mask authorization header value in debug logs (open-policy-agent#4496) Fixes open-policy-agent#4495 Signed-off-by: Anders Eknert <anders@eknert.com> Fixes open-policy-agent#4376 (open-policy-agent#4494) Do note though that this does not change how multiple with statements are grouped. Although I agree with that, it's IMHO a separate feature request, while the spacing issue is a bug. Signed-off-by: Anders Eknert <anders@eknert.com> runtime: improve log output for binary response (open-policy-agent#4498) This change omits the response body when using compression on metrics endpoint and when pprof is enabled. Signed-off-by: Christian Altamirano <christian.altamirano.ayala@gmail.com> build: bump golang: 1.17 -> 1.18 No change to go.mod's `go` stanza, so no changes in code compatibility. However, it's used for building our docker images and release binaries, and for fuzz testing in our nightly workflow. Some test-related changes with the dns lookup built-in function's error handling; and the hardcoded signature. Running go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000 makes me believe that for whatever reason the signature changed, it's at least stable. topdown/http_test: Test-only change to accomodate this change in Go (https://go.dev/doc/go1.18): Certificate.Verify now uses platform APIs to verify certificate validity on macOS and iOS when it is called with a nil VerifyOpts.Roots or when using the root pool returned from SystemCertPool. We're keeping the old message for go <= 1.17; in a silly-simple way. Also: * ci: build and test two old golang version on macos|linux We'll drop golang 1.15, keep one unsupported version (1.16). Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> ci: remove go-fuzz, use native go 1.18 fuzzer Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501) update bob_token and alice_token in markdown (open-policy-agent#4504) Signed-off-by: yongen.pan <yongen.pan@daocloud.io> topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503) What we previously did turned into a race condition with multiple concurrent calls to /v1/compile. With a change introduced with 0.38.0 (the `every` keyword), the `nil` Terms of an `ast.Expr` node was surfaced: previously, it would go unnoticed, but could potentially have yielded bad results. The effect of this change is proven using a new e2e test that would fail on the code we had previous. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> uploading latest changes Signed-off-by: Damien Burks <damien@damienjburks.com> fixing test cases Signed-off-by: Damien Burks <damien@damienjburks.com> sdk: avoid using different state (open-policy-agent#4505) I noticed that when operating on opa.state, locking is usually done to avoid a race, whereas here opa.state is used directly. By comparing the previous changes, I found that open-policy-agent#4240 changed the previous behaviour. This change adjusts that: we ensure that we work on the state as read via the mutex-protected `s := *opa.state`. Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io> build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467) https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0 https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1 https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512) This is a left-over from the previous bump, 0856120. https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1 Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513) When running on GHA, we've found this test to often fail on macos-latest: It would not functionally be wrong, but it also wasn't able to finish in the alloted time. Now, the maximum delta has been increased a lot (10ms to 500ms). It's much, but it's still good enough to ensure that the context passed to Stop() is the one that matters for shutdown. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs: Add clarifications for docs describing bundle signing and delta features This commit adds some clarification to the bundle doc regarding the format of bundle signatures and manifest behavior for delta bundles. Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com> runtime: change processed file event log level to info (open-policy-agent#4514) "Processed file watch event." will now be logged at level "info" (was "warn"). Signed-off-by: Anders Eknert <anders@eknert.com> runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502) * runtime/logging: only suppress payloads for handlers that compress responses To get compressed responses, two things need to be true: 1. The client must accept compressed responses 2. The handler must reply with a compressed response For general API requests, (1) holds most of the time. (2) is only true for the metrics endpoint at them moment, since the 3rd party library we use for serving the prometheus endpoint will do compression. * runtime/logging: remove dead code The http.Hijack stuff was related to a watch feature removed in open-policy-agent@186ef99 dropInputParam was only used by its tests. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> storage: code cosmetics Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> storage: allow implementations to override MakeDir Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> runtime+storage: integrate disk storage With this change, the disk backend (badger) becomes available for use with the OPA runtime properly: It can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled, - any data or policies stored with OPA will persist over restarts - per-query metrics related to disk usage are reported - Prometheus metrics per storage operation are exported The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or desaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage by yourself. See also open-policy-agent#4014. Future improvements around bundle loading are planned. Some notes on details: storage/disk: impose same locking regime used with inmem With this setup, we'll ensure: - there is only one open write txn at a time - there are any number of open read txns at a time - writes are blocked when reads are inflight - during a commit (and triggers being run), no read txns can be created This is to ensure the same atomic policy update semantics when using 'disk" as we have with "inmem". We're basically opting out of badger's currency control and transactionality guarantees. This is because we cannot piggy back on that to ensure the atomic update we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferrable since it mirrors inmem. Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, which is the prescribed solution to txns growing too big, also means that reads can see half of the "logical" transaction having been committed, while the rest is still getting processed. Another approach would have been using `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations. We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or the WriteBatch API. storage/disk: remove commit-and-renew-txn code for txn-too-big errors This would break transactional guarantees we care about: while there can be only one write transaction at a time, read transactions may happen while a write txn is underway -- with this commit-and-reset logic, those would read partial data. Now, the error will be returned to the caller. The maximum txn size depends on the size of memtables, and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data. storage/disk: implement noop MakeDir The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before. Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument addition to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step, now it's entangled with the patch calculation. storage/disk: check ctx.Err() in List/Get operations This won't abort reading a single key, but it will abort iterations. storage/disk: support patterns in partitions There is a potential clash here: "*", the path wildcard, is a valid path section. However, it only affects the case when a user would want to have a partition at /foo/*/bar and would really mean "*", and not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /fo/xyz/bar. storage/disk: keep per-txn-type histograms of stats This is done by reading off the metrics on commit, and shovelling their numbers into the prometheus collector. NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that. storage/disk: opt out of badger's conflict detection With only one write transaction in flight at any time, the situation that badger guards against cannot happen: A transaction has written to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs/management-bundles: remove quotations with info boxes (open-policy-agent#4517) Small follow-up to open-policy-agent#4515, using info boxes instead of generic markup. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix community site (open-policy-agent#4522) * remove hugo templating This page doesn't go through hugo at all, so these things would end up on the website. A quick fix is reverting the bits that added the templating. The downside is that local dev and PR previews will link to /docs/latest/ecosystem, which doesn't exist there. * link Documentation and Download to /docs For /docs/ (not /docs/latest) we have a redirect in local dev and PR preview mode: /docs -> /docs/edge so this will be a little less broken: only ecosystem is a 404 link on the preview. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> integrations: Add Emissary-Ingress (open-policy-agent#4523) This PR adds the Emissary ingress to the Ecosystem page. The blog section contains link to an article which guides on Emissary ingress and OPA integrations. Signed-off-by: Tayyab J <tayyab.jamadar@styra.com> storage/disk: wildcard partition validation, docs caveat (open-policy-agent#4519) A bunch of smaller follow-up tasks to open-policy-agent#4381. * storage/disk_test: check invalid patches with wildcard partition, too * docs/disk: add caveat re: bundles loaded into memory * storage/disk: auto-manage /system partitions If these are found in the user-provided partitions, we'll error out. * storage/disk: pretty-print partitions with "*" instead of %2A * storage/disk: respect wildcard-replacement in partition validation It is now allowed to replace a partition like /foo/bar by /foo/* also if multiple wildcards are used. Caveats: You cannot add a wildcard partition like /*/*, since it would overlap the managed "/system/*" partition. When attempting to go back from /foo/* to /foo/bar, an error is raised _unconditionally_ -- we could check the existing data, but currently don't. * storage/disk: check prefix when adding wildcard partitions The previously done check would have falsely returned that there is no problem when adding a wildcard partition: lookup of "/foo/*" with '*' not interpreted as a wildcard, but as a string, would yield a not-found, even if there was any data under /foo/. Now, we'll check the prefix-until-wildcard. It's more cautious than theoretically necessary, but safe. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website/style: fix code block bottom margin (open-policy-agent#4526) The text following regular code blocks was too close. Now, the bottom margin is the same was used with live-blocks, 1.5rem (24px). Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> docs: Add CloudFormation Hook tutorial (open-policy-agent#4525) Obviously missing is deployment of OPA to AWS. This will come in the next iteration. Signed-off-by: Anders Eknert <anders@eknert.com> Prepare v0.39.0 release (open-policy-agent#4524) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> Prepare v0.40.0 development (open-policy-agent#4527) Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: fix styra support link Signed-off-by: Torin Sandall <torinsandall@gmail.com> adopters: Add Wealthsimple (open-policy-agent#4530) Signed-off-by: hannahyeates <hannahgrey06@gmail.com> ci: fix rego check (open-policy-agent#4532) * build/policies: format using 0.39.0 * workflow/pull-request: use edge opa for rego PR checks Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> website: remove right margin on sidebar (open-policy-agent#4529) * remove right margin on sidebar * replace the removed margin with padding Signed-off-by: orweis <orweis@gmail.com> website: Remove unused variables to avoid error in strict mode(open-policy-agent#4534) To Fix: ``` 2 errors occurred: policy.rego:13: rego_compile_error: assigned var header unused policy.rego:13: rego_compile_error: assigned var signature unused ``` Signed-off-by: panpan0000 <panpan0000@msn.com> website: show yellow banner for old version (open-policy-agent#4533) Hopefully this is gets more attention than the yellow "older" button in the version picker. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com> fixing disk diagnostic test case removing additional space Signed-off-by: Damien Burks <damien@damienjburks.com> removing test files Signed-off-by: Damien Burks <damien@damienjburks.com> fixing linting issue Signed-off-by: Damien Burks <damien@damienjburks.com> adding test case for logging Signed-off-by: Damien Burks <damien@damienjburks.com> ecosystem: adding permit.io (open-policy-agent#4531) Signed-off-by: Oz Radiano <oz@permit.io> docs: CloudFormation fixes (open-policy-agent#4541) * Fix step headers to use h3 * Add warning about boolean values conversion Signed-off-by: Anders Eknert <anders@eknert.com>
With this change, the disk backend (badger) becomes available for use with the OPA runtime properly: It can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled, - any data or policies stored with OPA will persist over restarts - per-query metrics related to disk usage are reported - Prometheus metrics per storage operation are exported The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or desaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage by yourself. See also open-policy-agent#4014. Future improvements around bundle loading are planned. Some notes on details: storage/disk: impose same locking regime used with inmem With this setup, we'll ensure: - there is only one open write txn at a time - there are any number of open read txns at a time - writes are blocked when reads are inflight - during a commit (and triggers being run), no read txns can be created This is to ensure the same atomic policy update semantics when using 'disk" as we have with "inmem". We're basically opting out of badger's currency control and transactionality guarantees. This is because we cannot piggy back on that to ensure the atomic update we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferrable since it mirrors inmem. Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, which is the prescribed solution to txns growing too big, also means that reads can see half of the "logical" transaction having been committed, while the rest is still getting processed. Another approach would have been using `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations. We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or the WriteBatch API. storage/disk: remove commit-and-renew-txn code for txn-too-big errors This would break transactional guarantees we care about: while there can be only one write transaction at a time, read transactions may happen while a write txn is underway -- with this commit-and-reset logic, those would read partial data. Now, the error will be returned to the caller. The maximum txn size depends on the size of memtables, and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data. storage/disk: implement noop MakeDir The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before. Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument addition to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step, now it's entangled with the patch calculation. storage/disk: check ctx.Err() in List/Get operations This won't abort reading a single key, but it will abort iterations. storage/disk: support patterns in partitions There is a potential clash here: "*", the path wildcard, is a valid path section. However, it only affects the case when a user would want to have a partition at /foo/*/bar and would really mean "*", and not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /fo/xyz/bar. storage/disk: keep per-txn-type histograms of stats This is done by reading off the metrics on commit, and shovelling their numbers into the prometheus collector. NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that. storage/disk: opt out of badger's conflict detection With only one write transaction in flight at any time, the situation that badger guards against cannot happen: A transaction has written to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
Now that OPA includes a disk-based implementation of the storage.Store interface, we can work on integrating the implementation into the OPA server. This issue tracks the work that needs to get done and will need to be updated as there are probably unknowns:
The text was updated successfully, but these errors were encountered: