add unit test for authz policy #1123
Conversation
What the test does

The new test does the following:
What the test verifies

This essentially checks the result of authorizing all possible actions on all possible roles ... for a subset of resources. The subset includes Fleet, Silos, Organizations, Projects, a couple of things inside Projects, and one resource that's two levels below Project (Vpc -> VpcSubnet). It's verifying:
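(The specific assertions aren't preserved above, but roughly speaking the check has this shape. This is a sketch only; the function and parameter names are assumptions, not the PR's actual code.)

```rust
/// Sketch of the cross-product check: authorize every combination of test
/// user, action, and resource, and compare against what the policy is
/// expected to allow.  All names and signatures here are hypothetical.
fn verify_policy<U, A, R>(
    users: &[U],
    actions: &[A],
    resources: &[R],
    authorize: impl Fn(&U, &A, &R) -> bool,
    expected: impl Fn(&U, &A, &R) -> bool,
) {
    for user in users {
        for resource in resources {
            for action in actions {
                assert_eq!(
                    authorize(user, action, resource),
                    expected(user, action, resource),
                    "authorization result mismatch"
                );
            }
        }
    }
}
```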
This does not cover all possible resources though. I'd like it to, but there's a problem.

The problem: it takes 5 minutes

Already, this test takes almost 5 minutes on my machine. I'll drop some notes below about the performance issue, but I don't see low-hanging fruit.

What to do?

I'd like thoughts on where to go from here. I can see:
Other ideas? There are a few things I can do to improve the performance of the test, but I don't think it'll help that much:
Notes on the test's performance

While the test is running, I profiled it for 60s at 97 Hz with DTrace:
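(The exact invocation isn't preserved above; a plausible shape for it, assuming profiling by pid and the 160-frame stack depth mentioned below, is something like this.)

```sh
# Sketch only: sample user stacks at 97 Hz for 60 seconds.  <pid> is a
# placeholder for the test process; ustackframes=160 sets the stack depth
# (see the note below about needing 160 frames).
dtrace -x ustackframes=160 \
  -n 'profile-97 /pid == $target/ { @[ustack()] = count(); }' \
  -n 'tick-60s { exit(0); }' \
  -p <pid> -o role-stacks.out
```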
(Note: 80 frames was not enough -- I needed 160 to capture everything.)

I demangled this with:
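(Again, the exact command isn't preserved; one common option, assuming `rustfilt` is installed via `cargo install rustfilt`, is:)

```sh
# Demangle the Rust symbols in the collected stacks.
rustfilt < role-stacks.out > role-stacks-demangled.out
```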
That gave somewhat mixed results (some frames are now missing useful names, but others are clearer 🤷).

I used Brendan's tools to make a flame graph:
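(A typical invocation of those tools, reusing the hypothetical file names from the sketches above:)

```sh
# Fold the DTrace stacks and render them with Brendan Gregg's FlameGraph
# scripts (https://github.com/brendangregg/FlameGraph).
./stackcollapse.pl role-stacks-demangled.out > role-stacks.folded
./flamegraph.pl role-stacks.folded > role-stacks.svg
```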
I'll attach the various data files and flame graph to this comment. It's hard to read the flame graph, but we're clearly spending nearly all of our time in the tokio task that's doing the tests. (That's good? What else would we be doing?) Much of it seems to be inside Oso and I'm not sure why various nearly-identical stacks aren't identical. Let's quantify the Oso bit:
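(The command and its output aren't preserved here; one rough way to do this count over the folded stacks from the sketch above:)

```sh
# Count samples whose stack mentions oso/polar anywhere, out of all samples.
awk '{ total += $NF } /oso|polar/ { oso += $NF } END { print oso " of " total " samples" }' \
  role-stacks.folded
```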
The flame graph tool reports there are 6,090 samples, which matches the result above. This is worth expanding on a bit: profiling on all CPUs for 60s at 97 Hz, for a single-threaded process that was pegged on CPU, we'd expect 60 * 97 = 5,820 samples. In fact, the flame graph shows a second tokio task that's on CPU for 386 samples -- it's the slog async drain. If we assume that's on a separate thread (which I don't know for sure, but it seems likely because it will almost always have some work to do) and subtract its 386 samples from the 6,090 total, that leaves 5,704. And we saw above that Oso is on the stack for 5,661 of them -- 99.2%. I think that means Oso is responsible for around 99% of the total wall-clock execution time of this test.

(One caveat: the test takes a bit of time up front to set up the resource hierarchy, users, and policies. I didn't profile that part, but I believe it was a very small fraction of the 4m40s that this test took.)

Again, this isn't so bad: this test is designed to do nothing but run authz checks, so it's not surprising that we'd be spending nearly all of our time in Oso. The bigger question is: how long are these checks taking? The test takes about 4m40s on my machine:
There are 12 test users times 8 actions = 96 access control checks per resource. There are 26 resources in the test. That's 2496 checks. If we ignore the setup time, that's 8.9 checks per second, or 112ms per check. That's...not fast. But this is a debug build, too. That's probably not representative of the performance of a release build. Out of curiosity, I went to look at what Oso functions we see in the stack traces:
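(The listing isn't preserved here; a rough sketch of how one might produce it, assuming Oso's core lives in the `polar_core` crate:)

```sh
# Tally the distinct polar_core function names that appear in the demangled stacks.
grep -o 'polar_core::[A-Za-z0-9_:]*' role-stacks-demangled.out | sort | uniq -c | sort -rn | head
```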
That adds up to 5448, which convinces me further that most of our stacks do have Oso on the scene. Here's the raw data: role-stacks-data.zip.
I've done this in the past for tests that did a lot of number crunching; I've seen on the order of 100x in at least one case (some data science matrix math stuff). On that project we ended up enabling optimizations in debug builds to avoid the pitfall of "I ran

I don't know what oso is doing under the hood, but on my machine there's about a 10x difference in this test for debug vs release:
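(The timing data isn't preserved here. For what it's worth, the comparison itself is just running the same test under both profiles; the test name filter below is hypothetical.)

```sh
cargo test test_iam_roles            # default dev profile
cargo test --release test_iam_roles  # optimized profile
```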
@jgallagher thanks for that data! What I meant was: given how much longer it can take to compile things for release, would it wind up being a net win? I didn't realize how easy it was to try, and with a 10x win on runtime, seems worth trying. I'll test this now.
Having never used
So that was 9m6s build time plus 28s runtime. More interesting is probably the incremental rebuild case. From that state:
At that point, if I do a
So the incremental rebuild case is very slightly better with
Ouch. I wonder if there's a middle ground somewhere by fiddling with the different options in https://doc.rust-lang.org/cargo/reference/profiles.html. I kinda doubt it, and if there is, I worry about fragility, but I'll try a couple "obvious" combinations and see what the times look like on my machine.
For reference: from
So that's 14m45s, of which just over half was build time. I'm going to try with
From
The build took a fair bit longer, the tests ran faster, and in total it took a little more wall-clock time. It's a lot more CPU time, though, so on less beefy machines I'd expect this to take quite a lot longer.
Would you mind retrying on your machine with this diff?

```diff
@@ -61,6 +61,7 @@ resolver = "2"
 [profile.dev]
 panic = "abort"
+opt-level = 1

 [profile.release]
 panic = "abort"
```

On my machine, a clean build + this test without that diff:
And with the diff:
The build time goes up ~60%, but the test time reduction more than makes up for that. I'll follow up with a full repo build+test on my machine and post those results shortly.
With that diff, running what you ran:
Caveat: this isn't comparable to what I tested above, though, because the selection of the specific

Another consideration: I rebuild much more often than I run tests. I don't know about others. I'm not sure a tradeoff of build time for test run time is worth it.

This is a good discussion and I think it's worth spinning it out into a separate issue. As far as this test goes, my inclination is to include this test and potentially iterate on the build and test times, unless I hear feedback that people would rather not do that. My thinking: tests are important, there are tools for limiting which tests you run with each invocation, and in the meantime one can always run with
With `profile.dev.opt-level = 1`, otherwise from this branch (so with this test):
So it looks like the build takes 80% longer (1.8x the time) and the total time is 10% longer (1.1x the time).
I'm planning to resurrect this PR and hope to land it soon. The runtime is down to 53s on my machine. I think this came after @plotnick convinced me to try parallelizing the checks.
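(A minimal sketch of the parallelization idea, not the PR's actual code: instead of awaiting each authz check in sequence, run a bounded number of them concurrently.)

```rust
use futures::stream::{self, StreamExt};

/// Run the supplied check futures with bounded concurrency and collect their
/// results.  The concurrency limit of 16 is an arbitrary choice.
async fn run_checks<F>(checks: Vec<F>) -> Vec<bool>
where
    F: std::future::Future<Output = bool>,
{
    stream::iter(checks).buffer_unordered(16).collect().await
}
```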
Okay, this is cleaned up quite a bit and better documented now. It also includes a coverage check with an exemption list so that we can avoid accidentally not testing future resources. (There are still quite a few resources we could test here but don't.)
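(A minimal sketch of what such a coverage check can look like; the names and the exemption list here are hypothetical, not the PR's actual code.)

```rust
use std::collections::BTreeSet;

/// Hypothetical exemption list: resource kinds we knowingly don't test yet.
const EXEMPTED_RESOURCES: &[&str] = &["Image", "Snapshot"];

/// Fail if any resource kind is neither covered by the test nor explicitly
/// exempted, so newly added resources can't silently go untested.
fn check_coverage(all_resources: &BTreeSet<String>, tested: &BTreeSet<String>) {
    let exempted: BTreeSet<String> =
        EXEMPTED_RESOURCES.iter().map(|s| s.to_string()).collect();
    let missing: Vec<_> = all_resources
        .iter()
        .filter(|r| !tested.contains(*r) && !exempted.contains(*r))
        .collect();
    assert!(missing.is_empty(), "resources neither tested nor exempted: {:?}", missing);
}
```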
tl;dr: This test seemed to reliably hang on Linux due to osohq/oso#1592. I've updated this branch to pull in Oso 0.26.2 and the test seems to pass reliably now.

Details: the buildomat Ubuntu jobs for this PR and #1580 (which is based on this one) kept timing out after several hours. I feared this was some pathological performance problem, since this test does take a while, but that fear was unfounded. The hang was reproducible on at least three other Linux systems (thanks @jgallagher and @plotnick), two Ubuntu and one Debian. In all three cases, we found the test stopped using CPU at some point partway through (varying in how far it got), suggesting an outright hang rather than slowness. The last log entries, when we had them, looked like this:
Not surprisingly, we seemed to be stuck in authorization (since this is all this test does). At this point I suspected a problem with the database queries to load roles, but CockroachDB was behaving normally. And in retrospect, the "roles" message above is logged after all database queries and shortly before we enter Oso for the actual policy check. @jgallagher attached with gdb and observed that all the tokio runtime threads were blocked in std::sync::rwlock, "looks like 1 is trying to acquire for writing and the others are reading". The one trying to take the write lock is this one:
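(The stack itself isn't preserved here. For reference, per-thread backtraces of a hung process like the ones described can be captured with something along these lines:)

```sh
# Attach to the hung test process (<pid> is a placeholder) and dump a
# backtrace for every thread.
gdb -p <pid> -batch -ex 'thread apply all bt'
```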
John found where that lock is acquired in Oso, and it has been changed to a read lock in 0.26.2, hence we think this is osohq/oso#1592. Updating this branch to 0.26.2 caused the test to pass on my system and on John's system as well. There are some follow-ups here:
This looks fantastic now! The `test_iam_roles_behavior` test time is down to ~79 seconds on my 8-core machine, and compiling Oso with `opt-level = 3` only reduces that by a few seconds. So I think we're now OK to run this test by default in CI. The safety net this test brings to the whole authz sub-system seems well worth the (now) modest cost, though we should definitely keep an eye on runtime as we work to shrink the exempted class list.
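(For reference, per-dependency optimization like that is expressed with a Cargo profile override; a sketch, assuming the crates worth overriding are `oso` and `polar-core`:)

```toml
# Build just the Oso crates with optimizations, even in dev builds.
[profile.dev.package.oso]
opt-level = 3

[profile.dev.package.polar-core]
opt-level = 3
```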
I don't have any specific comments on the code. The new comments are extremely helpful, the resource builder code and authz runners are straightforward, and the test output table is amazing (although the formatting does get slightly wonky in fonts where the ✔ character is double-width).
Thanks for tackling this and persevering to drive it over the line!
This PR adds a test that checks the permissions granted for all supported roles.