add unit test for authz policy #1123
Conversation
What the test does

The new test does the following:
What the test verifies

This essentially checks the result of authorizing all possible actions on all possible roles ... for a subset of resources. The subset includes Fleet, Silos, Organizations, Projects, a couple of things inside Projects, and one resource that's two levels below Project (Vpc -> VpcSubnet). It's verifying:
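(The specific assertions aren't preserved above, but roughly speaking the check has this shape. This is a sketch only; the function and parameter names are assumptions, not the PR's actual code.)

```rust
/// Sketch of the cross-product check: authorize every combination of test
/// user, action, and resource, and compare against what the policy is
/// expected to allow.  All names and signatures here are hypothetical.
fn verify_policy<U, A, R>(
    users: &[U],
    actions: &[A],
    resources: &[R],
    authorize: impl Fn(&U, &A, &R) -> bool,
    expected: impl Fn(&U, &A, &R) -> bool,
) {
    for user in users {
        for resource in resources {
            for action in actions {
                assert_eq!(
                    authorize(user, action, resource),
                    expected(user, action, resource),
                    "authorization result mismatch"
                );
            }
        }
    }
}
```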
This does not cover all possible resources though. I'd like it to, but there's a problem.

The problem: it takes 5 minutes

Already, this test takes almost 5 minutes on my machine. I'll drop some notes below about the performance issue, but I don't see low-hanging fruit.

What to do?

I'd like thoughts on where to go from here. I can see:
Other ideas? There are a few things I can do to improve the performance of the test, but I don't think it'll help that much:
Notes on the test's performance

While the test is running, I profiled it for 60s at 97 Hz with DTrace:
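(The exact invocation isn't preserved above; a plausible shape for it, assuming profiling by pid and the 160-frame stack depth mentioned below, is something like this.)

```sh
# Sketch only: sample user stacks at 97 Hz for 60 seconds.  <pid> is a
# placeholder for the test process; ustackframes=160 sets the stack depth
# (see the note below about needing 160 frames).
dtrace -x ustackframes=160 \
  -n 'profile-97 /pid == $target/ { @[ustack()] = count(); }' \
  -n 'tick-60s { exit(0); }' \
  -p <pid> -o role-stacks.out
```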
(Note: 80 frames was not enough -- I needed 160 to capture everything.)

I demangled this with:
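(Again, the exact command isn't preserved; one common option, assuming `rustfilt` is installed via `cargo install rustfilt`, is:)

```sh
# Demangle the Rust symbols in the collected stacks.
rustfilt < role-stacks.out > role-stacks-demangled.out
```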
That gave somewhat mixed results (some frames are now missing useful names, but others are clearer 🤷).

I used Brendan's tools to make a flame graph:
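(A typical invocation of those tools, reusing the hypothetical file names from the sketches above:)

```sh
# Fold the DTrace stacks and render them with Brendan Gregg's FlameGraph
# scripts (https://github.com/brendangregg/FlameGraph).
./stackcollapse.pl role-stacks-demangled.out > role-stacks.folded
./flamegraph.pl role-stacks.folded > role-stacks.svg
```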
I'll attach the various data files and flame graph to this comment. It's hard to read the flame graph, but we're clearly spending nearly all of our time in the tokio task that's doing the tests. (That's good? What else would we be doing?) Much of it seems to be inside Oso and I'm not sure why various nearly-identical stacks aren't identical. Let's quantify the Oso bit:
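(The command and its output aren't preserved here; one rough way to do this count over the folded stacks from the sketch above:)

```sh
# Count samples whose stack mentions oso/polar anywhere, out of all samples.
awk '{ total += $NF } /oso|polar/ { oso += $NF } END { print oso " of " total " samples" }' \
  role-stacks.folded
```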
The flame graph tool reports there are 6,090 samples, which matches the result above. This is worth expanding on a bit: profiling on all CPUs for 60s at 97 Hz, for a single-threaded process that was pegged on CPU, we'd expect 60 * 97 = 5,820 samples. In fact, the flame graph shows a second tokio task that's on CPU for 386 samples -- it's the slog async drain. If we assume that's on a separate thread (which I don't know for sure, but it seems likely because it will almost always have some work to do) and subtract its 386 samples from the 6,090 total, that leaves 5,704. And we saw above that Oso is on the stack for 5,661 of them -- 99.2%. I think that means Oso is responsible for around 99% of the total wall-clock execution time of this test.

(One caveat: the test takes a bit of time up front to set up the resource hierarchy, users, and policies. I didn't profile that part, but I believe it was a very small fraction of the 4m40s that this test took.)

Again, this isn't so bad: this test is designed to do nothing but run authz checks, so it's not surprising that we'd be spending nearly all of our time in Oso. The bigger question is: how long are these checks taking? The test takes about 4m40s on my machine:
There are 12 test users times 8 actions = 96 access control checks per resource. There are 26 resources in the test. That's 2496 checks. If we ignore the setup time, that's 8.9 checks per second, or 112ms per check. That's...not fast. But this is a debug build, too. That's probably not representative of the performance of a release build. Out of curiosity, I went to look at what Oso functions we see in the stack traces:
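(The listing isn't preserved here; a rough sketch of how one might produce it, assuming Oso's core lives in the `polar_core` crate:)

```sh
# Tally the distinct polar_core function names that appear in the demangled stacks.
grep -o 'polar_core::[A-Za-z0-9_:]*' role-stacks-demangled.out | sort | uniq -c | sort -rn | head
```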
That adds up to 5448, which convinces me further that most of our stacks do have Oso on the scene. Here's the raw data: role-stacks-data.zip.
I've done this in the past for tests that did a lot of number crunching; I've seen on the order of 100x in at least one case (some data science matrix math stuff). On that project we ended up enabling optimizations in debug builds to avoid the pitfall of "I ran

I don't know what oso is doing under the hood, but on my machine there's about a 10x difference in this test for debug vs release:
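(The timing data isn't preserved here. For what it's worth, the comparison itself is just running the same test under both profiles; the test name filter below is hypothetical.)

```sh
cargo test test_iam_roles            # default dev profile
cargo test --release test_iam_roles  # optimized profile
```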
@jgallagher thanks for that data! What I meant was: given how much longer it can take to compile things for release, would it wind up being a net win? I didn't realize how easy it was to try, and with a 10x win on runtime, seems worth trying. I'll test this now.
Having never used
So that was 9m6s build time plus 28s runtime. More interesting is probably the incremental rebuild case. From that state:
At that point, if I do a
So the incremental rebuild case is very slightly better with
Ouch. I wonder if there's a middle ground somewhere by fiddling with the different options in https://doc.rust-lang.org/cargo/reference/profiles.html. I kinda doubt it, and if there is, I worry about fragility, but I'll try a couple "obvious" combinations and see what the times look like on my machine.
For reference: from
So that's 14m45s, of which just over half was build time. I'm going to try with
From
The build took a fair bit longer, the tests ran faster, and in total it took a little more wall-clock time. It's a lot more CPU time, though, so on less beefy machines I'd expect this to take quite a lot longer.
Would you mind retrying on your machine with this diff?

```diff
@@ -61,6 +61,7 @@ resolver = "2"
 [profile.dev]
 panic = "abort"
+opt-level = 1

 [profile.release]
 panic = "abort"
```

On my machine, a clean build + this test without that diff:
And with the diff:
The build time goes up ~60%, but the test time reduction more than makes up for that. I'll follow up with a full repo build+test on my machine and post those results shortly.
With that diff, running what you ran:
Caveat: this isn't comparable to what I tested above, though, because the selection of the specific

Another consideration: I rebuild much more often than I run tests. I don't know about others. I'm not sure a tradeoff of build time for test run time is worth it.

This is a good discussion and I think it's worth spinning it out into a separate issue. As far as this test goes, my inclination is to include this test and potentially iterate on the build and test times, unless I hear feedback that people would rather not do that. My thinking: tests are important, there are tools for limiting which tests you run with each invocation, and in the meantime one can always run with
With `profile.dev.opt-level = 1`, otherwise from this branch (so with this test):
So it looks like the build takes 80% longer (1.8x the time) and the total time is 10% longer (1.1x the time).
I'm planning to resurrect this PR and hope to land it soon. The runtime is down to 53s on my machine. I think this came after @plotnick convinced me to try parallelizing the checks.
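(A minimal sketch of the parallelization idea, not the PR's actual code: instead of awaiting each authz check in sequence, run a bounded number of them concurrently.)

```rust
use futures::stream::{self, StreamExt};

/// Run the supplied check futures with bounded concurrency and collect their
/// results.  The concurrency limit of 16 is an arbitrary choice.
async fn run_checks<F>(checks: Vec<F>) -> Vec<bool>
where
    F: std::future::Future<Output = bool>,
{
    stream::iter(checks).buffer_unordered(16).collect().await
}
```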
Okay, this is cleaned up quite a bit and better documented now. It also includes a coverage check with an exemption list so that we can avoid accidentally not testing future resources. (There are still quite a few resources we could test here but don't.)
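(A minimal sketch of what such a coverage check can look like; the names and the exemption list here are hypothetical, not the PR's actual code.)

```rust
use std::collections::BTreeSet;

/// Hypothetical exemption list: resource kinds we knowingly don't test yet.
const EXEMPTED_RESOURCES: &[&str] = &["Image", "Snapshot"];

/// Fail if any resource kind is neither covered by the test nor explicitly
/// exempted, so newly added resources can't silently go untested.
fn check_coverage(all_resources: &BTreeSet<String>, tested: &BTreeSet<String>) {
    let exempted: BTreeSet<String> =
        EXEMPTED_RESOURCES.iter().map(|s| s.to_string()).collect();
    let missing: Vec<_> = all_resources
        .iter()
        .filter(|r| !tested.contains(*r) && !exempted.contains(*r))
        .collect();
    assert!(missing.is_empty(), "resources neither tested nor exempted: {:?}", missing);
}
```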
tl;dr: This test seemed to reliably hang on Linux due to osohq/oso#1592. I've updated this branch to pull in Oso 0.26.2 and the test seems to pass reliably now.

Details: the buildomat Ubuntu jobs for this PR and #1580 (which is based on this one) kept timing out after several hours. I feared this was some pathological performance problem, since this test does take a while, but that fear was unfounded. The hang was reproducible on at least three other Linux systems (thanks @jgallagher and @plotnick), two Ubuntu and one Debian. In all three cases, we found the test stopped using CPU at some point partway through (varying in how far it got), suggesting an outright hang rather than slowness. The last log entries, when we had them, looked like this:
Not surprisingly, we seemed to be stuck in authorization (since this is all this test does). At this point I suspected a problem with the database queries to load roles, but CockroachDB was behaving normally. And in retrospect, the "roles" message above is logged after all database queries and shortly before we enter Oso for the actual policy check. @jgallagher attached with gdb and observed that all the tokio runtime threads were blocked in std::sync::rwlock, "looks like 1 is trying to acquire for writing and the others are reading". The one trying to take the write lock is this one:
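(The stack itself isn't preserved here. For reference, per-thread backtraces of a hung process like the ones described can be captured with something along these lines:)

```sh
# Attach to the hung test process (<pid> is a placeholder) and dump a
# backtrace for every thread.
gdb -p <pid> -batch -ex 'thread apply all bt'
```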
John found where that lock is acquired in Oso, and it has been changed to a read lock in 0.26.2, hence we think this is osohq/oso#1592. Updating this branch to 0.26.2 caused the test to pass on my system and on John's system as well. There are some follow-ups here:
This looks fantastic now! The `test_iam_roles_behavior` test time is down to ~79 seconds on my 8-core machine, and compiling Oso with `opt-level = 3` only reduces that by a few seconds. So I think we're now OK to run this test by default in CI. The safety net this test brings to the whole authz sub-system seems well worth the (now) modest cost, though we should definitely keep an eye on runtime as we work to shrink the exempted class list.
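(For reference, per-dependency optimization like that is expressed with a Cargo profile override; a sketch, assuming the crates worth overriding are `oso` and `polar-core`:)

```toml
# Build just the Oso crates with optimizations, even in dev builds.
[profile.dev.package.oso]
opt-level = 3

[profile.dev.package.polar-core]
opt-level = 3
```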
I don't have any specific comments on the code. The new comments are extremely helpful, the resource builder code and authz runners are straightforward, and the test output table is amazing (although the formatting does get slightly wonky in fonts where the ✔ character is double-width).
Thanks for tackling this and persevering to drive it over the line!
This PR adds a test that checks the permissions granted for all supported roles.