add unit test for authz policy #1123

Merged
merged 27 commits into from
Aug 12, 2022

Conversation

davepacheco
Collaborator

@davepacheco davepacheco commented May 26, 2022

This PR adds a test that checks the permissions granted for all supported roles.

@davepacheco
Collaborator Author

What the test does

The new test does the following:

  • sets up a DataStore and populates it as usual (does not set up the rest of Nexus)
  • sets up a hierarchy of about two dozen resources: two Silos, two Orgs in one Silo, two Projects in one Org, etc.
  • for each assignable role on each resource that supports roles, creates a test user in Silo1 / Org1 / Project1 with that role. This is about a dozen roles.
  • for each resource, for each user it created, and for each possible action (of which there are 8), checks whether that user is allowed to perform that action on that resource
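
The nested iteration in the last step can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `check_all`, `CheckResult`, and the closure standing in for `authorize()` are hypothetical names, and resources, users, and actions are reduced to plain strings.

```rust
/// One row of the result table: did `user` get to perform `action` on `resource`?
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckResult {
    pub resource: String,
    pub user: String,
    pub action: String,
    pub allowed: bool,
}

/// Runs every (resource, user, action) combination through `is_allowed`.
/// `is_allowed(user, action, resource)` is a stand-in for the real authz check.
pub fn check_all<F>(
    resources: &[&str],
    users: &[&str],
    actions: &[&str],
    is_allowed: F,
) -> Vec<CheckResult>
where
    F: Fn(&str, &str, &str) -> bool,
{
    let mut results = Vec::with_capacity(resources.len() * users.len() * actions.len());
    for &resource in resources {
        for &user in users {
            for &action in actions {
                results.push(CheckResult {
                    resource: resource.to_string(),
                    user: user.to_string(),
                    action: action.to_string(),
                    allowed: is_allowed(user, action, resource),
                });
            }
        }
    }
    results
}
```

The number of checks is the product of the three list lengths, which is why adding resources to the test multiplies its cost.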

What the test verifies

This essentially checks the result of authorizing all possible actions on all possible roles ... for a subset of resources. The subset includes Fleet, Silos, Organizations, Projects, a couple of things inside Projects, and one resource that's two levels below Project (Vpc -> VpcSubnet). It's verifying:

  • the Oso policy file itself, including the snippets for all the resources tested
  • the implementation of the authz subsystem from authorize() down, including:
    • storing role assignments into the database
    • fetching them back out again during authorize()
    • the authz::* structs that get passed into Oso
  • Oso (at least the parts we're using)

This does not cover all possible resources though. I'd like it to, but there's a problem:

The problem: it takes 5 minutes

Already, this test takes almost 5 minutes on my machine. I'll drop some notes below about the performance issue, but I don't see low-hanging fruit.

What to do?

I'd like thoughts on where to go from here. I can see:

  • Add the test as-is and deal with the fact that the test suite will take a few minutes longer.
  • Add this as an "ignored" test so that people don't have to deal with this most of the time. This kind of defeats the point but it'd be better than not adding it at all.
  • Add this as an "ignored" test and have CI run the ignored tests too. There's at least one other ignored test I know of that really is a waste of time to run in CI but it only takes a few seconds so maybe that doesn't matter.
  • Create a stripped-down version of this that's not so exhaustive. I don't love this because this exhaustive test feels pretty critical for ensuring we don't ship something with a gaping security hole.
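
For reference, the "ignored" options above rely on Rust's built-in `#[ignore]` attribute. A minimal sketch, assuming a hypothetical `role_allows` function in place of the real policy check:

```rust
/// Stand-in for the real authorization check (hypothetical, illustration only).
pub fn role_allows(role: &str, action: &str) -> bool {
    match role {
        "admin" => true,
        "viewer" => action == "Read",
        _ => false,
    }
}

// Skipped by a plain `cargo test`. Run only ignored tests with
// `cargo test -- --ignored`, or everything with `cargo test -- --include-ignored`.
#[test]
#[ignore = "exhaustive policy check is slow"]
fn exhaustive_policy_check() {
    for action in ["Read", "Modify", "Delete"] {
        assert!(role_allows("admin", action));
    }
    assert!(!role_allows("viewer", "Modify"));
}
```

Having CI pass `--include-ignored` would correspond to the third option: developers skip the slow test by default while CI still runs it.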

Other ideas?

There are a few things I can do to improve the performance of the test, but I don't think it'll help that much:

  • Parallelize the test. I'm not sure this will help much in CI, but it'll help people running locally with a bunch of cores. Keep in mind that unless you're running individual tests (in which case you could skip this one), it's already going to be parallelized with the other tests.
  • Try running the tests in release mode and see if they're any faster. I'd be surprised if this reduced the net total time since we're only running the test program once per compile the vast majority of the time.
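
One way the parallelization idea could look, as a sketch only: it uses scoped threads from the standard library, which is an assumption made for illustration and not necessarily how this test would be parallelized. `count_allowed_parallel` and the `is_allowed` closure are hypothetical.

```rust
use std::thread;

/// Counts how many (user, action) checks are allowed, splitting the list
/// across up to `workers` threads. `is_allowed` stands in for the real check.
pub fn count_allowed_parallel<F>(
    checks: &[(String, String)],
    workers: usize,
    is_allowed: F,
) -> usize
where
    F: Fn(&str, &str) -> bool + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every check lands in exactly one chunk.
    let chunk_len = ((checks.len() + workers - 1) / workers).max(1);
    let is_allowed = &is_allowed;
    thread::scope(|scope| {
        let handles: Vec<_> = checks
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .filter(|(user, action)| is_allowed(user.as_str(), action.as_str()))
                        .count()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Because each check is independent, the work splits cleanly; the speedup is bounded by the number of cores actually free while the rest of the test suite runs.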

@davepacheco
Collaborator Author

Notes on the test's performance

While the test is running, prstat -mLc 1 shows the test process spending upwards of 96% of time on-CPU in userland. It's kind of impressive to do that for several minutes in a row! I was surprised -- I thought we'd be spending a lot of time on the database accesses.

I profiled it for 60s at 97 Hz with DTrace:

$ pfexec dtrace -n 'profile-97/pid == $1/{ @[ustack(160)] = count(); }' 2564 -c 'sleep 60' > ~/role-test-stacks.out

(note: 80 frames was not enough -- I needed 160 to capture everything)

I demangled this with:

demangle < ~/role-test-stacks.out > ~/role-test-stacks-demangled.out

with somewhat mixed results (some frames are now missing useful names, but others are clearer 🤷 )

I used Brendan's tools to make a flame graph:

./stackcollapse.pl < ../role-test-stacks-demangled.out > role-test-stacks-demangled-collapsed.out
./flamegraph.pl < role-test-stacks-demangled-collapsed.out > role-test-stacks-demangled-collapsed.svg

I'll attach the various data files and flame graph to this comment.

It's hard to read the flame graph, but we're clearly spending nearly all of our time in the tokio task that's doing the tests. (That's good? What else would we be doing?) Much of it seems to be inside Oso and I'm not sure why various nearly-identical stacks aren't identical. Let's quantify the Oso bit:

$ grep Oso role-test-stacks-demangled-collapsed.out | awk '{sum += $NF} END { print sum }'
5661
$ grep -v Oso role-test-stacks-demangled-collapsed.out | awk '{sum += $NF} END { print sum }'
429

(flame graph image: role-test-stacks-demangled-collapsed)

The flame graph tool reports there are 6,090 samples, which matches the result above. This is worth expanding on a bit: profiling on all CPUs for 60s at 97 Hz, for a single-threaded process that was pegged on CPU, we'd expect 60 * 97 = 5,820 samples. In fact, the flame graph shows a second tokio task that's on CPU for 386 samples -- it's the slog async drain. If we assume that's on a separate thread (which I don't know for sure, but seems likely because it will almost always have some work to do) and subtract its 386 samples from the 6,090 total, that leaves 5,704. And we saw above that Oso is on the stack for 5,661 of them -- 99.2%. I think that means Oso is responsible for around 99% of the total wall-clock execution time of this test. (One caveat: the test takes a bit of time up front to set up the resource hierarchy, users, and policies. I didn't profile these. I think this was a very small fraction of the 4m40s that this test took.)
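
The sample accounting above can be reproduced with a few lines of code. This is a sketch with hypothetical helper names; it assumes the collapsed-stack format produced by stackcollapse.pl, where each line is a semicolon-joined stack followed by a sample count.

```rust
/// Sums sample counts for lines that do and do not contain `needle`,
/// mirroring the `grep ... | awk '{sum += $NF}'` pipelines.
pub fn split_samples(collapsed: &str, needle: &str) -> (u64, u64) {
    let mut with_needle = 0u64;
    let mut without_needle = 0u64;
    for line in collapsed.lines() {
        // The last whitespace-separated field is the sample count.
        let count: u64 = match line.split_whitespace().last().and_then(|f| f.parse().ok()) {
            Some(n) => n,
            None => continue, // skip blank or malformed lines
        };
        if line.contains(needle) {
            with_needle += count;
        } else {
            without_needle += count;
        }
    }
    (with_needle, without_needle)
}

/// Samples expected for one thread pegged on CPU for `seconds` at `hz`.
pub fn expected_samples(seconds: u64, hz: u64) -> u64 {
    seconds * hz
}

/// Fraction of the main thread's samples with Oso on the stack, after
/// subtracting samples attributed to another thread.
pub fn oso_fraction(total: u64, other_thread: u64, oso: u64) -> f64 {
    oso as f64 / (total - other_thread) as f64
}
```

With the numbers from this profile, `expected_samples(60, 97)` gives 5,820 and `oso_fraction(6090, 386, 5661)` is 5661 / 5704, or about 0.992.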

Again, this isn't so bad: this test is designed to do nothing but run authz checks, so it's not surprising that we'd be spending nearly all of our time in Oso. The bigger question is how long are these checks taking? The test is taking about 4m40s on my machine:

$ time EXPECTORATE=overwrite cargo test -p omicron-nexus --lib test_iam_roles -- --nocapture
...
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 280.86s


real    4m41.358s
user    4m49.785s
sys     0m20.216s

There are 12 test users times 8 actions = 96 access control checks per resource. There are 26 resources in the test. That's 2496 checks. If we ignore the setup time, that's 8.9 checks per second, or 112ms per check. That's...not fast. But this is a debug build, too. That's probably not representative of the performance of a release build.
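
The same estimate as a small worked calculation (hypothetical helper names; the inputs are the figures quoted above, and setup time is ignored as in the text):

```rust
/// Total access control checks: every user tries every action on every resource.
pub fn total_checks(users: u32, actions: u32, resources: u32) -> u32 {
    users * actions * resources
}

/// Average wall-clock milliseconds per check.
pub fn millis_per_check(total_secs: f64, checks: u32) -> f64 {
    total_secs * 1000.0 / f64::from(checks)
}

/// Average checks completed per second.
pub fn checks_per_sec(total_secs: f64, checks: u32) -> f64 {
    f64::from(checks) / total_secs
}
```

For 12 users, 8 actions, and 26 resources over 280.86s, this yields 2,496 checks, roughly 8.9 checks per second, or about 112ms each.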

Out of curiosity, I went to look at what Oso functions we see in the stack traces:

$ grep Oso ../role-test-stacks-demangled.out | sort | uniq -c | sort -n
   1               omicron_nexus-90c52618a6169540`core::ptr::drop_in_place<alloc::sync::Arc<dyn core::ops::function::Fn<(&oso::host::Host,&oso::host::class::Instance)>+Output = core::result::Result<oso::host::to_polar::PolarIterator,oso::errors::OsoError>+core::marker::Sync+core::marker::Send>>::h47af701b96b7388f+0x11
   1               omicron_nexus-90c52618a6169540`core::ptr::drop_in_place<oso::errors::OsoError>::h8cd859003d5efa78+0x21
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h18710b9cd24a786f+0x1ea
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h61ff9676031ca5bd+0x1ea
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::hf79d2cd23a698acd+0x2a2
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::ha5ee00fb11471a76+0x19e
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::ha5ee00fb11471a76+0x41e
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::ha5ee00fb11471a76+0xe0
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::haf578c0609bff9c6+0x101
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hcce1b32e95043877+0x69
   1               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hd9b4433cc238d01a+0xc2
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h54d9946422a4c23b+0x1ea
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::hf79d2cd23a698acd+0x1ea
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::h0c23daa132460bfc+0x3be
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::haf578c0609bff9c6+0x400
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::haf578c0609bff9c6+0x69
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hcce1b32e95043877+0x3be
   2               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::{{closure}}::h01d43ee439853215+0x2a
   3               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h18710b9cd24a786f+0xda
   3               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::h0c23daa132460bfc+0x69
   3               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hd9b4433cc238d01a+0x17d
   3               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hd9b4433cc238d01a+0x3be
   4               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::hf79d2cd23a698acd+0xda
   4               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::hfd2f76ecee18de19+0x3be
   5               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h20e4e2ab00edc51f+0xda
   5               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h61ff9676031ca5bd+0xda
   6               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h20e4e2ab00edc51f+0x1ea
   6               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h45cfacf8173a302b+0x267
   7               omicron_nexus-90c52618a6169540`core::ptr::drop_in_place<core::result::Result<oso::query::ResultSet,oso::errors::OsoError>>::h35abe4bd17133a70+0x23
   7               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h54d9946422a4c23b+0xda
   9               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::ha5ee00fb11471a76+0x69
  15               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h45cfacf8173a302b+0x1af
  27               omicron_nexus-90c52618a6169540`oso::oso::Oso::query_rule::ha5ee00fb11471a76+0x3df
  39               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h45cfacf8173a302b+0x9f
 202               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h45cfacf8173a302b+0xcc
 530               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h20e4e2ab00edc51f+0x107
 937               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h61ff9676031ca5bd+0x107
 967               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h18710b9cd24a786f+0x107
1206               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::hf79d2cd23a698acd+0x107
1435               omicron_nexus-90c52618a6169540`oso::oso::Oso::is_allowed::h54d9946422a4c23b+0x107

That adds up to 5448, which convinces me further that most of our stacks do have Oso on the scene.

Here's the raw data: role-stacks-data.zip.

@jgallagher
Contributor

Try running the tests in release mode and see if they're any faster. I'd be surprised if this reduced the net total time since we're only running the test program once per compile the vast majority of the time.

I've done this in the past for tests that did a lot of number crunching; I've seen a speedup on the order of 100x in at least one case (some data science matrix math stuff). On that project we ended up enabling optimizations in debug builds to avoid the pitfall of "I ran cargo test but forgot --release" leading to unreasonably long test times.

I don't know what oso is doing under the hood, but on my machine there's about a 10x difference in this test for debug vs release:

% cargo test -p omicron-nexus --lib -- test_iam_roles
     Finished test [unoptimized + debuginfo] target(s) in 0.25s
     Running unittests src/lib.rs (target/debug/deps/omicron_nexus-4042e9c34cfd993f)

running 1 test
test authz::policy_test::test_iam_roles has been running for over 60 seconds
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 388.33s
% cargo test --release -p omicron-nexus --lib -- test_iam_roles
    Finished release [optimized] target(s) in 0.25s
     Running unittests src/lib.rs (target/release/deps/omicron_nexus-3c649c30c0ef4afd)

running 1 test
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 37.31s

@davepacheco
Collaborator Author

@jgallagher thanks for that data! What I meant was: given how much longer it can take to compile things for release, would it wind up being a net win? I didn't realize how easy it was to try, and with a 10x win on runtime, seems worth trying. I'll test this now.

@davepacheco
Collaborator Author

Having never used --release in this workspace before:

$ time cargo test --release -p omicron-nexus --lib test_iam_roles
...
    Finished release [optimized] target(s) in 9m 06s
     Running unittests src/lib.rs (target/release/deps/omicron_nexus-0d61ab037044f24f)

running 1 test
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 28.41s


real    9m35.483s
user    67m14.728s
sys     4m32.658s

So that was 9m6s build time plus 28s runtime. More interesting is probably the incremental rebuild case. From that state:

$ time cargo test --release -p omicron-nexus --lib test_iam_roles
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron-auth/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron-auth/nexus/test-utils)
    Finished release [optimized] target(s) in 6m 05s
     Running unittests src/lib.rs (target/release/deps/omicron_nexus-0d61ab037044f24f)

running 1 test
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 28.26s


real    6m33.714s
user    22m12.150s
sys     0m37.857s

At that point, if I do a cargo test -p omicron-nexus --lib test_iam_roles to get the build to the same state, then touch the file again, then run it again, it takes:

$ time cargo test -p omicron-nexus --lib test_iam_roles
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron-auth/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron-auth/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 38s
     Running unittests src/lib.rs (target/debug/deps/omicron_nexus-90c52618a6169540)

running 1 test
test authz::policy_test::test_iam_roles has been running for over 60 seconds
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 302.86s


real    6m41.359s
user    6m34.500s
sys     0m38.517s

So the incremental rebuild case is very slightly better with --release, but only very slightly. I'm not sure how this might differ on other people's machines. I gather mine is a bit beefier than some others, especially the CI ones. Although for CI we only really care about full builds.

@jgallagher
Contributor

Ouch. I wonder if there's a middle ground somewhere by fiddling with the different options in https://doc.rust-lang.org/cargo/reference/profiles.html. I kinda doubt it, and if there is I worry about fragility, but I'll try a couple "obvious" combinations and see what the times look like on my machine.

@davepacheco
Collaborator Author

For reference: from cargo clean, a full cargo test (on this branch, so with this extra test) on my machine took:

$ time cargo test
...
   Compiling omicron-deploy v0.1.0 (/home/dap/omicron-auth/deploy)
    Finished test [unoptimized + debuginfo] target(s) in 7m 54s
...
real    14m45.738s
user    73m32.556s
sys     13m9.026s

So that's 14m45s, of which just over half was build time. I'm going to try with --release next.

@davepacheco
Collaborator Author

From cargo clean, a full cargo test --release takes:

$ time cargo test --release
...
    Finished release [optimized] target(s) in 13m 19s
     Running unittests src/lib.rs (target/release/deps/authz_macros-ab606f9909e80a5f)
...
real    15m11.469s
user    162m56.369s
sys     10m36.408s

The build took a fair bit longer, the tests ran faster, and in total it took a little more wall-clock time. It's a lot more CPU time so on less beefy machines I'd expect this to take quite a lot longer.

@jgallagher
Contributor

Would you mind retrying on your machine with this diff to Cargo.toml to enable "basic optimizations" in debug/test builds?

@@ -61,6 +61,7 @@ resolver = "2"

 [profile.dev]
 panic = "abort"
+opt-level = 1

 [profile.release]
 panic = "abort"

On my machine, a clean build + this test without that diff:

% cargo clean
% time cargo test -p omicron-nexus --lib test_iam_roles
... snip ...
    Finished test [unoptimized + debuginfo] target(s) in 4m 10s
     Running unittests src/lib.rs (target/debug/deps/omicron_nexus-4042e9c34cfd993f)

running 1 test
test authz::policy_test::test_iam_roles has been running for over 60 seconds
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 434.83s

cargo test -p omicron-nexus --lib test_iam_roles  1397.79s user 107.14s system 219% cpu 11:25.70 total

And with the diff:

% cargo clean
% time cargo test -p omicron-nexus --lib test_iam_roles
... snip ...
    Finished test [optimized + debuginfo] target(s) in 6m 47s
     Running unittests src/lib.rs (target/debug/deps/omicron_nexus-3d7473c8a565ed97)

running 1 test
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 49.02s

cargo test -p omicron-nexus --lib test_iam_roles  2531.35s user 140.94s system 584% cpu 7:36.99 total

The build time goes up ~60%, but the test time reduction more than makes up for that. I'll follow up with a full repo build+test on my machine and post those results shortly.
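
To make the tradeoff explicit, here is the arithmetic behind that statement as a sketch. The `Run` type and function names are hypothetical; the inputs are the build and test times reported in the two transcripts above (4m10s and 434.83s without the diff, 6m47s and 49.02s with it).

```rust
/// Build and test wall-clock times for one clean run, in seconds.
#[derive(Debug, Clone, Copy)]
pub struct Run {
    pub build_secs: f64,
    pub test_secs: f64,
}

impl Run {
    pub fn total(&self) -> f64 {
        self.build_secs + self.test_secs
    }
}

/// How many times longer the build takes after the change.
pub fn build_slowdown(before: &Run, after: &Run) -> f64 {
    after.build_secs / before.build_secs
}

/// Net seconds saved per clean build-and-test cycle (negative means slower).
pub fn net_saving_secs(before: &Run, after: &Run) -> f64 {
    before.total() - after.total()
}
```

With `before = Run { build_secs: 250.0, test_secs: 434.83 }` and `after = Run { build_secs: 407.0, test_secs: 49.02 }`, the build is about 1.63x slower but the cycle as a whole is roughly 229 seconds shorter. This only models a clean build followed by a single run of this one test; it says nothing about incremental rebuilds.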

@davepacheco
Collaborator Author

With that diff, running what you ran:

dap@ivanova omicron-auth $ cargo clean
dap@ivanova omicron-auth $ time cargo test -p omicron-nexus --lib test_iam_roles
...
    Finished test [optimized + debuginfo] target(s) in 9m 44s
     Running unittests src/lib.rs (target/debug/deps/omicron_nexus-a11c97039aa75185)

running 1 test
test authz::policy_test::test_iam_roles ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 86 filtered out; finished in 32.14s


real    10m16.506s
user    76m25.770s
sys     5m27.777s

Caveat: this isn't comparable to what I tested above though because the selection of the specific omicron-nexus package affects the Cargo features that are used when building dependencies. I'm going to run full builds with this to compare them.


Another consideration: I rebuild much more often than I run tests. I don't know about others. I'm not sure a tradeoff of build time for test run time is worth it.


This is a good discussion and I think it's worth spinning it out into a separate issue. As far as this test goes, my inclination is to include this test and potentially iterate on the build and test times unless I hear feedback that people would rather not do that. My thinking: tests are important, and there are tools for limiting which tests you run with each invocation, and in the meantime one can always run with --release if that's better for their workflow.

@davepacheco
Collaborator Author

With profile.dev.opt-level = 1, otherwise from this branch (so with this test):

dap@ivanova omicron-auth $ cargo clean
dap@ivanova omicron-auth $ time cargo test
...
    Finished test [optimized + debuginfo] target(s) in 14m 16s
...
real    16m16.135s
user    174m6.259s
sys     13m21.759s

So it looks like the build takes 80% longer (1.8x the time) and the total time is 10% longer (1.1x the time).

@davepacheco
Collaborator Author

I'm planning to resurrect this PR and hope to land it soon. The runtime is down to 53s on my machine. I think this was after @plotnick convinced me to try parallelizing the test.

@davepacheco
Collaborator Author

Okay, this is cleaned up quite a bit and better documented now. It also includes a coverage check with an exemption list so that we can avoid accidentally not testing future resources. (There are still quite a few resources we could test here but aren't.)
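
The idea behind a coverage check with an exemption list can be sketched like this. It is an illustration under assumptions, not the code in this PR: `untested_classes` is a hypothetical name, and the class lists are passed in as plain strings.

```rust
use std::collections::BTreeSet;

/// Returns the classes that are neither exercised by the test nor explicitly
/// exempted. A non-empty result would fail the coverage check.
pub fn untested_classes(all: &[&str], tested: &[&str], exempted: &[&str]) -> Vec<String> {
    let covered: BTreeSet<&str> = tested.iter().chain(exempted.iter()).copied().collect();
    all.iter()
        .copied()
        .filter(|name| !covered.contains(name))
        .map(String::from)
        .collect()
}
```

A newly added resource type would show up in the returned list until someone either adds it to the test or deliberately lists it as exempt, so skipping it becomes an explicit decision.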

@davepacheco davepacheco marked this pull request as ready for review August 11, 2022 19:00
@davepacheco davepacheco requested a review from plotnick August 11, 2022 20:46
@davepacheco
Collaborator Author

tl;dr: This test seemed to reliably hang on Linux due to osohq/oso#1592. I've updated this branch to pull in Oso 0.26.2 and the test seems to pass reliably now.

Details: the buildomat Ubuntu jobs for this PR and #1580 (which is based on this one) kept timing out after several hours. I feared this was some pathological performance problem since this test does take a while, but that was unfounded. This was reproducible on at least three other Linux systems (thanks @jgallagher and @plotnick), two Ubuntu and one Debian. In all three cases, we found the test stopped using CPU at some point partway through (how far it got varied from run to run), suggesting a different kind of hang. The last log entries looked like this, when we had them:

[2022-08-12T16:41:25.124804803Z] TRACE: test_iam_roles/3397479 on oxidian: loading roles (username=fleet-admin, user_id=f2c0d751-1c11-45dd-9fb9-6d105d4ed583, resource_id=80031baf-c8db-41c2-9842-a5e1db498df5, resource_type=Project)
    actor: Actor::SiloUser { silo_user_id: f2c0d751-1c11-45dd-9fb9-6d105d4ed583, silo_id: c655ac38-4d06-44ff-a503-435bf969df4f, .. }
[2022-08-12T16:41:25.125062942Z] TRACE: test_iam_roles/3397479 on oxidian: authorize begin (username=fleet-admin, user_id=f2c0d751-1c11-45dd-9fb9-6d105d4ed583, resource=Database, action=Query)
    actor: Some(Actor::SiloUser { silo_user_id: f2c0d751-1c11-45dd-9fb9-6d105d4ed583, silo_id: c655ac38-4d06-44ff-a503-435bf969df4f, .. })
[2022-08-12T16:41:25.12538266Z] DEBUG: test_iam_roles/3397479 on oxidian: roles (username=fleet-admin, user_id=f2c0d751-1c11-45dd-9fb9-6d105d4ed583, roles="RoleSet { roles: {} }")

Not surprisingly, we seemed to be stuck in authorization (since this is all this test does). At this point I suspected a problem with the database queries to load roles, but CockroachDB was behaving normally. And in retrospect, the "roles" message above is logged after all database queries and shortly before we enter Oso for the actual policy check.

@jgallagher attached with gdb and observed that all the tokio runtime threads were blocked in std::sync::rwlock, "looks like 1 is trying to acquire for writing and the others are reading". The one trying to take the write lock is this one:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000056520c17ad05 in std::sys::unix::futex::futex_wait () at library/std/src/sys/unix/futex.rs:36
#2  0x000056520c17eb0c in std::sys::unix::locks::futex_rwlock::RwLock::read_contended () at library/std/src/sys/unix/locks/futex_rwlock.rs:136
#3  0x000056520aac7638 in std::sys::unix::locks::futex_rwlock::RwLock::read (self=0x7f087008b4b0) at /rustc/082e4ca49770ebc9cb0ee616f3726a67471be8cb/library/std/src/sys/unix/locks/futex_rwlock.rs:85
#4  0x000056520a8dfd8a in std::sys_common::rwlock::MovableRwLock::read (self=0x7f087008b4b0) at /rustc/082e4ca49770ebc9cb0ee616f3726a67471be8cb/library/std/src/sys_common/rwlock.rs:84
#5  0x000056520aac7843 in std::sync::rwlock::RwLock<T>::read (self=0x7f087008b4b0) at /rustc/082e4ca49770ebc9cb0ee616f3726a67471be8cb/library/std/src/sync/rwlock.rs:200
#6  0x000056520a996220 in polar_core::vm::PolarVirtualMachine::rename_rule_vars (self=0x7f087485cd18, rule=0x7f08700585c0) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/vm.rs:750
#7  0x000056520a9c4913 in polar_core::vm::PolarVirtualMachine::filter_rules (self=0x7f087485cd18, applicable_rules=0x7f0840009ba0, unfiltered_rules=0x7f0840009bb8, args=0x7f0840009b88)
    at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/vm.rs:2527
#8  0x000056520a99197a in polar_core::vm::PolarVirtualMachine::next (self=0x7f087485cd18, goal=...) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/vm.rs:508
#9  0x000056520a9c9ef8 in <polar_core::vm::PolarVirtualMachine as polar_core::runnable::Runnable>::run (self=0x7f087485cd18) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/vm.rs:2871
#10 0x000056520aa01ce1 in polar_core::query::Query::next_event (self=0x7f087485cd00) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/query.rs:39
#11 0x000056520aa02787 in <polar_core::query::Query as core::iter::traits::iterator::Iterator>::next (self=0x7f087485cd00) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/polar-core-0.26.1/src/query.rs:121
#12 0x000056520a8373f3 in oso::query::Query::next_result (self=0x7f087485cd00) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/oso-0.26.1/src/query.rs:40
#13 0x000056520a8371f7 in <oso::query::Query as core::iter::traits::iterator::Iterator>::next (self=0x7f087485cd00) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/oso-0.26.1/src/query.rs:14
#14 0x00005652082532ad in oso::oso::Oso::is_allowed (self=0x7f087022c460, actor=..., action=omicron_nexus::authz::oso_generic::Action::Query, resource=...) at /home/john/.cargo/registry/src/github.com-1ecc6299db9ec823/oso-0.26.1/src/oso.rs:79
#15 0x0000565208b08be5 in omicron_nexus::authz::context::Authz::is_allowed (self=0x7f087022c460, actor=0x7f087485dc00, action=omicron_nexus::authz::oso_generic::Action::Query, resource=0x7f0840008350) at nexus/src/authz/context.rs:51

John found that lock acquisition is here:
https://docs.rs/polar-core/0.26.1/src/polar_core/polar.rs.html#212

and it's been changed in 0.26.2 to a read lock:
https://docs.rs/polar-core/0.26.2/src/polar_core/polar.rs.html#212

hence we think this is osohq/oso#1592. Updating this branch to 0.26.2 caused the test to pass on my system and John's system as well.
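
As background on why mixing read and write acquisitions of one `std::sync::RwLock` can hang, here is a small non-blocking probe. It is a general illustration of the lock's behavior, not a reproduction of the Oso bug, and `probe_while_reading` is a hypothetical name.

```rust
use std::sync::{RwLock, TryLockError};

/// While one read guard is held, reports whether a second read succeeds and
/// whether a write attempt would block. Uses only try_* calls so it never hangs.
pub fn probe_while_reading(lock: &RwLock<u32>) -> (bool, bool) {
    let _first = lock.read().unwrap();
    // With no writer waiting, a second shared read is granted.
    let second_read_ok = lock.try_read().is_ok();
    // A writer cannot get in while any reader holds the lock.
    let write_blocked = matches!(lock.try_write(), Err(TryLockError::WouldBlock));
    (second_read_ok, write_blocked)
}
```

The standard library documents that the lock's priority policy depends on the platform and that a waiting writer may block new readers. Under such a policy, a thread that already holds a read guard and asks for another can wait behind the writer, while the writer waits for that thread's first guard to be released. That is one plausible shape for a hang that shows up on Linux but not elsewhere; whether it is exactly what happened here is an assumption.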

There are some follow-ups here:

Contributor

@plotnick plotnick left a comment

This looks fantastic now! The test_iam_roles_behavior test time is down to ~79 seconds on my 8-core machine, and compiling Oso with opt-level = 3 only reduces that by a few seconds. So I think we're now ok to run this test by default in CI. The safety net this test brings to the whole authz sub-system seems well worth the (now) modest cost, though we should definitely keep an eye on runtime as we work to shrink the exempted class list.

I don't have any specific comments on the code. The new comments are extremely helpful, the resource builder code and authz runners are straightforward, and the test output table is amazing (although the formatting does get slightly wonky in fonts where the ✔ character is double-width).

Thanks for tackling this and persevering to drive it over the line!

@davepacheco davepacheco merged commit d1fbdd2 into main Aug 12, 2022
@davepacheco davepacheco deleted the authz-policy-test branch August 12, 2022 22:00
leftwo pushed a commit that referenced this pull request Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110)
Move `FramedWrite` work to a separate task (#1145)
Use fewer borrows in ExtentInner API (#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128)
Ignore client messages after stopping the IO task (#1126)
Move client IO task into a struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623)
PHD: fix `cargo xtask phd` tidy not doing anything (#630)
PHD: add documentation for `cargo xtask phd` (#629)
standalone: improve virtual device creation errors (#632)
phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)
3 participants