Skip to content

[DRAFT] Add support for Silo groups #1358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Aug 4, 2022
Merged

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Jul 6, 2022

Add Silo groups, and the ability to grant them roles.

Add the necessary logic to:

  • provision an "admin" silo group when a silo is created, which is
    granted silo admin role.
  • after successful authentication, create groups during silo user
    provision if the Silo's provision type is JIT.
  • add a group's roles to a user's role set if they're part of that
    group.

Silos now have an optional admin_group_name that is configured at silo
provision time. If this is left out, users will currently have no way to
be granted roles when they first log in. In the future, this may be
selected and groups would be created another way.

SAML identity providers now have an optional group_attribute_name that
configures what attribute represents a group name.

Add Silo groups, and the ability to grant them roles.

Add the necessary logic to:

- provision an "admin" silo group when a silo is created, which is
  granted silo admin role.
- after successful authentication, create groups during silo user
  provision if the Silo's provision type is JIT.
- add a group's roles to a user's role set if they're part of that
  group.

Silos now have an optional admin_group_name that is configured at silo
provision time. If this is left out, users will currently have no way to
be granted roles when they first log in. In the future, this may be
selected and groups would be created another way.

SAML identity providers now have an optional group_attribute_name that
configures what attribute represents a group name.
@jmpesp jmpesp requested a review from davepacheco July 6, 2022 03:12
@jmpesp
Copy link
Contributor Author

jmpesp commented Jul 8, 2022

This is ready for review now :)

@davepacheco davepacheco mentioned this pull request Jul 8, 2022
69 tasks
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I gather we're also going to need an API to list groups in the Silo, groups that you're a member of, and maybe members of a group that you're in. I'm not saying that has to be in this PR.

I gather that like users, we're going to identify groups in the API by their uuid and treat the external id like the user's display name -- that's what we show everywhere, but not what we use as the id.

silo_user_id
);

CREATE INDEX ON omicron.public.silo_group_membership (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need (silo_user_id, silo_group_id) here. It's a little weird to have two indexes whose columns are the same except for the order, but I don't see a way around it. We want to paginate the users in a group as well as the groups that a user is in and I think we need two indexes to do that.

) -> ListResultVec<SiloGroupMembership> {
opctx.authorize(authz::Action::ListChildren, authz_silo).await?;

use db::schema::silo_group_membership::dsl;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to want to paginate this.

If we make the other change I suggested earlier, you won't need this function. I think it would be really useful to have APIs for a user to list their own groups and to list the users in a group. Those should be paginated.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working through more details here but I wanted to leave this while I have it.

);

CREATE UNIQUE INDEX ON omicron.public.silo_group (
name,
CREATE INDEX ON omicron.public.silo_group (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this index for? (I'm generally suspicious of non-unique indexes because they can't be used for lookup and they can't be used for pagination when the page size is smaller than the number of duplicate rows in the index. I see that we're doing this a lot though and I filed #1497.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove it, I see a full table scan error for

DELETE  FROM "silo_group_membership" WHERE ("silo_group_membership"."silo_group_id" = ANY(SELECT "silo_group"."id" FROM "silo_group" WHERE (("silo_group"."silo_id" = $1) AND ("silo_group"."time_deleted" IS NOT NULL))))

meaning it's coming from the silo delete code:

        let updated_rows =
            diesel::delete(silo_group_membership::dsl::silo_group_membership)
                .filter(
                    silo_group_membership::dsl::silo_group_id.eq_any(
                        silo_group::dsl::silo_group
                            .filter(silo_group::dsl::silo_id.eq(id))
                            .filter(silo_group::dsl::time_deleted.is_not_null())
                            .select(silo_group::dsl::id),
                    ),
                )
                .execute_async(self.pool_authorized(opctx).await?)
                .await
                .map_err(|e| {
                    public_error_from_diesel_pool(
                        e,
                        ErrorHandler::NotFoundByResource(authz_silo),
                    )
                })?;

but there's also the silo group delete code:

                diesel::update(dsl::silo_group)
                    .filter(dsl::id.eq(group_id))
                    .filter(dsl::time_deleted.is_null())
                    .set(dsl::time_deleted.eq(Utc::now()))
                    .execute(conn)?;

which also would use this index.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. When those queries become paginated in the future, they're going to need the id column. I'd suggest making this a UNIQUE index on (silo_id, id). I don't think it makes any difference right now. But we've done that elsewhere for pagination and it'll avoid having to do a schema migration later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the indexes on silo_group are:

CREATE INDEX ON omicron.public.silo_group (
    silo_id
) WHERE
    time_deleted IS NOT NULL;

CREATE UNIQUE INDEX ON omicron.public.silo_group (
    silo_id,
    external_id
) WHERE
    time_deleted IS NULL;

With the column name change, I'm a little confused - are you suggesting changing the first one to UNIQUE on (silo_id, id), or (silo_id, external_id)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I missed a few things here, especially that the time_deleted condition is inverted between these two indexes. This index here is on time_deleted IS NOT NULL, so it includes only deleted silo groups.

I see the usage you mentioned in DataStore::silo_delete(). That query indeed specifies time_deleted IS NOT NULL, so it matches this index. But it looks to me like both might be wrong? At that point in silo_delete(), we haven't deleted the silo groups yet, right? That looks like the next thing we do. So it seems like the "delete all silo group memberships" section will only delete memberships from silo groups that were already deleted before the Silo was deleted. (Sorry if I'm misreading the code again!)

If the right thing is to flip the time_deleted condition in the query, then you'd want to flip the condition in this index too, and then it really should be redundant with the other index. So you should be able to just remove this one. (If I'm misreading the code and we do want to filter only on the deleted ones, then I'd suggest instead dropping the time_deleted IS NULL condition on the other index and removing this one.)

The second usage you mentioned (silo group delete code) is specifying time_deleted IS NULL, so it cannot use this index (but should use the other index).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that looks like a bug. We delete silo group memberships before silo groups, meaning time_deleted is null. I have no idea how this passes the restriction on full table scans...?

The only theory that makes sense to me is there wasn't a test for deleting a silo that contained a user, group, and group memberships, meaning the number of rows was zero. When I added one, the test failed because group memberships remained, not because of the full table scan restriction.

Commit 6e79986 changes the index, the filter in silo_delete, and adds the test I'm talking about.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed the indexes again offline and I'll summarize here. I'd suggested just one unique, partial index here. This triggered an unexpected error on the INSERT ... ON CONFLICT. I filed #1545 to describe this and our subsequent research. In conclusion I think there are two okay options here:

  1. Keep the single unique partial index and use on_conflict_do_nothing(). (@jmpesp I only discovered this option after we talked about this.) We may want to add a test for trying to ensure the same group twice?
  2. Keep the single unique partial index and drop the on_conflict part of the insert. The problem with this is that it means that if two users log in at about the same time, both in some IdP group that we've never seen before, one of them will fail spuriously. But if that user retries, it will work. That's a lot of conditions, the impact is small, and there's a trivial workaround (retry). It'd be nice to fix this but it's hard for me to justify asking to spend much time on this case right now.

Either is okay with me -- but we should do one of these. As it is right now, since we have a unique non-partial index, it would incorrectly prevent us from reusing the external_id of a deleted silo group.

This is my last worry about this change -- otherwise I think it's good to go!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in df37612, thanks for debugging with me :)

let mut group_role_assignments = dsl::role_assignment
.filter(dsl::identity_type.eq(IdentityType::SiloGroup))
.filter(dsl::identity_id.eq_any(
silo_group_membership::dsl::silo_group_membership
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will work okay. It's a little different than what I'd suggested, which was:

SELECT r.* from role_assignment r, silo_group_membership m WHERE
    m.silo_user_id == $1 AND
    r.actor_type == 'silo_group' AND
    r.actor_id == m.silo_group_id AND
    r.resource_type == $2 AND
    r.resource_id = $3

This looks more like:

SELECT * from role_assignment WHERE
    r.identity_type == 'silo_group' AND
    r.identity_id IN (SELECT silo_group_id FROM silo_group_membership WHERE silo_user_id = $1) AND
    r.resource_type == $2 AND
    r.resource_id = $3

This uses a subquery instead of a join. The main downside is that the database will materialize the full list of a user's groups and check that set against all the matching role assignments. The join might also have to do that but I think the database would have more freedom to choose the best approach (e.g., if there were a large number of role assignments and few groups, it could choose instead to enumerate the groups and check each one for a matching role assignment). I assume group membership will be small for the foreseeable future and this won't be a problem.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I really like the new transaction for silo create. Thanks for doing that. There's just 1-2 things here that I think are worth doing (the index, the "limit 1", and maybe the "on conflict" change? but that case is pretty obscure).

);

CREATE UNIQUE INDEX ON omicron.public.silo_group (
name,
CREATE INDEX ON omicron.public.silo_group (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. When those queries become paginated in the future, they're going to need the id column. I'd suggest making this a UNIQUE index on (silo_id, id). I don't think it makes any difference right now. But we've done that elsewhere for pagination and it'll avoid having to do a schema migration later.

}
}

impl From<&Actor> for db::model::IdentityType {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, why the change from actor_type() to a From impl? It seems the same signature (&Actor -> db::model::IdentityType) and implementation. (I know I'd suggested that we use a separate return type for actor_type() that would be an enum of just UserBuiltin and SiloUser (but not SiloGroup). This change doesn't change that, which is fine. I was wondering if there was some other reason you preferred From to a function.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In situations where a section of code has an Actor, I think this reduces the possibility that code can ask the wrong question with match, like if the Actor is somehow a SiloGroup. Previously, the code matched on the result of a function call:

match actor.actor_type() {
    SiloGroup => { ... }
}

Here it seems like a natural question to see if an actor is a group, which doesn't make sense because the Actor enum doesn't contain that variant.

With the From, the user would have to type something like

let actor_type: db::model::IdentityType = actor.into();
match actor_type {
    SiloGroup => { ... }
}

or

match actor.into::<db::model::IdentityType>() {
    SiloGroup => { ... }
}

Which is explicitly asking questions about the db identity type, not what actor is - it's much more natural to type match actor { instead.

// are logging in at the same time, and both are part of IdP groups that
// do not yet exist in our database.
//
// Currently there is a unique partial index on silo_group, which
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the Diesel we're using did support this now. Is that not right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part of the comment refers to "diesel does not support creating queries that contain ON CONFLICT (...) WHERE ... DO NOTHING.

I looked at #1545 and saw that

.on_conflict((dsl::silo_id, dsl::external_id))
.filter_target(dsl::time_deleted.is_null())
.do_nothing()

was suggested, and it works! 643aadb changes this and removes the comment.

@jmpesp jmpesp marked this pull request as ready for review August 4, 2022 21:22
@jmpesp jmpesp merged commit c08906c into oxidecomputer:main Aug 4, 2022
@jmpesp jmpesp deleted the silo_groups branch August 4, 2022 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants