Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auth: Fix missing snapshots and backups from storage pool used-by URLs #14324

Merged
merged 18 commits into from
Nov 20, 2024

Conversation

markylaing
Copy link
Contributor

@markylaing markylaing commented Oct 22, 2024

The underlying cause of this bug was that general filtering of used-by URLs makes the assumption that the can_view entitlement is available for all entity types. It is a fair assumption, but wasn't true for storage volume or instance backups or snapshots.

To fix this, four new entity types have been added to the authorization model:

  • instance_backup
  • instance_snapshot
  • storage_volume_backup
  • storage_volume_snapshot

Each has associated entitlements:

  • can_edit
  • can_view
  • can_delete

It is still not possible to grant these entitlements via the API. Instead, they are granted via can_manage_snapshots or can_manage_backups on the associated instance or storage volume.

The OpenFGADatastore implementation has been updated to handle instance and storage_volume relations between the parent and it's snapshots/backups.

  • Update OpenFGADatastore comments - it says an instance is not a relation but after this PR it is.

Closes #14291

@markylaing markylaing added the Bug Confirmed to be a bug label Oct 22, 2024
@markylaing markylaing added this to the lxd-6.2 milestone Oct 22, 2024
@markylaing markylaing self-assigned this Oct 22, 2024
@markylaing markylaing requested a review from tomponline October 22, 2024 12:56
@github-actions github-actions bot added the Documentation Documentation needs updating label Oct 22, 2024
Copy link

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@markylaing
Copy link
Contributor Author

CC @mas-who @edlerd

@tomponline
Copy link
Member

tests are sad

@markylaing
Copy link
Contributor Author

markylaing commented Oct 23, 2024

@tomponline tests are mostly green except for one: https://github.com/canonical/lxd/actions/runs/11463430112/job/31913802942#step:12:52804

I'm not certain why this is failing as it doesn't seem to have anything to do with this PR. It is potentially related to #14315 since lxc profile assign calls PUT /1.0/instances/{name} which does some Profile.ToAPI work. I also don't understand why it only failed with the dir storage backend.

Edit: Note that this also doesn't fail locally. I'll have to get a tmate session running.

@tomponline
Copy link
Member

tomponline commented Oct 23, 2024

I'm not certain why this is failing as it doesn't seem to have anything to do with this PR. It is potentially related to #14315 since lxc profile assign calls PUT /1.0/instances/{name} which does some Profile.ToAPI work. I also don't understand why it only failed with the dir storage backend.

@hamistao please can you check this out, thanks

Seems like a panic.

@markylaing
Copy link
Contributor Author

@tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.

@markylaing
Copy link
Contributor Author

@tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.

I've been investigating this for an hour or so with no progress. It would be very useful to surface panics in the test logs. I'm trying to figure out a way to do this.

@tomponline
Copy link
Member

@tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.

I've been investigating this for an hour or so with no progress. It would be very useful to surface panics in the test logs. I'm trying to figure out a way to do this.

Did you identify which commit introduced it yet?

Did you try reverting the earlier profiles PR?

@markylaing
Copy link
Contributor Author

Did you identify which commit introduced it yet?

Did you try reverting the earlier profiles PR?

With it being intermittent I didn't think reverting the profiles PR would tell me very much (e.g. I'll need to figure out where that panic is occurring in either case). I've added a commit to check LXD logs for panics. It's failing on standalone tests but not in the cluster tests which is a bit odd. Still investigating.

@markylaing markylaing force-pushed the used-by-bug branch 4 times, most recently from 4b56433 to bea0552 Compare October 25, 2024 10:22
@markylaing
Copy link
Contributor Author

I've re-run the test 8 times now and the panic only occurred on the first two runs. I've added a PR to handle panics a bit more cleanly in the future (#14346). If it happens again it should be obvious where it occurred.

@markylaing
Copy link
Contributor Author

Of course it fails again as soon as I move the panic checker work into another PR 🤦

@markylaing markylaing marked this pull request as draft October 25, 2024 12:26
@markylaing markylaing force-pushed the used-by-bug branch 5 times, most recently from 320408e to 6b8b335 Compare October 31, 2024 15:02
Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
The auth.ValidateEntitlement function validates all entitlements that
can be granted via the API. Since the new entitlements on snapshots and
backups cannot be granted via the API, this check fails.

The OpenFGA server will return an error if an invalid query is performed
based on it's own understanding of the authorization model.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
Previously the only entities that had inherited relations were project and
server. Now that we are linking instances and storage volumes to their
snapshots and backups, the OpenFGADatastore implementation needs to handle
these relations.

On Read, we can connect a snapshot or backup to its parent instance or
storage volume using the information stored in its URL. For example, the
storage volume backup URL:

/1.0/storage-pools/default/volumes/custom/vol1/backups/backup1?project=project1

is related to its parent:

/1.0/storage-pools/default/volumes/custom/vol1?project=project1

via the `storage_volume relation`.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
…tartingWithUser.

Previously the only entities that had inherited relations were project and
server. Now that we are linking instances and storage volumes to their
snapshots and backups, the OpenFGADatastore implementation needs to handle
these relations.

On ReadStartingWithUser, the function needs to return all backups or snapshots that
are related to a parent instance or storage volume. This is used in the `ListObjects`
call to the OpenFGA server, which is used by `(auth.Authorizer).GetPermissionChecker`.

To do this, I have naively queried for all snapshots or backups in the project, and
filtered out those that don't have the correct parent. This keeps the implementation
simple and makes use of `GetEntityURLs`, which performs as few queries as possible.
Further optimisation may be needed.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
We can now use the `can_view`, `can_edit`, and `can_delete` entitlements
with instance backups and snapshots. We should do this so that our checks
more accurately reflect the authorization model.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
The access handler was performing some logic to determine
the location of the storage volume for use in the access check.
This was based on whether the storage pool is remote, and if not,
the cluster member where the volume is located.

This commit removes that logic and adds a "location" field to
`storageVolumeDetails` so that it can be used in the handlers.
The logic for determining the location is modified to suit the call
site. It is only set when the pool is not remote.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
The storage volume snapshot and backup access handlers need to share
almost identical logic to the storage volume access handler. Including
getting the storage pool, understanding if the storage volume is located
on another cluster member, and so forth.

This commit parameterises the function so that it can be used by the
snapshot and backup entity types as well; creating and checking against
the correct URL when called.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
We can now check `can_view`, `can_edit`, and `can_delete` against
the backup/snapshot itself. We should do so to more accurately reflect
the authorization model.

Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
@markylaing markylaing force-pushed the used-by-bug branch 2 times, most recently from ca1786e to 16fc177 Compare November 18, 2024 10:59
test/main.sh Fixed Show fixed Hide fixed
test/main.sh Fixed Show fixed Hide fixed
@markylaing markylaing force-pushed the used-by-bug branch 2 times, most recently from a3a40ab to 0f45b26 Compare November 19, 2024 15:14
tomponline added a commit that referenced this pull request Nov 19, 2024
A suspected panic occurred in #14324. This happened twice but is
intermittent and does not occur locally. The test suite has now passed
for that PR 6/8 times.

To surface panics more cleanly when investigating the issue, I added a
python script to search LXD logs for a panic. Panics are logged at info
level because the net/http package has a default `recover()` when a
handler is run.

The script is run after each test suite to ensure the test fails if it
panics. The script is only run when `LXD_VERBOSE` or `LXD_DEBUG` are set
however. This is because the panic log entry is at info level, and is
not written if `--verbose` or `--debug` are unset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, what prevents this?

These entitlements cannot be assigned to identities, service accounts, or group members.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is enforced by openfga. Notice that normally an entitlement has [identity, service_account, group#member] next to it. This means that those types can have that relation. When generating the map of entity type to entitlement, that's what we look for. So auth.ValidateEntitlement is checking that the entitlement can be granted to a group member.

In this case we're defined the relations such that they can only be granted via the parent instance. In fact, no types can be directly related to an instance snapshot except for instances.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -134,11 +134,6 @@ func (e *embeddedOpenFGA) CheckPermission(ctx context.Context, entityURL *api.UR
return fmt.Errorf("Failed to parse entity URL: %w", err)
}

err = auth.ValidateEntitlement(entityType, entitlement)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a test for trying to assign an invalid entitlement? (to check openfga does indeed block this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a check for say:

! lxc auth group permission add instance_snapshot c1-snap can_view || false # Cannot be assigned via API

Or a check where we manipulate the auth_groups_permissions table to insert a permission that is invalid from the perspective of OpenFGA?

Signed-off-by: Mark Laing <mark.laing@canonical.com>
Signed-off-by: Mark Laing <mark.laing@canonical.com>
defer func() {
// Only set the location if the pool is not remote.
if details.pool.Driver() != nil && !details.pool.Driver().Info().Remote {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit, but you're calling .pool.Driver() twice here, would be more efficient to call it once and assign to a var locally right?

@markylaing markylaing marked this pull request as ready for review November 20, 2024 09:31
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great commit and code comments. Thanks!

@tomponline tomponline merged commit cd22ce3 into canonical:main Nov 20, 2024
10 of 11 checks passed
tomponline added a commit that referenced this pull request Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Confirmed to be a bug Documentation Documentation needs updating
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Snapshots missing in used_by for custom volumes and storage pools on latest/edge LXD build
2 participants