
Forbid updating a repository bucket/location if there are mounted searchable snapshots using it #89696

Open
fcofdez opened this issue Aug 29, 2022 · 19 comments
Labels: >bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete)

Comments

@fcofdez (Contributor) commented Aug 29, 2022

When an existing repository bucket/location is updated to point to an empty directory, the RepositoryMetadata is cleared, which generates a new repository UUID. If there are mounted searchable snapshots that use the previous repository, their shards cannot be allocated: the old data is not available in the new repository location, and the repository UUID recorded in the mounted searchable snapshot no longer matches. The following test reproduces the issue:

public class SearchableSnapshotsRepositoryIntegTests extends BaseFrozenSearchableSnapshotsIntegTestCase {

    public void testRepositoryUpdateWhileThereAreMountedSearchableSnapshots() throws Exception {
        // Register a repository and create an index with some documents
        final String repositoryName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);
        final Settings.Builder repositorySettings = randomRepositorySettings();
        createRepository(repositoryName, FsRepository.TYPE, repositorySettings);

        final String indexName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);
        createAndPopulateIndex(
            indexName,
            Settings.builder().put(INDEX_NUMBER_OF_SHARDS_SETTING.getKey(), 1).put(INDEX_SOFT_DELETES_SETTING.getKey(), true)
        );

        final TotalHits totalHits = internalCluster().client()
            .prepareSearch(indexName)
            .setTrackTotalHits(true)
            .get()
            .getHits()
            .getTotalHits();

        // Snapshot the index and mount it as a fully mounted searchable snapshot
        createSnapshot(repositoryName, "snap-1", List.of(indexName));

        String restoredIndexName = "fully-mounted-" + indexName;
        mountSnapshot(repositoryName, "snap-1", indexName, restoredIndexName, Settings.EMPTY, Storage.FULL_COPY);

        assertHitCount(client().prepareSearch(restoredIndexName).setTrackTotalHits(true).get(), totalHits.value);

        // Re-register the repository with different (random) settings, i.e. a new, empty location
        Settings.Builder settings = randomRepositorySettings();
        createRepository(repositoryName, FsRepository.TYPE, settings);

        // After a full restart the mounted index can no longer be allocated and the cluster never goes green
        internalCluster().fullRestart();
        ensureGreen(restoredIndexName);
    }
...
@fcofdez added the >bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v8.5.0 and v7.17.7 labels on Aug 29, 2022
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor) commented:

When an existing repository bucket/location is updated to point to an empty directory, the RepositoryMetadata is cleared, which generates a new repository UUID

That sounds like the desired behaviour? The repository UUID identifies the repository in terms of its contents, so if you change the repository contents then it deserves a new UUID. Really the same thing should happen if you swap the repository out from under ES by fiddling with symlinks or mounting a different filesystem too, but we don't have a good way to detect that so I think we'd report it as a repository corruption.

@kingherc (Contributor) commented Sep 5, 2022

If we allow it to continue to get a new UUID, what should happen to the mounted searchable snapshots that referred to the old UUID (and may not be valid anymore since the new repos may have different content)? Is it OK that they become Unassigned, and would we expect users to then unmount/delete them manually?

@DaveCTurner (Contributor) commented:

It's a bit tricky and I don't see a good general solution (i.e. one which covers extrinsic things like adjusting symlinks or mounting a different filesystem). I mean the data really just isn't there any more, so it does seem reasonable to fail the shards and require user intervention to fix things up.

I can see some value in preventing changes to the location of a repository while it contains mounted snapshots, but I can also think of cases where it might be useful to be able to update the location of a repository in this state too. For instance, maybe the user really wants to move the same repository elsewhere in their filesystem? Or for blob-store-like repositories, maybe they want to rename the underlying bucket (not possible in proper S3, but I think you can achieve this in MinIO etc).

Also what do we even mean by "location"? Certainly this includes the filesystem path for FS repositories and the bucket and base path for S3, but maybe also endpoint? protocol? Proxy settings? User credentials?
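For concreteness, a hedged sketch of the settings that could plausibly count as "location" for the repository types mentioned above; the keys are the documented fs and s3 repository settings (Settings here is org.elasticsearch.common.settings.Settings), the example values are hypothetical, and which of these should be treated as "location" is exactly the open question:

// Illustrative only: candidate "location" settings per repository type.
final class RepositoryLocationSettings {

    static Settings fsRepositoryLocation() {
        return Settings.builder()
            .put("location", "/mnt/backups/my_repo")    // fs: filesystem path of the repository
            .build();
    }

    static Settings s3RepositoryLocation() {
        return Settings.builder()
            .put("bucket", "my-bucket")                 // s3: bucket name
            .put("base_path", "snapshots/prod")         // s3: prefix within the bucket
            .put("client", "default")                   // s3: client whose endpoint, protocol, proxy and
                                                        //     credentials live in elasticsearch.yml / the keystore
            .build();
    }
}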

@DaveCTurner (Contributor) commented:

We could perhaps try and read the repository UUID from the new location and permit the change only if it is the same as the old one. But we wouldn't do that if the user said ?verify=false.
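A minimal standalone sketch of that check, with the UUIDs reduced to plain strings; all names below are illustrative, not Elasticsearch APIs:

// Only permit the settings change if the repository found at the new location
// reports the same UUID as the registered repository; skip when verify=false.
final class RepositoryLocationUpdateCheck {

    static void ensureSameRepositoryUuid(String registeredUuid, String uuidAtNewLocation, boolean verify) {
        if (verify == false) {
            return; // with ?verify=false we never read the new location, so no check is possible
        }
        if (registeredUuid.equals(uuidAtNewLocation) == false) {
            throw new IllegalStateException(
                "new location holds repository [" + uuidAtNewLocation + "] but [" + registeredUuid + "] is still in use"
            );
        }
    }

    public static void main(String[] args) {
        ensureSameRepositoryUuid("uuid-A", "uuid-A", true);   // allowed: same repository at a new location
        ensureSameRepositoryUuid("uuid-A", "uuid-B", false);  // allowed: check skipped by ?verify=false
        ensureSameRepositoryUuid("uuid-A", "uuid-B", true);   // rejected: different repository contents
    }
}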

@fcofdez (Contributor, Author) commented Sep 5, 2022

I can see some value in preventing changes to the location of a repository while it contains mounted snapshots, but I can also think of cases where it might be useful to be able to update the location of a repository in this state too. For instance, maybe the user really wants to move the same repository elsewhere in their filesystem? Or for blob-store-like repositories, maybe they want to rename the underlying bucket (not possible in proper S3, but I think you can achieve this in MinIO etc).

Is this scenario representative? I think it isn't. I would prefer to reject changing the repository settings (we can argue which ones) when there are mounted searchable snapshots; that would make users aware of potential issues with the change.

@kingherc (Contributor) commented Sep 5, 2022

I agree with @fcofdez that it's less confusing for users to prohibit changing a repository with mounted snapshots. I also agree with @DaveCTurner that there should be a way for somebody to "move" the repository without necessarily needing to unmount all snapshots and then re-mount them.

For the moment, in https://github.com/elastic/elasticsearch/blob/v8.4.0/server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java#L216 the canUpdateInPlace() path is taken only if the original and new repository (plus metadata) are equal. We could give users a flag to override this check and allow them to update a repository in place at their own risk (with possibly new metadata, pointing to a new location, but assuming the contents are the same). This would mean no change in UUID nor in generations. But it would be an explicit operation, and users would understand the implications (e.g., if the contents are changed, the repository will become corrupt). This would be a new feature, and probably deserves its own issue. I'm not sure this idea would work in general, but I leave it here for discussion!
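A rough paraphrase of that decision with a hypothetical force override; RepoConfig and the method below are simplified stand-ins for this discussion, not the actual RepositoriesService code:

// 'RepoConfig' and 'force' are hypothetical stand-ins, not Elasticsearch types.
final class InPlaceUpdateDecision {

    record RepoConfig(String type, String settings) {}

    static boolean updateInPlace(RepoConfig existing, RepoConfig requested, boolean force) {
        // Today: only an identical type + settings request is applied in place,
        // which keeps the repository UUID and generations.
        // Proposed: an explicit override keeps UUID/generations despite changed
        // settings, at the user's own risk (the contents must really be unchanged).
        return existing.equals(requested) || force;
    }
}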

@DaveCTurner (Contributor) commented:

This would mean no change in UUID nor in generations.

I think this isn't guaranteed - users can change the repository contents out from under us anyway, because they control the blob store/filesystem/whatever.

Is this scenario representative? I think it isn't.

It won't be that common, but nor do I think it rare enough that we can forbid it entirely. I'm ok with rejecting this kind of thing by default but I do think we should have an escape hatch. Maybe ?verify=false can bypass these checks, or maybe we need another API flag?

@kingherc (Contributor) commented Sep 5, 2022

Thinking a bit more @fcofdez and @DaveCTurner ...

I see that unregisterRepository() (code) already checks ensureNoSearchableSnapshotsIndicesInUse(), and it does not have a ?verify=false or force flag. I believe this is in line with making the default path for editing an existing repository also check ensureNoSearchableSnapshotsIndicesInUse().

Now about the escape hatch: to me, it means that the user can "move" or "re-register" a repository with different details, with the contents hopefully unchanged, so that they do not have to unmount all snapshots and re-mount them. For that to work, the new repository needs to have the same UUID, otherwise the searchable snapshots will become unassigned and complain. Now, the current behavior of PUT /_snapshot/my_repository is that it:

  • Either tries to update the repository in place (if it has the same details) and re-uses the same UUID/generations (at least what it has in the metadata)
  • Or creates a new UUID for the new repository details

The use case @DaveCTurner has in mind could fit the first bullet point, since it will attempt to use the same UUID. As long as the contents have not changed, it should in theory continue working (even if the user e.g. moved the repo from AWS to GCP). And now I wonder why the 2nd bullet point exists at all? The second bullet point could be done equivalently by deleting the repository and creating a new one with the same name. Thus, my belief is that PUT /_snapshot/my_repository should always do the first bullet point: update in place and re-use the same UUID/generations. We could discuss whether it also runs ensureNoSearchableSnapshotsIndicesInUse() when ?verify=true and skips it when ?verify=false. But the whole point of editing an existing repository would be to try to keep the same UUID/generations. Also, it seems to me that this use case deserves a new issue separate from this one.

What do you think?

@DaveCTurner (Contributor) commented:

The second bullet point could be done equivalently by deleting the repository and creating a new one with the same name.

That's a good point (aside: we should have an escape hatch to permit unregistering a repository even if it's in use, ignoring the consequences, but that's a separate issue). So let's focus on the case of having the same underlying data (i.e. the same repo UUID) at a different location. I believe we should check it does really have the same UUID by default (i.e. if ?verify=true). That seems sufficiently strong right? If so, we wouldn't need to do any other checks.

If ?verify=false then we can't check its contents match. Should we have extra checks in this case or should we just accept the risk?

@fcofdez (Contributor, Author) commented Sep 5, 2022

And now I wonder why the 2nd bullet point exists at all? The second bullet point could be done equivalently by deleting the repository and creating a new one with the same name

My understanding is that this behaviour is allowed to recover from corrupted repositories.

That's a good point (aside: we should have an escape hatch to permit unregistering a repository even if it's in use, ignoring the consequences, but that's a separate issue). So let's focus on the case of having the same underlying data (i.e. the same repo UUID) at a different location. I believe we should check it does really have the same UUID by default (i.e. if ?verify=true). That seems sufficiently strong right? If so, we wouldn't need to do any other checks.

I'm still not convinced that this operation is common enough to justify adding all this complexity to an already quite complex piece of our code. Currently we don't have a way to clone the repository contents into another bucket safely; I think we've discussed this in the past. But I might be missing some context.

@DaveCTurner (Contributor) commented:

Currently we don't have a way to clone the repository contents into another bucket safely

A repository backup would let you do this.

@kingherc (Contributor) commented Sep 5, 2022

My understanding is that this behaviour is allowed to recover from corrupted repositories.

Isn't that also possible by equivalently deleting the repository and creating a new one with the same name?

So far, I believe having the PUT /_snapshot/my_repository keep the same UUID/generations by default (and checking the repository files to make sure this happens) is the best solution. And if ?verify=false, we skip this check, and even the check that no searchable snapshots are there.

@fcofdez (Contributor, Author) commented Sep 5, 2022

And if ?verify=false, we skip this check, and even the check that no searchable snapshots are there.

In that case we'll be changing the current verify semantics, which is fine I guess.

My concern here is that we're somehow incentivising messing around with the underlying repository contents and that's been problematic in the past, but if we think the benefits outweigh the risks, we can proceed with this approach.

@fcofdez (Contributor, Author) commented Sep 5, 2022

A repository backup would let you do this.

But there's still room for improvement in that area: #54944

@kingherc (Contributor) commented Sep 5, 2022

In that case we'll be changing the current verify semantics, which is fine I guess.

The ?verify flag belongs to the PUT /_snapshot/my_repository API, not to the verify API itself. So we could have the PUT API both run the verification and check that the UUID/generations of the existing repository are there, and skip these checks if the ?verify flag is false. We can update the documentation of the PUT API to reflect that.
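For reference, a hedged sketch (in the style of the reproduction test at the top of this issue) of how that flag is set from Java; setVerify(false) corresponds to ?verify=false on the PUT API, and repositoryName plus the test helpers are assumed to exist in the surrounding test class:

    // Sketch only: re-register the repository while skipping verification (?verify=false).
    client().admin()
        .cluster()
        .preparePutRepository(repositoryName)
        .setType(FsRepository.TYPE)
        .setSettings(randomRepositorySettings().build())
        .setVerify(false)
        .get();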

My concern here is that we're somehow incentivising messing around with the underlying repository contents and that's been problematic in the past, but if we think the benefits outweigh the risks, we can proceed with this approach.

I share the concern. On a positive note, what we are discussing is allowing somebody to "move" a repository, but only move it, rather than mess with its contents. We should add this word of caution to the documentation of the PUT API when it is used on an existing repository.

But there's still room for improvement in that area: #54944

Thanks for the info! But at least repository backup seems to have sufficient wording as to how to manually take a safe backup.

@fcofdez (Contributor, Author) commented Sep 5, 2022

The ?verify flag belongs to the PUT /_snapshot/my_repository API, not to the verify API itself. So we could have the PUT API both run the verification and check that the UUID/generations of the existing repository are there, and skip these checks if the ?verify flag is false. We can update the documentation of the PUT API to reflect that.

I meant this section within the PUT /_snapshot/my_repository API docs.

@kingherc (Contributor) commented Sep 5, 2022

I meant this section within the PUT /_snapshot/my_repository API docs.

I see. I think that section is typically about new repositories, rather than about registering on top of an existing repository; I think that is the main difference. For existing repositories, we would extend the PUT API's verification process to check that the UUID/generations are there as well. I don't think we would do anything backwards-incompatible here, since the existing "verify" semantics stay the same, but we would need to extend the documentation on both pages (the one you mentioned, @fcofdez, and the PUT API page) to mention the additional check that the PUT API will do.

@tomcallahan (Contributor) commented:

I linked a support case here. I'm +1 on giving some way to change repository details, perhaps with verification, to administrators while searchable snapshots are mounted. I'm not sure what updates are allowed today - it seems clear we'd want to permit updating (removing) secure_key and access_key for folks migrating to secure settings.

The location update is also compelling. We provide a repository analyzer API so that folks can try to use searchable snapshots with other S3-like blob stores. It's not unreasonable for an on-prem administrator to migrate data from one device to another and want to point Elasticsearch at the new device with minimal disruption, or perhaps when a system goes down. Maybe they want to update from http to https. I worry there are a lot of reasonable maintenance operations we're not thinking of that we would prevent if we lock this down too much.

@csoulios added the v8.6.0 label and removed the v8.5.0 label on Sep 21, 2022
@benwtrent added the v7.17.8 label and removed the v7.17.7 label on Oct 13, 2022