
feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177

Open · wants to merge 6 commits into base: va/extend-backup-manifest-agent-metadata from va/extend-backup-manifest-part-3
Conversation

VAveryanov8
Collaborator

@VAveryanov8 VAveryanov8 commented Dec 17, 2024

This adds the following data to the backup manifest:

General:

cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag

Instance Details:

shard_count: number of shards in the scylla node
storage_size: total size of the disk in bytes
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano or empty when on-premise

This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
It also includes fixes in NodeInfo-related unit tests.

Fixes: #4130


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch 2 times, most recently from cedece2 to acf312c on December 18, 2024 15:43
@VAveryanov8
Collaborator Author

If we intend to populate ManifestInfo from the file content instead of the file path and name, then it's worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

@VAveryanov8 VAveryanov8 marked this pull request as ready for review December 18, 2024 16:34
@karol-kokoszka
Collaborator

> If we intend to populate ManifestInfo from the file content instead of the file path and name, then it's worth including snapshot_id and task_id in the manifest.
>
> However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

Do you mean that including snapshot_id and task_id would be a significant change?

@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 18, 2024

> I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.
>
> Do you mean that including snapshot_id and task_id would be a significant change?

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated once all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, and snapshot_id and task_id are currently missing.

@Michal-Leszczynski
Collaborator

> We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated once all the needed info is contained in the manifest file.
> So I'm just pointing out that if we want to do that, it will be a relatively significant change, and snapshot_id and task_id are currently missing.

I guess that we can add them to the manifest file when uploading the manifest, while still setting them in ManifestInfo when reading the manifest the same way as we do today, by parsing the path. This way the changes wouldn't require any adjustments in the code base, but they would make manifests more self-contained, which might be helpful in the future.

@VAveryanov8
Collaborator Author

I've updated the PR and added snapshot_id and task_id; it's ready for review 👁️

@VAveryanov8
Collaborator Author

@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️

Comment on lines 3 to +6

```
go 1.23.2

require (
	cloud.google.com/go/compute/metadata v0.3.0
```
Collaborator

Going with squash and merge while mixing commits that update vendor with commits that implement features is messy. To make it cleaner, we can either:

  • make a separate PR for updating vendor
  • not use squash and merge: the owner of the PR would need to manually squash commits with some reasonable logic (e.g. separating vendor changes from feature implementations) and then use the rebase and merge option

Collaborator Author

Yes, I can squash changes manually before merging and then merge without squashing.

pkg/cmd/agent/nodeinfo_linux.go
Comment on lines 117 to 140

```go
// manifestInstanceDetails collects node/instance-specific information that's needed for 1-to-1 restore.
func (w *worker) manifestInstanceDetails(ctx context.Context, host hostInfo) (InstanceDetails, error) {
	var result InstanceDetails

	shardCount, err := w.Client.ShardCount(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.ShardCount")
	}
	result.ShardCount = int(shardCount)

	nodeInfo, err := w.Client.NodeInfo(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.NodeInfo")
	}
	result.StorageSize = nodeInfo.StorageSize

	metaSvc, err := cloudmeta.NewCloudMeta(w.Logger)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "new cloud meta svc")
	}

	instanceMeta, err := metaSvc.GetInstanceMetadata(ctx)
	if err != nil {
		// Metadata may not be available for several reasons:
		// 1. running on-premise 2. disabled 3. something went wrong with the metadata server.
		// As we cannot distinguish between these cases, we can only log the error and continue with the backup.
		w.Logger.Error(ctx, "Get instance metadata", "err", err)
	}
	result.CloudProvider = string(instanceMeta.CloudProvider)
	result.InstanceType = instanceMeta.InstanceType

	return result, nil
```
Collaborator

Unfortunately, all of this code is executed on the SM side.
I guess that querying instance metadata should be done on the agent side?

Collaborator Author

oh, I missed that, you're right!

Collaborator Author

Do you think NodeInfo is a good place to extend with the InstanceType and CloudProvider information? Or is it better to have a separate call for them?

Collaborator

I wouldn't mix them if it meant that we always try to query multiple providers when fetching NodeInfo, as it's used in multiple places in the code.
But I guess it's safe to assume that the instance type won't change at runtime, so we can cache this value in agent memory and query it only once per agent restart. The problem could be distinguishing a timeout when querying the instance type from querying it on-prem.
Another approach could be extending the NodeInfo API call with an optional query param specifying that the instance details should also be included in the node info, but that's not much different from writing a separate endpoint for them.

Collaborator Author

I think I'll go with a separate call just for the metadata.

…gger`.

This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and
`data_directory` fields.
This extends the agent server with a `/cloud/metadata` endpoint which returns
instance details such as `cloud_provider` and `instance_type`.
This adds the following data to the backup manifest:
General:
  cluster_id: uuid of the cluster
  dc: data center name
  rack: rack from the scylla configuration
  node_id: id of the scylla node (equals to host id)
  task_id: uuid of the backup task
  snapshot_tag: snapshot tag
Instance Details:
  shard_count: number of shards in the scylla node
  storage_size: total size of the disk in bytes
  cloud_provider: aws|gcp|azure or empty in case of on-premise
  instance_type: instance type, e.g. t2.nano or empty when on-premise

Fixes: #4130
This fixes the issue where the context passed to GetInstanceMetadata could be
canceled before any of the providers' functions returned.
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 164ff2c to 0599418 on December 23, 2024 10:54
@VAveryanov8 VAveryanov8 changed the base branch from master to va/extend-backup-manifest-agent-metadata December 23, 2024 10:54
@VAveryanov8
Copy link
Collaborator Author

TODO before merge

Successfully merging this pull request may close these issues.

Extend the backup manifest