
feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177

Open · wants to merge 6 commits into base: va/extend-backup-manifest-agent-metadata from va/extend-backup-manifest-part-3
Conversation

VAveryanov8
Collaborator

@VAveryanov8 VAveryanov8 commented Dec 17, 2024

This adds the following data to the backup manifest:

General:

cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag

Instance Details:

shard_count: number of shards in the scylla node
storage_size: total size of the disk in bytes
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano or empty when on-premise

This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
It also includes fixes in NodeInfo-related unit tests.

Fixes: #4130


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch 2 times, most recently from cedece2 to acf312c on December 18, 2024 15:43
@VAveryanov8
Collaborator Author

If we intend to populate ManifestInfo from the file content instead of the file path and name, then it's worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

@VAveryanov8 VAveryanov8 marked this pull request as ready for review December 18, 2024 16:34
@karol-kokoszka
Collaborator

> If we intend to populate ManifestInfo from the file content instead of the file path and name, then it's worth including snapshot_id and task_id in the manifest.
>
> However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

Do you mean that including snapshot_id and task_id would be a significant change?

@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 18, 2024

> I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.
>
> Do you mean that including snapshot_id and task_id would be a significant change?

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated once all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, and snapshot_id and task_id are currently missing.

@Michal-Leszczynski
Collaborator

> We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated once all the needed info is contained in the manifest file.
> So I'm just pointing out that if we want to do that, it will be a relatively significant change, and snapshot_id and task_id are currently missing.

I guess that we can add them to the manifest file when uploading the manifest, while still setting them in ManifestInfo when reading the manifest the same way as we do today, by parsing the path. This way the changes wouldn't require any adjustments in the code base, but they would make manifests more self-contained, which might be helpful in the future.

@VAveryanov8
Collaborator Author

I've updated the PR and added snapshot_id and task_id; it's ready for review 👁️

@VAveryanov8
Collaborator Author

@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️

Comment on lines 3 to +6

```
go 1.23.2

require (
	cloud.google.com/go/compute/metadata v0.3.0
```
Collaborator

Going with squash and merge while mixing commits that update vendor with commits that implement features is messy. To make it cleaner, we can either:

  • make a separate PR for updating vendor
  • not use squash and merge: the owner of the PR would need to manually squash commits with some reasonable logic (e.g. separating vendor changes from feature implementations) and then use the rebase and merge option

Collaborator Author

Yes, I can squash changes manually before merging and then merge without squashing.

pkg/cmd/agent/nodeinfo_linux.go
Comment on lines 117 to 140

```go
// manifestInstanceDetails collects node/instance-specific information that's needed for 1-to-1 restore.
func (w *worker) manifestInstanceDetails(ctx context.Context, host hostInfo) (InstanceDetails, error) {
	var result InstanceDetails

	shardCount, err := w.Client.ShardCount(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.ShardCount")
	}
	result.ShardCount = int(shardCount)

	nodeInfo, err := w.Client.NodeInfo(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.NodeInfo")
	}
	result.StorageSize = nodeInfo.StorageSize

	metaSvc, err := cloudmeta.NewCloudMeta(w.Logger)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "new cloud meta svc")
	}

	instanceMeta, err := metaSvc.GetInstanceMetadata(ctx)
	if err != nil {
		// Metadata may not be available for several reasons:
		// 1. running on-premise 2. disabled 3. something went wrong with the metadata server.
		// As we cannot distinguish between these cases, we can only log the error and continue with the backup.
		w.Logger.Error(ctx, "Get instance metadata", "err", err)
	}
	result.CloudProvider = string(instanceMeta.CloudProvider)
	result.InstanceType = instanceMeta.InstanceType

	return result, nil
```
Collaborator

Unfortunately, all of this code is executed on the SM side.
I guess that querying instance metadata should be done on the agent side?

Collaborator Author

oh, I missed that, you're right!

Collaborator Author

Do you think NodeInfo is a good place to extend with the InstanceType and CloudProvider information? Or is it better to have a separate call for them?

Collaborator

I wouldn't mix them if it meant that we always try to query multiple providers when fetching NodeInfo, as it's used in multiple places in the code.
But I guess it's safe to assume that the instance type won't change at runtime, so we can cache this value in agent memory and query it only once per agent restart. The problem could be distinguishing a timeout when querying the instance type from querying it on-prem.
Another approach could be extending the NodeInfo API call with an optional query param specifying that the instance details should also be included in the node info, but that's not much different from writing a separate endpoint for them.

Collaborator Author

I think I'll go with a separate call just for the metadata.

…gger`.

This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and
`data_directory` fields.
This extends the agent server with a `/cloud/metadata` endpoint which returns
instance details such as `cloud_provider` and `instance_type`.
This adds the following data to the backup manifest:
General:
  cluster_id: uuid of the cluster
  dc: data center name
  rack: rack from the scylla configuration
  node_id: id of the scylla node (equals to host id)
  task_id: uuid of the backup task
  snapshot_tag: snapshot tag
Instance Details:
  shard_count: number of shards in the scylla node
  storage_size: total size of the disk in bytes
  cloud_provider: aws|gcp|azure or empty in case of on-premise
  instance_type: instance type, e.g. t2.nano or empty when on-premise

Fixes: #4130
This fixes the issue where the context passed to GetInstanceMetadata could be
canceled before any of the providers' functions returned.
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 164ff2c to 0599418 on December 23, 2024 10:54
@VAveryanov8 VAveryanov8 changed the base branch from master to va/extend-backup-manifest-agent-metadata December 23, 2024 10:54
@VAveryanov8
Copy link
Collaborator Author

TODO before merge

Successfully merging this pull request may close these issues.

Extend the backup manifest