feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177
base: va/extend-backup-manifest-agent-metadata
Conversation
force-pushed from cedece2 to acf312c
If we intend to populate ManifestInfo from file content instead of from the file path and file name, then it would be worth including snapshot_id and task_id as well. However, moving to populating ManifestInfo from file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.
I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything. Do you mean that if we want to include snapshot_id and task_id, then it's a significant change?
We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
I guess that we can add them to the manifest file when uploading the manifest, but we can set them in
I've updated the PR - added snapshot_id and task_id; it's ready for review 👁️
@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️
```
go 1.23.2

require (
	cloud.google.com/go/compute/metadata v0.3.0
```
Going with the `squash and merge` and mixing commits that update vendor and implement features is messy. In order to make it cleaner, we can either:
- make a separate PR for updating vendor
- not use `squash and merge` - the owner of the PR would need to manually squash commits with some reasonable logic (e.g. separate vendor changes from feature implementations) and then use the `rebase and merge` option
Yes, I can squash changes manually before merging and then merge without squashing.
```go
// manifestInstanceDetails collects node/instance specific information that's needed for 1-to-1 restore.
func (w *worker) manifestInstanceDetails(ctx context.Context, host hostInfo) (InstanceDetails, error) {
	var result InstanceDetails

	shardCount, err := w.Client.ShardCount(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.ShardCount")
	}
	result.ShardCount = int(shardCount)

	nodeInfo, err := w.Client.NodeInfo(ctx, host.IP)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "client.NodeInfo")
	}
	result.StorageSize = nodeInfo.StorageSize

	metaSvc, err := cloudmeta.NewCloudMeta(w.Logger)
	if err != nil {
		return InstanceDetails{}, errors.Wrap(err, "new cloud meta svc")
	}

	instanceMeta, err := metaSvc.GetInstanceMetadata(ctx)
	if err != nil {
		// Metadata may not be available for several reasons:
		// 1. running on-premise, 2. disabled, 3. something went wrong with the metadata server.
		// As we cannot distinguish between these cases, we can only log the error and continue with the backup.
		w.Logger.Error(ctx, "Get instance metadata", "err", err)
	}
	result.CloudProvider = string(instanceMeta.CloudProvider)
	result.InstanceType = instanceMeta.InstanceType

	return result, nil
}
```
Unfortunately, all of this code is executed on the SM side.
I guess that querying instance metadata should be done on the agent side?
oh, I missed that, you're right!
Do you think NodeInfo is a good place to extend with InstanceType and CloudProvider information? Or is it better to have a separate call for them?
I wouldn't mix them if it meant that we always try to query multiple providers when fetching NodeInfo, as it's used in multiple places in the code.
But I guess it's safe to assume that the instance type won't change at runtime, so we can cache this value in agent memory and query it only once per agent restart. The problem could be distinguishing a timeout when querying the instance type from querying it on-prem.
Another approach could be to extend the NodeInfo API call with an optional query param specifying that the instance details should also be included in the node info, but that's not much different from writing a separate endpoint for them.
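For illustration, a minimal sketch of what caching the metadata in agent memory could look like; the `InstanceMetadata` type and the `metadataCache` helper are assumptions, not the actual agent code:

```go
package agentmeta

import (
	"context"
	"sync"
)

// InstanceMetadata is a stand-in for the real type returned by the cloud
// metadata service; the field names here are assumptions.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// metadataCache queries the metadata server at most once per agent process
// and serves the cached result afterwards.
type metadataCache struct {
	once sync.Once
	meta InstanceMetadata
	err  error
}

// Get runs fetch exactly once for the lifetime of the cache and returns the
// cached result (or cached error) on every subsequent call.
func (c *metadataCache) Get(ctx context.Context, fetch func(context.Context) (InstanceMetadata, error)) (InstanceMetadata, error) {
	c.once.Do(func() {
		c.meta, c.err = fetch(ctx)
	})
	return c.meta, c.err
}
```

Note that caching with `sync.Once` also pins a transient timeout as a permanent failure, which is exactly the timeout-vs-on-prem ambiguity mentioned above; a real implementation might retry on error instead of caching it.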
I think I'll go with a separate call just for the metadata.
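For illustration only, a hedged sketch of what such a dedicated agent call could look like; only the `/cloud/metadata` route and the response field names come from the commits below, while the handler name, timeout, and `getMeta` wiring are assumptions:

```go
package agent

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// cloudMetadataResponse mirrors the fields the endpoint is meant to expose.
type cloudMetadataResponse struct {
	CloudProvider string `json:"cloud_provider"`
	InstanceType  string `json:"instance_type"`
}

// cloudMetadataHandler serves GET /cloud/metadata.
func cloudMetadataHandler(getMeta func(context.Context) (cloudMetadataResponse, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
		defer cancel()

		meta, err := getMeta(ctx)
		if err != nil {
			// On-premise or unreachable metadata server: respond with empty
			// fields rather than an error, matching the log-and-continue
			// behavior on the backup side.
			meta = cloudMetadataResponse{}
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(meta)
	}
}
```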
…gger`. This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and `data_directory` fields.
This extends the agent server with a `/cloud/metadata` endpoint which returns instance details such as `cloud_provider` and `instance_type`.
This adds the following data to the backup manifest:
General:
cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag
Instance Details:
shard_count: number of shards in the scylla node
storage_size: total size of the disk in bytes
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano, or empty when on-premise
Fixes: #4130
This fixes the issue where the context that was passed to GetInstanceMetadata is canceled before any of the provider functions returned.
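As a sketch of the kind of cancellation check such a fix typically adds (the channel-based layout here is an assumption, not the actual cloudmeta code):

```go
package cloudmeta

import "context"

// InstanceMetadata is a stand-in for the real result type.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// getFirstResult waits for the first provider result but also honors
// cancellation of the caller's context.
func getFirstResult(ctx context.Context, results <-chan InstanceMetadata) (InstanceMetadata, error) {
	select {
	case meta := <-results:
		return meta, nil
	case <-ctx.Done():
		// Without this case, a canceled context would leave the caller
		// blocked until some provider returned.
		return InstanceMetadata{}, ctx.Err()
	}
}
```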
force-pushed from 164ff2c to 0599418
TODO before merge
This adds the following data to the backup manifest:
General:
cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag
Instance Details:
shard_count: number of shards in the scylla node
storage_size: total size of the disk in bytes
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano, or empty when on-premise
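For illustration, the added fields could look roughly like this in a manifest; the values are made up and the exact JSON layout, including the `instance_details` nesting, is an assumption based on the field list above:

```json
{
  "cluster_id": "0b02b6b1-f33c-4df2-a6ad-d62b0c8c1d0f",
  "dc": "dc1",
  "rack": "rack1",
  "node_id": "6f1a3b6e-2f7b-4b1a-9c9a-0d2b6a1f3c4d",
  "task_id": "c3f9d8a2-1b2c-4d5e-8f90-123456789abc",
  "snapshot_tag": "sm_20241115103000UTC",
  "instance_details": {
    "shard_count": 8,
    "storage_size": 475000000000,
    "cloud_provider": "aws",
    "instance_type": "t2.nano"
  }
}
```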
This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
This also includes fixes in unit tests related to NodeInfo.
Fixes: #4130