backupccl: reset restored jobs during cluster restore #63756
Conversation
Although the new validation during restore will catch restores of clusters that were mid-schema change, the contents of OFFLINE spans can no longer be trusted, so IMPORT and RESTORE jobs at the very least need to be reset. I've carved out the minimum set of changes from #62638 to backport. That being said, since earlier versions didn't have the streaming executor, the backports are likely to look fairly different.
@@ -604,6 +604,11 @@ func restore(
 		return emptyRowCount, nil
 	}

+	details := job.Details().(jobspb.RestoreDetails)
+	if alreadyMigrated := checkForMigratedData(details, dataToRestore); alreadyMigrated {
I think I now understand why this early return if any table has been migrated is safe -- we call restore, then we migrate the restored system tables later, outside restore, right?
Maybe a comment? Or maybe we should just skip calling restore at all on resume if dataToRestore is empty or migratedData is true, instead of early returning? Or maybe I'm just being dense today, that could be true too.
In any case, totally non-blocking, just rambling out loud here.
The only catch here is that we should check for migrations per-batch. If the earlier batch (that restores the zones) needs a migration, that migration should happen before the restoration of the main data. We'll probably need something like this when re-keying the contents of system keys.
I kept the check inside restore since it is common to both the "preData" restoration (that of the zones table) and the main data restoration.
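The per-batch idea discussed above can be sketched in isolation. This is a minimal, self-contained Go sketch, not CockroachDB's actual implementation: RestoreDetails, restorationBatch, and the exact semantics of checkForMigratedData here are illustrative stand-ins for the PR's types, modeling "skip a batch on resume only if every system table in that batch has already been migrated."

```go
package main

import "fmt"

// RestoreDetails is a hypothetical stand-in for the job details, tracking
// which system tables have already had their post-restore migration applied.
type RestoreDetails struct {
	SystemTablesMigrated map[string]bool
}

// restorationBatch names the system tables restored by one batch, e.g. the
// "preData" batch (zones) vs. the main-data batch.
type restorationBatch struct {
	systemTables []string
}

// checkForMigratedData reports whether every system table in this batch has
// already been migrated, so the batch can be skipped when the job resumes.
// The check is evaluated per batch, independently of the other batches.
func checkForMigratedData(details RestoreDetails, batch restorationBatch) bool {
	for _, tbl := range batch.systemTables {
		if !details.SystemTablesMigrated[tbl] {
			return false
		}
	}
	return len(batch.systemTables) > 0
}

func main() {
	details := RestoreDetails{SystemTablesMigrated: map[string]bool{"zones": true}}
	preData := restorationBatch{systemTables: []string{"zones"}}
	mainData := restorationBatch{systemTables: []string{"jobs"}}
	fmt.Println(checkForMigratedData(details, preData))  // true: skip this batch
	fmt.Println(checkForMigratedData(details, mainData)) // false: restore it
}
```

With this shape, a migration needed by the earlier zones batch runs before the main data is restored, matching the per-batch concern raised above.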
pkg/ccl/backupccl/restore_job.go (outdated)
		details.SystemTablesMigrated[systemTableName] = true
		return r.job.SetDetails(ctx, txn, details)
	}); err != nil {
		return nil
linter points out this probably wanted to be err
Done.
Force-pushed from b81193f to 5044189.
I've updated the jobs to go to cancel-requested rather than reverting. I had hesitations about the registry being upset that the job did not have an error set on its payload while in the reverting state.
Previously, custom implementations of restoring system tables during cluster restore may not have been idempotent. As such, a map was used to track when particular system tables had been restored. This was fragile. This change updates the system table restoration logic to be idempotent for all custom implementations (only the jobs table needed updating).

Release note: None
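The idempotency idea in this commit message can be illustrated with a small, self-contained Go sketch. The kvStore type and restoreSystemTable helper are hypothetical, not CockroachDB's API: the point is only that a "clear destination, then copy from the restored temporary table" restore yields the same end state no matter how many times it re-runs, so no bookkeeping map is needed.

```go
package main

import "fmt"

// kvStore is a toy stand-in for the cluster's tables: table name -> rows.
type kvStore struct {
	tables map[string]map[string]string
}

// restoreSystemTable is an idempotent custom restore: it deletes whatever is
// currently in the destination table, then re-inserts every row from the
// restored temporary copy. Re-running it after a job resume is harmless.
func (s *kvStore) restoreSystemTable(dst, tmp string) {
	s.tables[dst] = map[string]string{}
	for k, v := range s.tables[tmp] {
		s.tables[dst][k] = v
	}
}

func main() {
	s := &kvStore{tables: map[string]map[string]string{
		"system.jobs":    {"job-1": "stale"},
		"crdb_temp.jobs": {"job-1": "restored", "job-2": "restored"},
	}}
	s.restoreSystemTable("system.jobs", "crdb_temp.jobs")
	s.restoreSystemTable("system.jobs", "crdb_temp.jobs") // re-run: same result
	fmt.Println(len(s.tables["system.jobs"])) // 2
}
```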
Previously, jobs were restored without modification during cluster restore. Due to a recently discovered bug where backup may miss non-transactional writes written to offline spans by these jobs, their progress may no longer be accurate on the restored cluster. IMPORT and RESTORE jobs perform non-transactional writes that may be missed. When a cluster RESTORE brings back these OFFLINE tables, it also brings back their associated jobs. To ensure the underlying data in these tables is correct, the jobs are now set in a reverting state so that they can clean up after themselves. In-progress schema change jobs that are affected will fail upon validation.

Release note (bug fix): Fix a bug where restored jobs may have been assumed to have made progress that was not captured in the backup. The restored jobs are now canceled during cluster restore.
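The resetting step described above can be sketched as follows. This is a simplified, hypothetical model, not CockroachDB's jobs API: the Job struct and status strings are illustrative. It shows the policy from the commit message and the thread: restored IMPORT and RESTORE jobs are moved to cancel-requested so they revert their untrusted offline-span writes, while other jobs are left untouched.

```go
package main

import "fmt"

// Job is a toy stand-in for a restored job record.
type Job struct {
	ID     int
	Type   string // "IMPORT", "RESTORE", "SCHEMA CHANGE", ...
	Status string // "running", "cancel-requested", ...
}

// resetRestoredJobs applies the cluster-restore policy sketched in this PR:
// jobs whose non-transactional writes to OFFLINE spans may be missing from
// the backup are asked to cancel, so they clean up after themselves.
func resetRestoredJobs(jobs []Job) {
	for i := range jobs {
		switch jobs[i].Type {
		case "IMPORT", "RESTORE":
			// Progress checkpoints for these jobs cannot be trusted:
			// cancel them so they roll back their partial data.
			jobs[i].Status = "cancel-requested"
		default:
			// Other jobs are left alone; affected in-progress schema
			// changes fail later during validation instead.
		}
	}
}

func main() {
	jobs := []Job{
		{ID: 1, Type: "IMPORT", Status: "running"},
		{ID: 2, Type: "SCHEMA CHANGE", Status: "running"},
	}
	resetRestoredJobs(jobs)
	fmt.Println(jobs[0].Status, jobs[1].Status) // cancel-requested running
}
```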
Force-pushed from 5044189 to 1c882a0.
TFTRs!
Build failed (retrying...):
Build succeeded: