
backupccl: reset restored jobs during cluster restore #63756

Merged
merged 2 commits into cockroachdb:master on Apr 19, 2021

Conversation

@pbardea (Contributor) commented Apr 15, 2021

Previously, jobs were restored without modification during cluster
restore. Due to a recently discovered bug where backup may miss
non-transactional writes written to offline spans by these jobs, their
progress may no longer be accurate on the restored cluster.

IMPORT and RESTORE jobs perform non-transactional writes that may be
missed. When a cluster RESTORE brings back these OFFLINE tables, it will
also bring back their associated jobs. To ensure the underlying data in
these tables is correct, the jobs are now set in a reverting state so
that they can clean up after themselves.

In-progress schema change jobs that are affected will fail upon
validation.

Release note (bug fix): Fix a bug where restored jobs may have assumed
to have made progress that was not captured in the backup. The restored
jobs are now canceled during cluster restore.
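
For illustration, a minimal sketch of the rewrite described above. The
helper and status strings below are stand-ins, not the actual
CockroachDB implementation (and as discussed later in this thread, the
final change uses cancel-requested rather than a direct reverting
state):

// rewriteJobStatus sketches the idea: an IMPORT or RESTORE job brought
// back by a cluster RESTORE cannot trust its recorded progress, so it
// is redirected to clean up after itself instead of resuming.
func rewriteJobStatus(jobType, status string) string {
	switch jobType {
	case "IMPORT", "RESTORE":
		if status == "running" {
			return "cancel-requested" // revert and clean up, don't resume
		}
	}
	return status
}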

@pbardea pbardea requested review from dt and a team April 15, 2021 20:44
@cockroach-teamcity (Member) commented: This change is Reviewable

@dt dt requested a review from ajwerner April 15, 2021 22:37
@pbardea (Contributor, Author) commented Apr 15, 2021

Although the new validation during restore will catch clusters restored mid-schema-change, IMPORT and RESTORE jobs at the very least need to be updated, since OFFLINE spans can no longer be trusted. I've carved out the minimum set of changes from #62638 to backport. That said, since earlier versions didn't have the streaming executor, the backports will likely look fairly different.

@@ -604,6 +604,11 @@ func restore(
return emptyRowCount, nil
}

details := job.Details().(jobspb.RestoreDetails)
if alreadyMigrated := checkForMigratedData(details, dataToRestore); alreadyMigrated {
Member commented:

I think I now understand why this early return if any table has been migrated is safe -- we call restore, then we migrate the restored system tables later, outside restore, right?

maybe a comment? or maybe we should just skip calling restore at all on resume if dataToRestore is empty or migratedData is true instead of early returning? or maybe I'm just being dense today, that could be true too.

in any case, totally non-blocking, just rambling out loud here.

Contributor Author replied:

The only catch here is that we should check for migrations per-batch. If the earlier batch (the one that restores the zones) needs a migration, that migration should happen before the restoration of the main data. We'll probably need something like this when re-keying the contents of system keys.

I kept the check inside restore since it is common to both attempts of the "preData" restoration (that of the zones table) and the main data restoration.
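
A simplified, self-contained sketch of the check being discussed. The
real code uses jobspb.RestoreDetails and the restore batches directly;
the types and names here are stand-ins:

// restoreDetails stands in for jobspb.RestoreDetails; the
// SystemTablesMigrated map records which system tables have already had
// their custom post-restore migration run.
type restoreDetails struct {
	SystemTablesMigrated map[string]bool
}

// checkForMigratedData reports whether any table in this batch was
// already restored and migrated on a previous resumption. Restoring the
// batch again would clobber the migrated contents, so the caller
// returns early, as in the diff above.
func checkForMigratedData(details restoreDetails, batchTables []string) bool {
	for _, name := range batchTables {
		if details.SystemTablesMigrated[name] {
			return true
		}
	}
	return false
}

Because the check runs once per batch, a migration needed by the
earlier zones ("preData") batch can complete before the main data batch
is restored.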

details.SystemTablesMigrated[systemTableName] = true
return r.job.SetDetails(ctx, txn, details)
}); err != nil {
return nil
Member commented:

linter points out this probably wanted to be err

Contributor Author replied:

Done.
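
For reference, a sketch of the acknowledged fix: the transaction's
error is returned instead of being swallowed (surrounding code
abbreviated as in the excerpt above):

details.SystemTablesMigrated[systemTableName] = true
return r.job.SetDetails(ctx, txn, details)
}); err != nil {
return err // was `return nil`, which silently dropped the error
}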

@pbardea (Contributor, Author) commented Apr 16, 2021

I've updated the jobs to go to cancel-requested rather than reverting. I had hesitations about the registry being upset that the job did not have an error set on its payload while in the reverting state.
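
A toy encoding of the registry invariant being described; the status
names and helper are illustrative, not the actual jobs registry code:

// validJobState captures the constraint: a job in the reverting state
// is expected to carry an error on its payload. Moving restored jobs to
// cancel-requested instead lets the registry install that error itself
// through the normal cancelation path.
func validJobState(status string, payloadErr error) bool {
	if status == "reverting" {
		return payloadErr != nil // reverting requires a recorded error
	}
	return true
}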

Previously, custom implementations of restoring system tables during
cluster restore may not have been idempotent. As such, a map was used to
track when particular system tables had been restored; this was fragile.
This change updates the system table restoration logic to be idempotent
for all custom implementations (only the jobs table needed updating).

Release note: None
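
As an illustration of the idempotent approach, a rough sketch of a
re-runnable custom restore for the jobs table. The executor type and
function are stand-ins, assuming the backed-up copy has been loaded
into a temporary database (called crdb_temp_system here):

import "context"

// execFn is a stand-in for an internal SQL executor bound to the
// restore transaction.
type execFn func(ctx context.Context, stmt string) error

func restoreJobsTable(ctx context.Context, exec execFn) error {
	// Clear the real table, then repopulate it from the restored
	// temporary copy; both statements run in the surrounding
	// transaction, so re-running after a partial failure converges to
	// the same end state and no "already restored" map is needed.
	if err := exec(ctx, `DELETE FROM system.jobs WHERE true`); err != nil {
		return err
	}
	return exec(ctx, `INSERT INTO system.jobs (SELECT * FROM crdb_temp_system.jobs)`)
}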
@pbardea (Contributor, Author) commented Apr 19, 2021

TFTRs
bors r=dt

craig bot (Contributor) commented Apr 19, 2021

Build failed (retrying...):

craig bot (Contributor) commented Apr 19, 2021

Build succeeded:

@craig craig bot merged commit 4dc05cc into cockroachdb:master Apr 19, 2021