
backupccl: reset restored jobs during cluster restore #63756

Merged
merged 2 commits into cockroachdb:master on Apr 19, 2021

Conversation

@pbardea (Contributor) commented Apr 15, 2021

Previously, jobs were restored without modification during cluster
restore. Due to a recently discovered bug where backup may miss
non-transactional writes written to offline spans by these jobs, their
progress may no longer be accurate on the restored cluster.

IMPORT and RESTORE jobs perform non-transactional writes that may be
missed. When a cluster RESTORE brings back these OFFLINE tables, it will
also bring back their associated jobs. To ensure the underlying data in
these tables is correct, the jobs are now set in a reverting state so
that they can clean up after themselves.

In-progress schema change jobs that are affected will fail upon
validation.

Release note (bug fix): Fix a bug where restored jobs may have assumed
to have made progress that was not captured in the backup. The restored
jobs are now canceled during cluster restore.
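
For illustration, a minimal sketch of the rewrite described above. The
helper and status strings below are stand-ins, not the actual
CockroachDB implementation (and as discussed later in this thread, the
final change uses cancel-requested rather than a direct reverting
state):

// rewriteJobStatus sketches the idea: an IMPORT or RESTORE job brought
// back by a cluster RESTORE cannot trust its recorded progress, so it
// is redirected to clean up after itself instead of resuming.
func rewriteJobStatus(jobType, status string) string {
	switch jobType {
	case "IMPORT", "RESTORE":
		if status == "running" {
			return "cancel-requested" // revert and clean up, don't resume
		}
	}
	return status
}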

@pbardea pbardea requested review from dt and a team April 15, 2021 20:44
@cockroach-teamcity (Member) commented: This change is Reviewable

@dt dt requested a review from ajwerner April 15, 2021 22:37
@pbardea (Contributor, Author) commented Apr 15, 2021

Although the new validation during restore will catch clusters restored mid-schema-change, IMPORT and RESTORE jobs at the very least need to be updated, since OFFLINE spans can no longer be trusted. I've carved out the minimum set of changes from #62638 to backport. That said, since earlier versions didn't have the streaming executor, the backports will likely look fairly different.

@@ -604,6 +604,11 @@ func restore(
return emptyRowCount, nil
}

details := job.Details().(jobspb.RestoreDetails)
if alreadyMigrated := checkForMigratedData(details, dataToRestore); alreadyMigrated {
Member commented:

I think I now understand why this early return if any table has been migrated is safe -- we call restore, then we migrate the restored system tables later, outside restore, right?

maybe a comment? or maybe we should just skip calling restore at all on resume if dataToRestore is empty or migratedData is true instead of early returning? or maybe I'm just being dense today, that could be true too.

in any case, totally non-blocking, just rambling out loud here.

Contributor Author replied:

The only catch here is that we should check for migrations per-batch. If the earlier batch (the one that restores the zones) needs a migration, that migration should happen before the restoration of the main data. We'll probably need something like this when re-keying the contents of system keys.

I kept the check inside restore since it is common to both attempts of the "preData" restoration (that of the zones table) and the main data restoration.
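
A simplified, self-contained sketch of the check being discussed. The
real code uses jobspb.RestoreDetails and the restore batches directly;
the types and names here are stand-ins:

// restoreDetails stands in for jobspb.RestoreDetails; the
// SystemTablesMigrated map records which system tables have already had
// their custom post-restore migration run.
type restoreDetails struct {
	SystemTablesMigrated map[string]bool
}

// checkForMigratedData reports whether any table in this batch was
// already restored and migrated on a previous resumption. Restoring the
// batch again would clobber the migrated contents, so the caller
// returns early, as in the diff above.
func checkForMigratedData(details restoreDetails, batchTables []string) bool {
	for _, name := range batchTables {
		if details.SystemTablesMigrated[name] {
			return true
		}
	}
	return false
}

Because the check runs once per batch, a migration needed by the
earlier zones ("preData") batch can complete before the main data batch
is restored.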

details.SystemTablesMigrated[systemTableName] = true
return r.job.SetDetails(ctx, txn, details)
}); err != nil {
return nil
Member commented:

linter points out this probably wanted to be err

Contributor Author replied:

Done.
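
For reference, a sketch of the acknowledged fix: the transaction's
error is returned instead of being swallowed (surrounding code
abbreviated as in the excerpt above):

details.SystemTablesMigrated[systemTableName] = true
return r.job.SetDetails(ctx, txn, details)
}); err != nil {
return err // was `return nil`, which silently dropped the error
}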

@pbardea (Contributor, Author) commented Apr 16, 2021

I've updated the jobs to go to cancel-requested rather than reverting. I had hesitations about the registry being upset that the job did not have an error set on its payload while in the reverting state.
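
A toy encoding of the registry invariant being described; the status
names and helper are illustrative, not the actual jobs registry code:

// validJobState captures the constraint: a job in the reverting state
// is expected to carry an error on its payload. Moving restored jobs to
// cancel-requested instead lets the registry install that error itself
// through the normal cancelation path.
func validJobState(status string, payloadErr error) bool {
	if status == "reverting" {
		return payloadErr != nil // reverting requires a recorded error
	}
	return true
}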

Previously, custom implementations of restoring system tables during
cluster restore may not have been idempotent. As such, a map was used to
track when particular system tables had been restored; this was fragile.
This change updates the system table restoration logic to be idempotent
for all custom implementations (only the jobs table needed updating).

Release note: None
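
As an illustration of the idempotent approach, a rough sketch of a
re-runnable custom restore for the jobs table. The executor type and
function are stand-ins, assuming the backed-up copy has been loaded
into a temporary database (called crdb_temp_system here):

import "context"

// execFn is a stand-in for an internal SQL executor bound to the
// restore transaction.
type execFn func(ctx context.Context, stmt string) error

func restoreJobsTable(ctx context.Context, exec execFn) error {
	// Clear the real table, then repopulate it from the restored
	// temporary copy; both statements run in the surrounding
	// transaction, so re-running after a partial failure converges to
	// the same end state and no "already restored" map is needed.
	if err := exec(ctx, `DELETE FROM system.jobs WHERE true`); err != nil {
		return err
	}
	return exec(ctx, `INSERT INTO system.jobs (SELECT * FROM crdb_temp_system.jobs)`)
}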
@pbardea (Contributor, Author) commented Apr 19, 2021

TFTRs
bors r=dt

craig bot (Contributor) commented Apr 19, 2021

Build failed (retrying...):

craig bot (Contributor) commented Apr 19, 2021

Build succeeded:

@craig craig bot merged commit 4dc05cc into cockroachdb:master Apr 19, 2021