backupccl: fix backup/restore after FK table descriptor changes #39757
Conversation
Force-pushed 5a790f8 to 3947732.
This doesn’t need to block this PR, but given how important on-disk compatibility is, I want to bump the idea of having checked in testdata backups from prior releases, so that we can actually unit test that basic restore and sanity check works against past backups.
I opened #39772 to track adding cross-version RESTORE tests.
I have a bunch of comments, but none of them are blocking.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @lucy-zhang)
pkg/ccl/backupccl/backup.go, line 1044 at r1 (raw file):
```go
for i, uri := range incrementalFrom {
	// We don't read the table descriptors from the backup descriptor, so
	// there's no need to upgrade the table descriptors.
```
semi-nit: Could we try a little harder to guarantee that nobody reads those TDs as the code migrates? For example, you could nil the TableDescriptors field on this struct to make sure that if somebody gets confused they actually notice this message and reason about what's going on.
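The tripwire idea could be sketched as below; the types are simplified stand-ins for the real `backupccl` structs, and `readIgnoringTableDescs` is a hypothetical name, not a function in the PR:

```go
package main

import "fmt"

// Simplified stand-ins for the real backupccl types.
type TableDescriptor struct{ Name string }

type BackupDescriptor struct {
	TableDescriptors []TableDescriptor
}

// readIgnoringTableDescs models a code path that must not consult the
// (possibly old-format) table descriptors. Nil-ing the field up front makes
// any accidental read fail loudly instead of silently using stale data.
func readIgnoringTableDescs(desc *BackupDescriptor) {
	desc.TableDescriptors = nil // tripwire: old-format descs must not be read here
	// ... use only version-independent fields of desc ...
}

func main() {
	d := &BackupDescriptor{TableDescriptors: []TableDescriptor{{Name: "t"}}}
	readIgnoringTableDescs(d)
	fmt.Println(d.TableDescriptors == nil)
}
```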
pkg/ccl/backupccl/backup.go, line 1279 at r1 (raw file):
```go
var checkpointDesc *BackupDescriptor
// We don't read the table descriptors from the backup descriptor, so
// there's no need to upgrade the table descriptors.
```
ditto above.
pkg/ccl/backupccl/backup.go, line 1487 at r1 (raw file):
```go
}

func maybeDowngradeTableDescsInBackupDescriptor(
```
Could you please add a godoc to this method that explains the deal with the "maybe"?
pkg/ccl/backupccl/backup.go, line 1507 at r1 (raw file):
```go
}

func maybeUpgradeTableDescsInBackupDescriptors(
```
ditto Godoc.
pkg/ccl/backupccl/restore.go, line 1648 at r1 (raw file):
```go
// writing old-style descs in RestoreDetails (unless a job persists across
// an upgrade?).
if err := maybeUpgradeTableDescsInBackupDescriptors(
	ctx, backupDescs, true /* skipFKsWithNoMatchingTable */,
); err != nil {
```
Are we supposed to unconditionally skip stuff here? I thought there was a setting about it.
pkg/ccl/backupccl/show.go, line 73 at r1 (raw file):
```go
	return err
}
if err := maybeUpgradeTableDescsInBackupDescriptors(
	ctx, []BackupDescriptor{desc}, true /* skipFKsWithNoMatchingTable */,
); err != nil {
```
Please add a comment why this is true - why do we get to skip non-matching tables here?
pkg/ccl/cliccl/load.go, line 65 at r1 (raw file):
```go
// fields are added to the output, the table descriptors may need to be
// upgraded.
desc, err := backupccl.ReadBackupDescriptorFromURI(ctx, basepath, cluster.NoSettings)
```
ditto comment about preventing people from reading TD's, unless they need to here?
This PR fixes the backup/restore cluster version incompatibility issues introduced by the FK table descriptor representation upgrade. It forces all mixed-version clusters to write backup descriptors with table descriptors that are compatible with 19.1, and enables 19.2 nodes to read 19.1 backups correctly. This is done by downgrading table descriptors (conditionally, based on the cluster version) every time they are written to disk or written to a backup manifest, and upgrading them when FKs need to be read or written. Backup descriptors are upgraded using a `protoGetter` that reads table descriptors from the backup descriptors themselves, taking the place of `Txn` when resolving cross-table references. To handle backup descriptors containing tables that reference tables not in the backup, a new argument is added to `maybeUpgradeForeignKeyRepresentation` that allows skipping foreign keys that can't be restored during this lookup process because a table is missing.

I mostly tested this by running one of the failing tpch benchmarks (`tpchbench/tpchVec/nodes=3/cpu=4/sf=1`) to verify that it actually works, and by restoring the tpch fixture in a real mixed-version cluster (with both 19.1 and 19.2 nodes as the gateway node) and in a cluster that had been upgraded from 19.1 to 19.2.

I also ran the `backupccl` tests while forcing the cluster to be on cluster version 19.1 to simulate the mixed-version state:

```
diff --git a/pkg/sql/create_table.go b/pkg/sql/create_table.go
index 077f394442..a7549b0e2a 100644
--- a/pkg/sql/create_table.go
+++ b/pkg/sql/create_table.go
@@ -631,7 +631,7 @@ func ResolveFK(
 			LegacyReferencedIndex: legacyReferencedIndexID,
 		}
-		if !settings.Version.IsActive(cluster.VersionTopLevelForeignKeys) {
+		if !settings.Version.IsActive(cluster.VersionTopLevelForeignKeys) || true {
 			legacyUpgradedFromOriginReference := sqlbase.ForeignKeyReference{
 				Table: target.ID,
 				Index: legacyReferencedIndexID,
diff --git a/pkg/sql/sqlbase/structured.go b/pkg/sql/sqlbase/structured.go
index 5b7b507633..afc7d75be5 100644
--- a/pkg/sql/sqlbase/structured.go
+++ b/pkg/sql/sqlbase/structured.go
@@ -988,7 +988,7 @@ func maybeUpgradeForeignKeyRepOnIndex(
 func (desc *TableDescriptor) MaybeDowngradeForeignKeyRepresentation(
 	ctx context.Context, clusterSettings *cluster.Settings,
 ) (bool, *TableDescriptor, error) {
-	downgradeUnnecessary := clusterSettings.Version.IsActive(cluster.VersionTopLevelForeignKeys)
+	downgradeUnnecessary := clusterSettings.Version.IsActive(cluster.VersionTopLevelForeignKeys) && false
 	if downgradeUnnecessary {
 		return false, desc, nil
 	}
```

Release note: None
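The two ideas in the description — resolving referenced descriptors from the backup itself rather than via `Txn`, and skipping FKs whose referenced table is missing — could be sketched roughly as follows. The types and the `upgradeFKs` helper are simplified stand-ins, not the PR's actual code:

```go
package main

import "fmt"

// Simplified stand-in for a table descriptor.
type TableDescriptor struct {
	ID   int
	Name string
}

// backupProtoGetter resolves referenced table descriptors from the
// descriptors already present in the backup, playing the role that a
// Txn-backed lookup plays for a live cluster.
type backupProtoGetter struct {
	byID map[int]*TableDescriptor
}

func (g backupProtoGetter) getTable(id int) (*TableDescriptor, bool) {
	td, ok := g.byID[id]
	return td, ok
}

// upgradeFKs models the skipFKsWithNoMatchingTable behavior: references to
// tables missing from the backup are skipped rather than failing the upgrade.
func upgradeFKs(refs []int, g backupProtoGetter, skipMissing bool) ([]string, error) {
	var resolved []string
	for _, id := range refs {
		td, ok := g.getTable(id)
		if !ok {
			if skipMissing {
				continue // FK can't be restored anyway; drop it
			}
			return nil, fmt.Errorf("referenced table %d not in backup", id)
		}
		resolved = append(resolved, td.Name)
	}
	return resolved, nil
}

func main() {
	g := backupProtoGetter{byID: map[int]*TableDescriptor{1: {ID: 1, Name: "orders"}}}
	names, err := upgradeFKs([]int{1, 2}, g, true)
	fmt.Println(names, err)
}
```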
Force-pushed 3947732 to 9063050.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis)
pkg/ccl/backupccl/backup.go, line 1044 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
semi-nit: Could we try a little harder to guarantee that nobody reads those TDs as the code migrates? For example, you could nil the TableDescriptors field on this struct to make sure that if somebody gets confused they actually notice this message and reason about what's going on.
Okay, I realized that we do actually read the table descriptors since we need the table/index spans from past incremental backups, though we still don't read the FKs. This does work for now, but I think it will be safer for future code to just do the upgrade. I added a TODO for myself.
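The "read the spans but not the FKs" situation described above could be sketched like this, with simplified stand-in types (not the real `backupccl` structs):

```go
package main

import "fmt"

// span is a simplified key span.
type span struct{ start, end string }

// tableDesc carries only what this code path needs: the primary span.
// Legacy or new-style FK fields would also live here, but neither is read.
type tableDesc struct {
	primarySpan span
}

type backupDesc struct{ tables []tableDesc }

// coveredSpans collects the spans covered by prior incremental backups,
// which is why the table descriptors are read at all on this path.
func coveredSpans(prev []backupDesc) []span {
	var spans []span
	for _, b := range prev {
		for _, t := range b.tables {
			spans = append(spans, t.primarySpan)
		}
	}
	return spans
}

func main() {
	prev := []backupDesc{{tables: []tableDesc{{primarySpan: span{"/Table/52", "/Table/53"}}}}}
	fmt.Println(len(coveredSpans(prev)))
}
```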
pkg/ccl/backupccl/backup.go, line 1279 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
ditto above.
This one is different, since we basically just read the backup descriptor to append more `File`s to it, ignoring the table descriptors. So we can't nil out the table descriptors, but I updated the comment to make it more clear.
pkg/ccl/backupccl/backup.go, line 1487 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Could you please add a godoc to this method that explains the deal with the "maybe"?
Done.
pkg/ccl/backupccl/backup.go, line 1507 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
ditto Godoc.
Done.
pkg/ccl/backupccl/restore.go, line 1648 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Are we supposed to unconditionally skip stuff here? I thought there was a setting about it.
We check that setting in `doRestorePlan`, which is sufficient. I expanded the doc comment for the function.
pkg/ccl/backupccl/show.go, line 73 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Please add a comment why this is true - why do we get to skip non-matching tables here?
Done.
pkg/ccl/cliccl/load.go, line 65 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
ditto comment about preventing people from reading TD's, unless they need to here?
We do read the table descriptors below, but only to get the IDs. I can add another comment where we actually extract the table descriptors individually (though the potential problem is more general than just the FK upgrade).
bors r+
39757: backupccl: fix backup/restore after FK table descriptor changes r=lucy-zhang a=lucy-zhang

This PR is based on #39474.

Fixes #39753.

Release note: None

Co-authored-by: Lucy Zhang <lucy-zhang@users.noreply.github.com>
Build succeeded