Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add descriptor validation on write #60552

Merged
merged 3 commits into from
Feb 18, 2021

Conversation

postamar
Copy link
Contributor

Previously, we didn't systematically validate descriptors when they were
written. Furthermore, there existed no common method to validate
descriptors across all descriptor subtypes

This commit adds three methods to the catalog.Descriptor interface:

  1. ValidateSelf ( context.Context ) error
  2. Validate ( context.Context, catalog.DescGetter ) error
  3. ValidateTxnCommit ( context.Context, catalog.DescGetter) error

Each performs a subset the checks performed by the next. ValidateSelf
contains all checks which can be performed in isolation, Validate also
performs all those involving DescGetters (i.e. cross-reference checks)
and ValidateTxnCommit also performs checks which should only be done at
commit-time. An example of the latter is checking that a table has
a primary key: dropping the PK is allowed within a transaction as long
as a new PK is subsequently provided before committing.

This commit adds new validation calls when writing descriptors:

  1. ValidateSelf is called prior to Collection adding a descriptor Put
    to a kv.Batch. At this point, we want descriptors to be at least
    internally-consistent, both to catch validation errors early and
    because it's not possible to do cross-reference checking at this
    point (changes on FKs for instance involve multiple descriptors).
  2. ValidateTxnCommit is called on the descs.Collection's uncommitted
    descriptors when the corresponding txn is about to commit, just
    prior to the two-version-invariant check.

These validations may be disabled using a new cluster setting:
sql.catalog.descs.validate_on_write.enabled
Setting this to false makes it possible to corrupt the descriptor state
using the crdb_internal.unsafe_* functions.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@postamar
Copy link
Contributor Author

@ajwerner this still has the sql.catalog.descs.validate_on_write.enabled setting because it was built on #60492 but it's really not something I feel strongly about.

@postamar postamar force-pushed the add-desc-validation-on-write branch 2 times, most recently from 86f7e05 to fca1cfc Compare February 15, 2021 20:26
@postamar
Copy link
Contributor Author

In the last 3 runs, CI failed on at least one instance of TestLogic/*/enums/schema_changes with:

        testdata/logic_test/enums:759: 
        expected success, but found
        (42704) index "i" does not exist

for the following statement

CREATE TABLE enum_parent (x greeting PRIMARY KEY);
CREATE TABLE enum_child (x greeting, y greeting, PRIMARY KEY (x, y)) INTERLEAVE IN PARENT enum_parent (x);
INSERT INTO enum_parent VALUES ('hello');
INSERT INTO enum_child VALUES ('hello', 'hello');
CREATE INDEX i ON enum_child (x, y) INTERLEAVE IN PARENT enum_parent (x);
DROP INDEX enum_child@i

I tried to repro the failure locally to no avail, trying different test configs, native build vs docker build, race detector enabled... The failure definitely seems related to this change though. Any ideas?

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fun failure. I wonder what that's about.

One though I'm having is that perhaps this is something we want to be doing after statement execution too. It'd be weird if we allowed things which used to be invalid to happen during the statement but then caught them at commit time. Maybe this isn't such a big deal because we are doing ValidateSelf when writing the descriptor

Reviewed 28 of 28 files at r1, 5 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/catalog/descriptor.go, line 77 at r1 (raw file):

	// Validate validates the descriptor.
	// Cross-reference checks are skipped when dg is nil.

nit: it's called descGetter here.


pkg/sql/catalog/descs/collection.go, line 1415 at r1 (raw file):

var _ catalog.DescGetter = collectionDescGetter{}

func (cdg collectionDescGetter) GetDesc(

I think that this is right but it's interesting. Namely, if we ever mucked with a descriptor and didn't cache it in the collection (just wrote it to the store), we'd have problems because this will use a leased descriptor.


pkg/sql/catalog/descs/collection.go, line 1439 at r1 (raw file):

Quoted 12 lines of code…
func (cdg collectionDescGetter) GetDescs(
	ctx context.Context, reqs []descpb.ID,
) ([]catalog.Descriptor, error) {
	ret := make([]catalog.Descriptor, len(reqs))
	for i, id := range reqs {
		desc, err := cdg.GetDesc(ctx, id)
		if err != nil {
			return nil, err
		}
		ret[i] = desc
	}
	return ret, nil

I'm feeling somewhat sad about the need to do this. I'm thinking we should just have a function in catalog called GetDescs that takes a DescGetter and then tries to type assert it into a batched getter. This is all to enable an optimization where we can batched fetch descriptors in some cases. It's not obviously worth it.


pkg/sql/catalog/schemadesc/schema_desc.go, line 183 at r2 (raw file):

}

// Validate punts to ValidateSelf.

This is an odd comment given it does other stuff too

Copy link
Contributor Author

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and thanks for helping me look into that test failure earlier today, much appreciated.

With regards to your comment regarding per-statement validation, I believe that that's effectively already the case today. Apparently the primary index of a table can be disabled during a transaction as long as it's re-enabled come commit time. Bit of an edge case though and this doesn't mean we shouldn't be validating after executing individual statements like you suggest.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/catalog/descriptor.go, line 77 at r1 (raw file):

Previously, ajwerner wrote…

nit: it's called descGetter here.

Done.


pkg/sql/catalog/descs/collection.go, line 1415 at r1 (raw file):

Previously, ajwerner wrote…

I think that this is right but it's interesting. Namely, if we ever mucked with a descriptor and didn't cache it in the collection (just wrote it to the store), we'd have problems because this will use a leased descriptor.

Agreed.


pkg/sql/catalog/descs/collection.go, line 1439 at r1 (raw file):

Previously, ajwerner wrote…
func (cdg collectionDescGetter) GetDescs(
	ctx context.Context, reqs []descpb.ID,
) ([]catalog.Descriptor, error) {
	ret := make([]catalog.Descriptor, len(reqs))
	for i, id := range reqs {
		desc, err := cdg.GetDesc(ctx, id)
		if err != nil {
			return nil, err
		}
		ret[i] = desc
	}
	return ret, nil

I'm feeling somewhat sad about the need to do this. I'm thinking we should just have a function in catalog called GetDescs that takes a DescGetter and then tries to type assert it into a batched getter. This is all to enable an optimization where we can batched fetch descriptors in some cases. It's not obviously worth it.

Agreed. I did something along those lines.


pkg/sql/catalog/schemadesc/schema_desc.go, line 183 at r2 (raw file):

Previously, ajwerner wrote…

This is an odd comment given it does other stuff too

Fixed.

@postamar postamar force-pushed the add-desc-validation-on-write branch 2 times, most recently from da5d121 to 977f456 Compare February 16, 2021 23:04
@postamar postamar marked this pull request as ready for review February 16, 2021 23:04
@postamar postamar requested review from a team and miretskiy and removed request for a team February 16, 2021 23:04
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! mostly just the suggestion and making sure your test completes (that failure looks odd)

Reviewed 4 of 11 files at r3, 2 of 5 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @miretskiy, and @postamar)


pkg/sql/conn_executor_exec.go, line 751 at r5 (raw file):

	}

	if err := ex.extraTxnState.descCollection.ValidateUncommittedDescriptors(ctx, ex.state.mu.txn); err != nil {

❤️


pkg/sql/catalog/desc_getter.go, line 28 at r5 (raw file):

Quoted 5 lines of code…
	GetDescs(ctx context.Context, reqs []descpb.ID) ([]Descriptor, error)
}
// GetDescsSequential is a helper function for implementing the GetDescs
// method of DescGetter as a sequence of GetDesc calls.
func GetDescsSequential(

Can we take this one further and do something like this:

// DescGetter is an interface to retrieve descriptors. In general the interface
// is used to look up other descriptors during validation.
type DescGetter interface {
	GetDesc(ctx context.Context, id descpb.ID) (Descriptor, error)
}

// BatchDescGetter is an interface to retrieve descriptors in a batch. It can be implemented by
// DescGetter to improve the performance of the GetDescs() function.
type BatchDescGetter interface {
        GetDescs(ctx context.Context, reqs []descpb.ID) ([]Descriptor, error)
}

// GetDescs is a helper to retrieve multiple descriptors using a DescGetter. If dg implements
// BatchDescGetter, it will be utilized.
func GetDescs(ctx context.Context, dg DescGetter, reqs []descpb.ID) ([]Descriptor, error) {
    if bdg, ok := dg.(BatchDescGetter); ok {
        return bdg.GetDescs(ctx, reqs)
    }
    ret := make([]Descriptor, len(reqs))
    for i, id := range reqs {
	desc, err := descGetter.GetDesc(ctx, id)
	if err != nil {
		return nil, err
	}
	ret[i] = desc
    }
    return ret, nil
}

@postamar postamar force-pushed the add-desc-validation-on-write branch 2 times, most recently from 2bbe7d3 to 8b67cc9 Compare February 17, 2021 13:58
Copy link
Contributor Author

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm hoping it should be good now.

The failure was caused by me checking the error result of the SET CLUSTER SETTING statement. Looking at the rest of the codebase it's common to just ignore that error so I "fixed" this failure by doing the same. Looking more closely at where the error comes from it seems we just wait 10 seconds before calling it quits: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/set_cluster_setting.go#L336
Perhaps that part of the code could do with a bit of love? Should I file an issue?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)


pkg/sql/catalog/desc_getter.go, line 28 at r5 (raw file):

Previously, ajwerner wrote…
	GetDescs(ctx context.Context, reqs []descpb.ID) ([]Descriptor, error)
}
// GetDescsSequential is a helper function for implementing the GetDescs
// method of DescGetter as a sequence of GetDesc calls.
func GetDescsSequential(

Can we take this one further and do something like this:

// DescGetter is an interface to retrieve descriptors. In general the interface
// is used to look up other descriptors during validation.
type DescGetter interface {
	GetDesc(ctx context.Context, id descpb.ID) (Descriptor, error)
}

// BatchDescGetter is an interface to retrieve descriptors in a batch. It can be implemented by
// DescGetter to improve the performance of the GetDescs() function.
type BatchDescGetter interface {
        GetDescs(ctx context.Context, reqs []descpb.ID) ([]Descriptor, error)
}

// GetDescs is a helper to retrieve multiple descriptors using a DescGetter. If dg implements
// BatchDescGetter, it will be utilized.
func GetDescs(ctx context.Context, dg DescGetter, reqs []descpb.ID) ([]Descriptor, error) {
    if bdg, ok := dg.(BatchDescGetter); ok {
        return bdg.GetDescs(ctx, reqs)
    }
    ret := make([]Descriptor, len(reqs))
    for i, id := range reqs {
	desc, err := descGetter.GetDesc(ctx, id)
	if err != nil {
		return nil, err
	}
	ret[i] = desc
    }
    return ret, nil
}

Done. Sorry I misunderstood you earlier, I thought you wanted to type switch on the interface implementation. Still not completely used to go's duck typing I guess.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand the settings thing you're saying but otherwise :lgtm:

Perhaps that part of the code could do with a bit of love? Should I file an issue?

Yes

Reviewed 1 of 11 files at r3, 12 of 12 files at r6, 5 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @postamar)


pkg/sql/schema_changer.go, line 420 at r6 (raw file):

				if err := WaitToUpdateLeases(ctx, sc.leaseMgr, ancestor.TableID); err != nil {
					return err
				}

nit: accumulate the set of tables and call WaitToUpdateLeasesMultiple?


pkg/sql/catalog/desc_getter.go, line 27 at r6 (raw file):

// batchDescGetter is like DescGetter but retrieves batches of descriptors,
// which for some implementation may make more sense performance-wise.
type batchDescGetter interface {

style nit: I think it's valuable to export this as a well to tell an implementer of DescGetter that there's an extension which can be implemented.


pkg/sql/tests/repair_test.go, line 252 at r6 (raw file):

	setup := func(t *testing.T) (serverutils.TestServerInterface, *gosql.DB, func()) {
		s, db, _ := serverutils.StartServer(t, base.TestServerArgs{})
		_, _ = db.Exec(`SET CLUSTER SETTING sql.catalog.descs.validate_on_write.enabled = false`)

wait, why does this ever fail?

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still confused about the failing setting thing but :lgtm_strong:

Reviewed 5 of 11 files at r9, 5 of 5 files at r10, 1 of 1 files at r11.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @postamar)

@postamar
Copy link
Contributor Author

In the tests we usually don't check that a SET CLUSTER SETTING statement has effectively updated the setting value in the whole cluster via gossip. What happens is that these statements wait for 10 seconds then return an error setting updated but timed out waiting to read new value which, in many tests, is ignored. This was surprising to me at first but I'm starting to see the logic of it now. The only problem I have with that code now is that this 10 seconds timeout threshold is a magic number, I feel it should be defined as a session variable. Instead of an issue I'll just file a PR.

In the test case in this PR we set validate_on_write to true and this sometimes triggers this error under stress. I'm guessing that the reason it's no big deal is that the setting value still gets changed locally and unit tests such as this one run on single node clusters anyway so it doesn't matter that the gossip still hasn't done its thing.

@postamar
Copy link
Contributor Author

I forgot to mention WaitToUpdateLeasesMultiple takes a slice of lease.IDVersion so that's why I didn't use it. I'm thinking it should probably take a catalog.DescriptorIDSet instead.

@postamar
Copy link
Contributor Author

@ajwerner following your suggestion re. SET CLUSTER SETTINGS I rebased on master and make stress no longer triggers the timeout error on those statements, so I'm pushing that change. Hopefully the same will hold for CI.

Marius Posta added 2 commits February 18, 2021 10:48
Previously, we didn't systematically validate descriptors when they were
written. Furthermore, there existed no common method to validate
descriptors across all descriptor subtypes

This commit adds three methods to the catalog.Descriptor interface:
  1. ValidateSelf ( context.Context ) error
  2. Validate ( context.Context, catalog.DescGetter ) error
  3. ValidateTxnCommit ( context.Context, catalog.DescGetter) error

Each performs a subset the checks performed by the next. ValidateSelf
contains all checks which can be performed in isolation, Validate also
performs all those involving DescGetters (i.e. cross-reference checks)
and ValidateTxnCommit also performs checks which should only be done at
commit-time. An example of the latter is checking that a table has
a primary key: dropping the PK is allowed within a transaction as long
as a new PK is subsequently provided before committing.

This commit adds new validation calls when writing descriptors:
  1. ValidateSelf is called prior to Collection adding a descriptor Put
     to a kv.Batch. At this point, we want descriptors to be at least
     internally-consistent, both to catch validation errors early and
     because it's not possible to do cross-reference checking at this
     point (changes on FKs for instance involve multiple descriptors).
  2. ValidateTxnCommit is called on the descs.Collection's uncommitted
     descriptors when the corresponding txn is about to commit, just
     prior to the two-version-invariant check. The Collection itself is
     used as a catalog.DescGetter.

These validations may be disabled using a new cluster setting:
    sql.catalog.descs.validate_on_write.enabled
Setting this to false makes it possible to corrupt the descriptor state
using the crdb_internal.unsafe_* functions.

Release note: None
Previously, the Validate() method on catalog.SchemaDescriptor did
nothing. This commit adds validation checks that verify that the
parentID exists and that that database descriptor contains the correct
entry in its `schemas` mapping.

Fixes cockroachdb#53439.

Release note: None
Previously this method populated a couple of maps but didn't actually do
anything with them besides that.

Release note: None
@postamar
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 18, 2021

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Feb 18, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 18, 2021

Build succeeded:

@craig craig bot merged commit d67d6cc into cockroachdb:master Feb 18, 2021
@postamar postamar deleted the add-desc-validation-on-write branch February 18, 2021 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants