Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mysqld.GetSchema: tolerate tables being dropped while inspecting schema #12641

Merged

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Mar 16, 2023

Description

Identify the two scenarios where a table may be dropped while we're inspecting the schema. Either we get a (errno 1146) error code, or the table does not appear in information_schema even after it did, before.

The fix is to silently ignore tables hitting this scenario.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

hmmmm got this relevant error in Unit_test:

2023-03-16T13:25:15.6015797Z === RUN   TestDeferSecondaryKeys/3FPK1SK_ShardMerge
2023-03-16T13:25:15.6015939Z === RUN   TestDeferSecondaryKeys/0FPK2tSK
2023-03-16T13:25:15.6016109Z     vreplicator_test.go:443: 
2023-03-16T13:25:15.6016645Z         	Error Trace:	/home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/vreplication/vreplicator_test.go:443
2023-03-16T13:25:15.6016885Z         	Error:      	Received unexpected error:
2023-03-16T13:25:15.6017622Z         	            	unexpected number of table definitions returned from GetSchema call for table "t1": 0
2023-03-16T13:25:15.6017881Z         	Test:       	TestDeferSecondaryKeys/0FPK2tSK
2023-03-16T13:25:15.6018020Z === RUN   TestDeferSecondaryKeys/2SKRetryNoErr
2023-03-16T13:25:15.6018181Z === RUN   TestDeferSecondaryKeys/2SKRetryNoErr2
2023-03-16T13:25:15.6018338Z === RUN   TestDeferSecondaryKeys/SKSuperSetNoErr

@shlomi-noach
Copy link
Contributor Author

I'm not sure what the situation is in vreplicator_test.go calling stashSecondaryKeys(), that results with an empty schema. That is, I'm not sure if the new logic is doing the right thing, and this is unexpected to stashSecondaryKeys(), or whether the new logic is doing something wrong. I'm not sure if a table is being dropped here.

@shlomi-noach
Copy link
Contributor Author

Ah, I get it. A table without PRIMARY KEY. Converting this PR to draft while I fix it.

@shlomi-noach shlomi-noach marked this pull request as draft March 16, 2023 13:37
…ped. It can also mean the table does not have PRIMARY KEY

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

OK, the assumption about the 2nd race condition was incorrect. A table that does not appear in INFORMATION_SCHEMA.STATISTICS isn't necessarily dropped. It may also be a table that does not have any keys.

Removed the 2nd race condition handling altogether. It a table does end up being dropped before calculating primary key, it will just have an implicit nil primary key.

@shlomi-noach shlomi-noach marked this pull request as ready for review March 16, 2023 13:41
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating this! Nice debugging work. I only had some minor comments.

// This is fine. We identify the situation and continue to remove the table from our records
sqlErr, isSQLErr := mysql.NewSQLErrorFromError(err).(*mysql.SQLError)
if isSQLErr && sqlErr != nil && sqlErr.Number() == mysql.ERNoSuchTable {
td.Fields = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead remove td from tds here, or set td = nil? I feel like the former would be the right thing to do so that our memory representation tracks the actual live mysqld state (which is the point of that component AFAIUI).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I rewrote this in a different way. Problem is that we're iterating tds concurrently. Any change to tds itself during iteration is less than ideal. So instead I'm building a validTds slice which in turn is the final list of table definitions.

go/vt/mysqlctl/schema.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/schema.go Outdated Show resolved Hide resolved
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving so that you can merge whenever all the tests are passing.

IMO, we should re-run the affected test(s) 5-10 times before merging to be sure that they now pass each time — or at least don't have the same sub-test failure which led to the investigation — and we've fully addressed the issue.

Thanks again for great investigative work!

@@ -109,6 +112,15 @@ func (mysqld *Mysqld) GetSchema(ctx context.Context, dbName string, request *tab

fields, columns, schema, err := mysqld.collectSchema(ctx, dbName, td.Name, td.Type, request.TableSchemaOnly)
if err != nil {
// there's a possible race condition: it could happen that a table was dropped in between reading
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF you have to do another push for some reason, we can capitalize There's here. It's annoyingly nitty, I know, so no need unless we have to do another commit/push for another reason.

@shlomi-noach
Copy link
Contributor Author

IMO, we should re-run the affected test(s) 5-10 times before merging to be sure

I've ran this on local env some 20 times and it passed all runs. I'm signing out for the week and won't be watching CI to kick restarts. If this can wait until Sunday, I can verify and merge then. If not, anyone should feel free to test/retry and merge.

@mattlord
Copy link
Contributor

IMO, we should re-run the affected test(s) 5-10 times before merging to be sure

I've ran this on local env some 20 times and it passed all runs. I'm signing out for the week and won't be watching CI to kick restarts. If this can wait until Sunday, I can verify and merge then. If not, anyone should feel free to test/retry and merge.

Sounds good. You deserve a clean break for the weekend! 🙂 I can periodically re-run the test, as I'm doing that on another PR too.

@shlomi-noach
Copy link
Contributor Author

Multiple tests are consistently failing on this PR. Example: vreplication_basic: https://github.com/vitessio/vitess/actions/runs/4438816245/jobs/7831448081?pr=12641 with error:

2023-03-19T05:54:35.2013877Z Reshard Error: rpc error: code = Unknown desc = copySchema: CopySchemaShard was not successful because the schemas between the two tablets cell:"zone1" uid:200 and cell:"zone1" uid:600 differ: [dest has an extra table named cproduct dest has an extra table named customer dest has an extra table named db_order_test source has an extra table named db_order_test source has an extra table named cproduct source has an extra table named customer]

vtgate_schema has a similar error:

Mar 19 05:54:00         	Error Trace:	go/test/endtoend/vtgate/schema/schema_test.go:249
Mar 19 05:54:00         	            				go/test/endtoend/vtgate/schema/schema_test.go:110
Mar 19 05:54:00         	Error:      	Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0006de408), Stderr:[]uint8(nil)}
Mar 19 05:54:00         	Test:       	TestSchemaChange

I'm going to try a different angle to this PR. Instead of removing the dropped table from the result set, I'm going to keep it there, with empty table definition, ie empty keys, empty columns etc.

…we keep the table, but with empty column/key/fields info

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

@mattlord @rohit-nayak-ps could you give this another look? I've simplified the change to just returning an incomplete table description. Previously, when I excluded the table description from the result table list, many tests failed, consistently, see above. Now, everything seems to pass.

@shlomi-noach
Copy link
Contributor Author

PS I'm re-running onlineddl_vrepl in CI just to ensure it passes consistently.

@shlomi-noach
Copy link
Contributor Author

So with the current implementation Columns and Fields remain nil. I reviewed all usage and nothing is susceptible to nil pointer access; everything just properly ranges over the slices.

Tests are super happy. I think this should be good.

@rohit-nayak-ps
Copy link
Contributor

@mattlord @rohit-nayak-ps could you give this another look? I've simplified the change to just returning an incomplete table description. Previously, when I excluded the table description from the result table list, many tests failed, consistently, see above. Now, everything seems to pass.

Looks good @shlomi-noach, thanks!

@shlomi-noach shlomi-noach merged commit 0857b7d into vitessio:main Mar 20, 2023
@shlomi-noach shlomi-noach deleted the get-schema-ignore-dropped-tables branch March 20, 2023 09:41
@shlomi-noach
Copy link
Contributor Author

waiting to see if the bot is able to cherry pick this

shlomi-noach added a commit to planetscale/vitess that referenced this pull request Mar 20, 2023
…ma (vitessio#12641)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* populate validTds rather than rely on nil hints

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* re-introdce earlier check

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* use validTds, sync

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* grammar

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Mar 20, 2023
…ma (vitessio#12641)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* populate validTds rather than rely on nil hints

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* re-introdce earlier check

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* use validTds, sync

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* grammar

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Mar 20, 2023
…ma (vitessio#12641)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* populate validTds rather than rely on nil hints

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* re-introdce earlier check

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* use validTds, sync

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* grammar

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit that referenced this pull request Mar 20, 2023
…ma (#12641) (#12666)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema



* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY



* populate validTds rather than rely on nil hints



* re-introdce earlier check



* use validTds, sync



* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info



* grammar



---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit that referenced this pull request Mar 20, 2023
…ma (#12641) (#12665)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema



* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY



* populate validTds rather than rely on nil hints



* re-introdce earlier check



* use validTds, sync



* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info



* grammar



---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit that referenced this pull request Mar 22, 2023
…ma (#12641) (#12664)

* Mysqld.GetSchema: tolerate tables being dropped while inspecting schema



* lack of primary key columns in STATISTICS does not mean table is dropped. It can also mean the table does not have PRIMARY KEY



* populate validTds rather than rely on nil hints



* re-introdce earlier check



* use validTds, sync



* due to many tests consistently failing, trying a different approach: we keep the table, but with empty column/key/fields info



* grammar



---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Mysqld.GetSchema() race condition when table is dropped throughout inspection
4 participants