Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddl: rollingback add index meets panic will lead json unmarshal object error #23321

Closed
AilinKid opened this issue Mar 15, 2021 · 3 comments · Fixed by #23913
Closed

ddl: rollingback add index meets panic will lead json unmarshal object error #23321

AilinKid opened this issue Mar 15, 2021 · 3 comments · Fixed by #23913
Assignees
Labels
severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@AilinKid
Copy link
Contributor

AilinKid commented Mar 15, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/tidb_ghpr_check/detail/tidb_ghpr_check/75909/pipeline/

On every occasional way, when I cherry-pick the pr to release-4.0, I meet a very strange test failure with message info like

[2021-03-15T07:13:17.203Z] db_integration_test.go:1308:
[2021-03-15T07:13:17.203Z]     s.tk.MustGetErrCode(sql, errno.ErrDupEntry)
[2021-03-15T07:13:17.203Z] /home/jenkins/agent/workspace/tidb_ghpr_check/go/src/github.com/pingcap/tidb/util/testkit/testkit.go:300:
[2021-03-15T07:13:17.203Z]     tk.c.Assert(int(sqlErr.Code), check.Equals, errCode, check.Commentf("Assertion failed, origin err:\n  %v", sqlErr))
[2021-03-15T07:13:17.203Z] ... obtained int = 1105
[2021-03-15T07:13:17.203Z] ... expected int = 1062
[2021-03-15T07:13:17.203Z] ... Assertion failed, origin err:
[2021-03-15T07:13:17.203Z]   ERROR 1105 (HY000): json: cannot unmarshal object into Go value of type bool

At the first beginning, I thought it has nothing to do with my PR. After several hours of investigating it, I found it is a quite rare test case triggered by my global failpoint panic in the DDL's drop index logic, which should only be used in the serialTestSuite but I forgot.

1. Minimal reproduce step (Required)

Anyway, thanks to the misused failpoint, we found a very wired DDL problem, caused by two different canceling DDL entrances.

Here is the error's triggering logic.

1: when add-index meets error in reorg state (which could be [kv:1062]Duplicate entry )

2: since this error could NOT be ignored, we convert this add-index to a rollingback job (which will rewrite the job.args and schema-state for drop-index logic, and set the job.state as JobStateRollingback )

3: In the next DDL round, the add-index job with JobStateRollingback state will walk through the rollingback logic in onCreateIndex function, which will step into onDropIndex.

4: If there is panic in the drop-index logic, this ddl job will be pulled up and set the job.state as canceling (but its job.args has already been rewritten for rollingback logic)

5: in the next DDL round, this half-way rollingback job will be recognized as a brand-new canceling job again, which will step into unified canceling entrance convertJob2RollbackJob and it will take it for granted that it's args is what it used to be, because it's job.Type is add-index. The decoding error is finally stored as a job error here.

2. What did you expect to see? (Required)

A job that has been pulled up from a panic should be handled correctly. Maybe we should unify the canceling entrance.

3. What did you see instead (Required)

A very strange decoding error for DDL failure and the remained index hasn't been cleaned either, which is hard to locate and reproduced.

4. What is your TiDB version? (Required)

master, 4.0...

@AilinKid AilinKid added the type/bug The issue is confirmed as a bug. label Mar 15, 2021
@AilinKid
Copy link
Contributor Author

PTAL @zimulala @bb7133

@Howie59
Copy link
Contributor

Howie59 commented Mar 19, 2021

/assign

@ti-srebot
Copy link
Contributor

Please edit this comment or add a new comment to complete the following information

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make Sure that 'component', and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

2. Symptom (optional)

3. All Trigger Conditions (optional)

4. Workaround (optional)

5. Affected versions

6. Fixed versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants