
truncate table with many partitions with tiflash replica may encounter write conflict and retry #42940

Closed
lcwangchao opened this issue Apr 11, 2023 · 3 comments · Fixed by #42957

Comments

@lcwangchao (Collaborator)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. create a table with many partitions
  2. set the table with tiflash replica
  3. truncate the table frequently
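For example, the steps above might look like the following in SQL (a sketch; the table name, column definitions, and partition count are illustrative, not from the report):

```sql
-- 1. create a table with many partitions
CREATE TABLE t (id INT, v INT)
    PARTITION BY HASH (id) PARTITIONS 1024;

-- 2. set the table with a tiflash replica
ALTER TABLE t SET TIFLASH REPLICA 1;

-- 3. truncate the table repeatedly
TRUNCATE TABLE t;
TRUNCATE TABLE t;
```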

2. What did you expect to see? (Required)

3. What did you see instead (Required)

Sometimes you can see errors like the following in the log:

[2023/04/10 18:37:05.125 +08:00] [INFO] [job_table.go:289] ["[ddl] handle ddl job failed"] [error="[kv:9007]Write conflict, txnStartTS=440696311416356887, conflictStartTS=440696311678500883, conflictCommitTS=440696311678500884, key=[]byte{0x6d, 0x4e, 0x65, 0x78, 0x74, 0x47, 0x6c, 0x6f, 0x62, 0xff, 0x61, 0x6c, 0x49, 0x44, 0x0, 0x0, 0x0, 0x0, 0xfb, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x73}, originalKey=6d4e657874476c6f62ff616c494400000000fb0000000000000073, primary={metaKey=true, key=DB:2, field=Table:1014}, originalPrimaryKey=6d44423a3200000000fb00000000000000685461626c653a3130ff3134000000000000f9, reason=Optimistic [try again later]"] [job="ID:1015, Type:truncate table, State:done, SchemaState:public, SchemaID:2, TableID:712, RowCount:0, ArgLen:2, start time: 2023-04-10 18:36:54.081 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2023/04/10 18:37:05.127 +08:00] [INFO] [ddl_worker.go:944] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:1015, Type:truncate table, State:queueing, SchemaState:none, SchemaID:2, TableID:712, RowCount:0, ArgLen:0, start time: 2023-04-10 18:36:54.081 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]

Because of the write conflict, the DDL job retries, and the user has to wait longer for the DDL to finish.

4. What is your TiDB version? (Required)

master

@lcwangchao lcwangchao added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra severity/major labels Apr 11, 2023
@ti-chi-bot ti-chi-bot added may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 labels Apr 11, 2023
@lcwangchao lcwangchao added affects-6.0 affects-6.1 affects-6.2 affects-6.3 affects-6.4 affects-6.5 affects-6.6 affects-7.0 and removed may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-6.1 may-affects-6.5 labels Apr 11, 2023
@lcwangchao (Collaborator, Author) commented Apr 11, 2023

This is because we allocate new partition IDs while the DDL job is running inside a transaction:

tidb/ddl/partition.go

Lines 3324 to 3337 in 8eb580e

func truncateTableByReassignPartitionIDs(t *meta.Meta, tblInfo *model.TableInfo) error {
	newDefs := make([]model.PartitionDefinition, 0, len(tblInfo.Partition.Definitions))
	for _, def := range tblInfo.Partition.Definitions {
		pid, err := t.GenGlobalID()
		if err != nil {
			return errors.Trace(err)
		}
		newDef := def
		newDef.ID = pid
		newDefs = append(newDefs, newDef)
	}
	tblInfo.Partition.Definitions = newDefs
	return nil
}

We also allocate job IDs from the same auto-increment meta key, so when there are many DDL jobs, the conflict probability increases.

The reason there are so many DDL jobs is that after a truncate, TiFlash sends an "update tiflash replica status" DDL request to update the TiFlash meta for each partition, so when there are many partitions there will be a lot of jobs.

@bb7133 (Member) commented Apr 11, 2023

Yeah... makes sense

But this is not about correctness, am I right?

@mjonss mjonss self-assigned this Apr 11, 2023
@lcwangchao (Collaborator, Author)

> Yeah... makes sense
>
> But this is not about correctness, am I right?

Yes, it is not about correctness; it is only performance related.
