[Bug]: [0221]Tpcc 100w 1000t test on tke report 'w-w conflict'. #14617

Closed
1 task done
Ariznawlll opened this issue Feb 21, 2024 · 13 comments

@Ariznawlll

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

1.1-dev

Commit ID

443d443

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7967811542/job/21754324166

image

log: http://175.178.192.213:30088/explore?panes=%7B%22AAL%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22branch-reg-443d443%5C%22%7D%20%7C%3D%20%60w-w%20conflict%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1

The offending commit is one of the commits shown below (bisection in progress):
image

Expected Behavior

No response

Steps to Reproduce

Trigger the TPC-C test on TKE using branch '1.1-dev' (commit 443d443).

Additional information

No response

@Ariznawlll Ariznawlll added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Feb 21, 2024
@Ariznawlll Ariznawlll added this to the 1.2.0 milestone Feb 21, 2024
@aptend aptend assigned aptend and unassigned matrix-meow Feb 21, 2024
@aptend

aptend commented Feb 28, 2024

Reproduced once on 128: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8076839865/job/22065987668

94683:{"level":"INFO","time":"2024/02/28 15:41:57.273472 +0800","name":"log-service.frontend","caller":"frontend/mysql_cmd_executor.go:3709","msg":"time of Exec.Build : 7.875111453s","uuid":"7c4dccb4-4d3c-41f8-b482-5251dc7a41bf","session_info":"connectionId 934|127.0.0.1:45286|{account sys:dump:moadmin -- 0:1:0}|goRoutineId 10180|018deea6-5422-7e6c-b3f9-84684847a53e","session_id":"018deea6-5422-7e6c-b3f9-84684847a53e","statement_id":"018deea9-9bd6-7301-9db3-b5dd7dca2843","txn_id":"018deea99bd673d085cf4b9f5421dc3a/Active/S:1709106109394056670-1"}
...
96849:{"level":"INFO","time":"2024/02/28 15:42:04.505037 +0800","caller":"txnimpl/table.go:312","msg":"[Start]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea8-7a32-7328-9cbe-545c46d1b3e6-0-0>","phase":"Phase_Freeze"}
96860:{"level":"INFO","time":"2024/02/28 15:42:04.505297 +0800","caller":"txnimpl/table.go:274","msg":"depth-0 bmsql_stock transfer delete from blk-018deea8-7a32-7328-9cbe-545c46d1b3e6-0-0 row-7572 to blk-018deea9-c619-7179-a955-2fde70311a7f-0-1 row-1042"}
96861:{"level":"INFO","time":"2024/02/28 15:42:04.519764 +0800","caller":"txnimpl/table.go:320","msg":"[End]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea8-7a32-7328-9cbe-545c46d1b3e6-0-0>","phase":"Phase_Freeze"}
96862:{"level":"INFO","time":"2024/02/28 15:42:04.519820 +0800","caller":"txnimpl/table.go:312","msg":"[Start]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea9-0dd3-7c90-bf10-b79baeaade1b-0-0>","phase":"Phase_Freeze"}
96863:{"level":"WARN","time":"2024/02/28 15:42:04.519874 +0800","caller":"txnimpl/table.go:800","msg":"[txn018DEEA99BD673D085CF4B9F5421DC3A,ts=1709106124355213129-1]: table-272499 blk-018deea9-c619-7179-a955-2fde70311a7f-0-1 delete rows [1095,1095] pk [T=VARCHAR][1]: 3a15243a168783[false]"}
96864:{"level":"INFO","time":"2024/02/28 15:42:04.519894 +0800","caller":"txnimpl/table.go:320","msg":"[End]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea9-0dd3-7c90-bf10-b79baeaade1b-0-0>","phase":"Phase_Freeze","error":"w-w conflict"}
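
The excerpt above appears to follow a `line-number:JSON` layout, one record per line. A minimal sketch for picking out the failing `transfer-deletes` records, assuming that layout holds for the rest of the log; the helper name `find_failed_ops` is made up for illustration, and the two sample records are copied from the excerpt:

```python
import json

def find_failed_ops(lines, operation="transfer-deletes"):
    """Yield (line_no, record) for [End] records of `operation` that carry an error."""
    for raw in lines:
        prefix, sep, payload = raw.partition(":")
        if not sep or not prefix.isdigit():
            continue  # skip "..." separators and lines without a numeric prefix
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not a JSON record, ignore
        if (record.get("operation") == operation
                and record.get("msg") == "[End]"
                and "error" in record):
            yield int(prefix), record

sample = [
    '96861:{"level":"INFO","time":"2024/02/28 15:42:04.519764 +0800","caller":"txnimpl/table.go:320","msg":"[End]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea8-7a32-7328-9cbe-545c46d1b3e6-0-0>","phase":"Phase_Freeze"}',
    '...',
    '96864:{"level":"INFO","time":"2024/02/28 15:42:04.519894 +0800","caller":"txnimpl/table.go:320","msg":"[End]","txn-start-ts":"1709106124355213129-1","operation":"transfer-deletes","operand":"BLK<272490-272499-018deea9-0dd3-7c90-bf10-b79baeaade1b-0-0>","phase":"Phase_Freeze","error":"w-w conflict"}',
]

hits = list(find_failed_ops(sample))
for line_no, record in hits:
    print(line_no, record["txn-start-ts"], record["operand"], record["error"])
```

Grouping the hits by `txn-start-ts` would make it easier to line them up with the retried transaction.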

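The two w-w conflicts seen so far have the following in common:

  1. The transaction was retried (same txn ID, but the start ts changed).
  2. The w-w conflict is reported during the transfer delete step of the user transaction: the destination row has already been located, but it turns out to have been deleted once before.

Current guess: because of the retry, duplicate rowids (an old and a new rowid pointing at the same primary key) were sent for deletion. The next step is to inspect the delete chain of the target blk to confirm this.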
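
The guess above can be illustrated with a toy model. This is not MatrixOne code: `apply_deletes`, `transfer_map`, and the row labels are hypothetical. The model only assumes that a delete aimed at a moved row is redirected to the row's new location, and that deleting an already-deleted row is rejected.

```python
class WWConflict(Exception):
    """Raised when a delete targets a row that was already deleted."""

def apply_deletes(rowids, transfer_map, deleted):
    """Redirect stale rowids through transfer_map, then mark targets deleted.

    `deleted` is the set of rowids already deleted in the destination block.
    """
    for rowid in rowids:
        target = transfer_map.get(rowid, rowid)  # follow the transfer if the row moved
        if target in deleted:
            raise WWConflict(f"{rowid} -> {target} already deleted")
        deleted.add(target)

# Hypothetical labels: the old row was moved to a new block,
# so both rowids identify the same primary key.
transfer_map = {"old-blk/row-7": "new-blk/row-3"}

# Only the old rowid is sent: the delete is transferred and succeeds.
deleted = set()
apply_deletes(["old-blk/row-7"], transfer_map, deleted)

# Both the old and the new rowid are sent (the suspected retry case).
deleted = set()
try:
    apply_deletes(["old-blk/row-7", "new-blk/row-3"], transfer_map, deleted)
    outcome = "ok"
except WWConflict as exc:
    outcome = f"w-w conflict: {exc}"
print(outcome)
```

In this model the second delete fails because the first one, after transfer, already removed the same destination row. That matches the symptom in the log but does not by itself confirm the cause.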

@guguducken

guguducken commented Mar 10, 2024

@aptend

aptend commented Mar 11, 2024

Possibly reproduced on main: https://github.com/matrixorigin/matrixone/actions/runs/8219899473/job/22478960672

Conflict location: deletion of the primary-key row (286760, __mo_fake_pk_col) in mo_increment_columns, where 286760 is the table ssb_10g.lineorder.

Conflict sequence:
Transaction 1: start ts 1710052869593179688-1, commit ts 1710052869641162429-5, txn ID unknown. It completed an update of this row: deleted rowid 018e2716-bb04-7f3b-9f7d-8625b25be22b-0-0-23 and inserted rowid 018e2716-bb04-7f3b-9f7d-8625b25be22b-0-0-24.

Transaction 2: start ts 1710052869632045138-1, txn ID 018E2718060A741C9AAAF8C4F7ABC905. After transaction 1 had committed, it issued a delete of 018e2716-bb04-7f3b-9f7d-8625b25be22b-0-0-23, hence the w-w conflict and the failed commit.
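
The sequence above has the shape of a snapshot-isolation write-write conflict: transaction 2 started before transaction 1 committed, then tried to delete a row version that transaction 1 had already deleted. A minimal sketch of that check using the timestamps from this comment. The function name and the rule "a delete committed after my start ts is a conflict" are simplifying assumptions, not the actual TN implementation, and the logical counter after the dash is ignored.

```python
from typing import Optional

def is_ww_conflict(txn_start_ts: int, row_delete_commit_ts: Optional[int]) -> bool:
    """True if another transaction committed a delete of the row after we started."""
    if row_delete_commit_ts is None:
        return False  # nobody has deleted the row
    return row_delete_commit_ts > txn_start_ts

# Physical timestamps (nanoseconds) quoted in the comment above.
TXN1_START = 1710052869593179688
TXN1_COMMIT = 1710052869641162429
TXN2_START = 1710052869632045138

# Transaction 2 started inside transaction 1's start/commit window.
overlapping = TXN1_START < TXN2_START < TXN1_COMMIT

# Transaction 2 deletes rowid ...-23, which transaction 1 deleted at TXN1_COMMIT.
conflict = is_ww_conflict(TXN2_START, TXN1_COMMIT)
print(overlapping, conflict)
```

Under this rule, a transaction 2 that was retried with a start ts later than transaction 1's commit ts would not conflict on this check.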

Addendum: timestamp-to-time mapping

ts: 1710052869593179688
2024-03-10 14:41:09.593000
ts: 1710052869641162429
2024-03-10 14:41:09.641000
ts: 1710052869632045138
2024-03-10 14:41:09.632000
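
The mapping above can be reproduced with a few lines of Python. This sketch assumes the physical part of each ts is Unix time in nanoseconds and that the times are shown at the `+0800` offset used in the log lines earlier in the thread; the helper name `ts_to_local` is illustrative.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))  # assumed offset, taken from the "+0800" in the logs

def ts_to_local(ts_ns: int, tz=UTC_PLUS_8) -> str:
    """Format a nanosecond Unix timestamp as local time with millisecond precision."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    dt = datetime.fromtimestamp(seconds, tz=tz)
    return f"{dt:%Y-%m-%d %H:%M:%S}.{nanos // 1_000_000:03d}"

for ts in (1710052869593179688, 1710052869641162429, 1710052869632045138):
    print(ts, ts_to_local(ts))
```

Splitting off the nanoseconds with integer arithmetic avoids the precision loss of dividing the full value as a float.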

@triump2020

triump2020 commented Mar 11, 2024

Maybe leads to #14880 #14405 #14562

@aptend

aptend commented Mar 15, 2024

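The full chain of events has not been pieced together yet. So far there are two kinds of w-w conflict:

  1. During the transfer delete phase, case https://github.com/matrixorigin/mo-nightly-regression/actions/runs/8076839865/job/22065987668
  2. During execution of an ordinary delete transaction, case https://github.com/matrixorigin/matrixone/actions/runs/8219899473/job/22478960672

One problem known so far is that the timestamp range used to fetch metadata changes is computed incorrectly; #14981 attempts to fix it.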

@aptend aptend mentioned this issue Mar 19, 2024
7 tasks
@aptend

aptend commented Mar 20, 2024

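The ssb load w-w conflict reproduced again: https://github.com/matrixorigin/matrixone/actions/runs/8338317388/job/22819471163

This pattern looks fairly stable so far, so I am trying to reproduce it locally.

Single machine, 100 iterations of "create tenant - load 1g ssb - drop tenant": not reproduced
Single machine, 100 iterations of "create tenant - load 10g ssb - drop tenant": not reproduced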

@triump2020

triump2020 commented Mar 22, 2024

A w-w problem I found on my side is related to the transferRowid logic on the CN side; a fix is in progress. Scenario: update + delete.

@aptend

aptend commented Mar 25, 2024

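Root cause of the load ssb w-w problem: the CN-side logic for looking up the auto-increment column table of a non-system tenant was wrong, so the table could not be found. As a result the transaction was not retried correctly, did not see the latest change to the auto-increment column, and sent duplicate data to TN for commit, which produced the w-w conflict. Fix PR: #15082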

@aptend

aptend commented Mar 28, 2024

Still investigating the cause of the w-w conflict on the latest main.

@aptend aptend mentioned this issue Apr 2, 2024
7 tasks
@aptend

aptend commented Apr 8, 2024

Fixed by b915762.

The cause of the w-w conflict on the latest main is still being investigated.

@aptend aptend assigned Ariznawlll and unassigned aptend Apr 10, 2024
@Ariznawlll

testing

@Ariznawlll

It has not shown up in the last two days, so closing for now.
