[Bug]: lots of error "cannot commit a orphan transaction" after dn crashed 26 times and recovered at last during stability test on distributed mode #21011

Closed
1 task done
aressu1985 opened this issue Dec 30, 2024 · 3 comments
Assignees
Labels
kind/bug Something isn't working phase/testing severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone

Comments

@aressu1985
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

2.0-dev

Commit ID

ec75b49

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

Many "cannot commit a orphan transaction" errors appeared after the DN crashed 26 times and finally recovered during a stability test in distributed mode.

The TN started crashing due to OOM at [2024-12-28 16:58:14], crashed 26 times in total, and recovered at [2024-12-28 20:02:49].
However, starting at [2024-12-28 20:10:58], there are many "cannot commit a orphan transaction" errors.

image

link:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22Ssm%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-ec75b49-202412272153%5C%22%7D%20%7C%3D%20%60cannot%20commit%20a%20orphan%20transaction%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221735385400000%22,%22to%22:%221735389059000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. Run a MO cluster with the config in this issue.
2. Run TPC-H 10G loop test processes in one independent tenant.
3. Run TPC-C long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another set of sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.

Additional information

No response

@aressu1985 aressu1985 added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Dec 30, 2024
@aressu1985 aressu1985 added this to the 2.0.2 milestone Dec 30, 2024
@zhangxu19830126
Contributor

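  1. The goroutine responsible for periodically sending heartbeats is alive.
  2. There are no error logs about sending heartbeats.
  3. Timed-out orphan transactions were detected.

The only explanation that fits is that no heartbeat was sent for 10 minutes. Let's wait for @iamlinjunhong to add a log entry every time a heartbeat is sent, and then see how long the interval actually is.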

@iamlinjunhong
Contributor

#21077
