[Bug]: cn crashed by fatal "wait latest commit ts failed" during stability test on distributed mode #16716

Closed
1 task done
aressu1985 opened this issue Jun 6, 2024 · 11 comments

Comments

@aressu1985
Contributor

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

1.2-dev

Commit ID

e6b2868

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

During a stability test in distributed mode, a cn crashed with the following fatal error:
{"level":"FATAL","time":"2024/06/05 22:09:44.081179 +0000","name":"cn-service.txn","caller":"client/client.go:434","msg":"wait latest commit ts failed","uuid":"65393636-3165-6662-6631-633163326338","error":"waiter is paused","stacktrace":"github.com/matrixorigin/matrixone/pkg/txn/client.(*txnClient).SyncLatestCommitTS\n\t/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/client.go:434\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).maybeWaitCommittedLogApplied\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:154\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:144\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:160\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

mo-log:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22Jyy%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-e6b2868-20240605224953%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221717623352647%22,%22to%22:%221717626935791%22%7D%7D%7D&schemaVersion=1&orgId=1
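
When triaging an entry like the fatal one above, it can help to pull out the message, the error, and the call chain rather than reading the escaped stack trace by eye. The following is a minimal sketch, not part of MatrixOne: `logEntry`, `parseLogEntry`, and `stackFunctions` are hypothetical names, and the code assumes only the JSON fields visible in the log line (`level`, `msg`, `error`, `stacktrace`) and the layout seen in its `stacktrace` field, where each function name is followed by a tab-indented file:line.

```go
package main

import (
	"encoding/json"
	"strings"
)

// logEntry mirrors only the fields visible in the fatal log line above.
// It is a hypothetical helper type, not a MatrixOne type.
type logEntry struct {
	Level      string `json:"level"`
	Msg        string `json:"msg"`
	Error      string `json:"error"`
	Stacktrace string `json:"stacktrace"`
}

// parseLogEntry decodes one JSON log line into a logEntry.
func parseLogEntry(raw string) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

// stackFunctions returns the function names of a stack trace, innermost
// first. It assumes each frame is a function line followed by a
// tab-indented "file:line" line, and skips the indented lines.
func stackFunctions(trace string) []string {
	var funcs []string
	for _, line := range strings.Split(trace, "\n") {
		if line == "" || strings.HasPrefix(line, "\t") {
			continue
		}
		funcs = append(funcs, line)
	}
	return funcs
}
```

Applied to the entry above, the first function returned would be `SyncLatestCommitTS` in `pkg/txn/client`, and the last would be the stopper task that runs the `incrservice` allocator, which shows the crash came from an auto-increment allocation path waiting on the latest commit timestamp.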

Expected Behavior

No response

Steps to Reproduce

1. run a mo cluster with the config in this issue
2. run tpch 10G loop test processes in one independent tenant
3. run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode
4. run sysbench mixed cases (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode
5. run another sysbench mixed cases (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode

Additional information

No response

@aressu1985 aressu1985 added this to the 1.2.1 milestone Jun 6, 2024
@sukki37 sukki37 assigned volgariver6 and unassigned matrix-meow Jun 6, 2024
@aressu1985
Contributor Author

aressu1985 commented Jun 7, 2024

[2024-06-06 06:09:44.081 FATAL]
06-07-2024-17-35-08_files_list.zip

[2024-06-06 06:14:54 FATAL]
06-07-2024-17-37-50_files_list.zip

[2024-06-06 06:18:44 FATAL]
06-07-2024-17-38-47_files_list.zip

@XuPeng-SH XuPeng-SH assigned aptend and unassigned volgariver6 Jun 7, 2024
@aptend
Contributor

aptend commented Jun 12, 2024

Both the scheduling wait time and the execution time of flush are higher than expected; reproducing.

This was referenced Jun 13, 2024
@aptend
Contributor

aptend commented Jun 13, 2024

Added logging for flush tasks that take on the order of seconds, mainly to observe two things: 1. task scheduling delay 2. IO time for collecting deletes

@aptend
Contributor

aptend commented Jun 14, 2024

The IO time for collecting deletes is too long; fixing.

@aptend aptend mentioned this issue Jun 17, 2024
7 tasks
@aptend
Contributor

aptend commented Jun 17, 2024

Before the PR:
image

After the PR:
image

Flush time has dropped substantially.

@aptend
Contributor

aptend commented Jun 18, 2024

In the daily test, durations all stay under 10s, and pull logtail has not yet shown any excessively long durations.

@volgariver6
Contributor

fixed

@aressu1985
Contributor Author

testing

@aressu1985
Contributor Author

fixed
