Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dm-worker hangs after running dm tasks with fast/full validators #7241

Closed
fubinzh opened this issue Sep 28, 2022 · 5 comments · Fixed by #7269
Closed

dm-worker hangs after running dm tasks with fast/full validators #7241

fubinzh opened this issue Sep 28, 2022 · 5 comments · Fixed by #7269
Labels
affects-6.1 affects-6.2 affects-6.3 area/dm Issues or PRs related to DM. found/automation Bugs found by automation cases severity/major type/bug The issue is confirmed as a bug.

Comments

@fubinzh
Copy link

fubinzh commented Sep 28, 2022

What did you do?

  1. Run dm tasks with fast and full validators
  2. Run other DM cases, run dmctl query-status to get DM task status

What did you expect to see?

dm-worker should not hang, dmctl query-status should run success

What did you see instead?

dm-worker hangs, dmctl query-status return context deadline exceeded error

/ # /dm-master -V
Release Version: v6.3.0-alpha
Git Commit Hash: 2c12e2e60d786d61419fc19aa28691dbdf525b55
Git Branch: heads/refs/tags/v6.3.0-alpha
UTC Build Time: 2022-09-27 11:01:23
Go Version: go version go1.19.1 linux/amd64
Failpoint Build: false
/ # time /dmctl --master-addr=[127.0.0.1:8261](http://127.0.0.1:8261/) query-status
{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "no sub task started",
            "sourceStatus": {
                "source": "source-0",
                "worker": "dm-dm-worker-2",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
            ]
        },
        {
            "result": false,
            "msg": "[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
            "sourceStatus": {
                "source": "source-1",
                "worker": "",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
            ]
        }
    ]
}
real    0m 30.15s
user    0m 0.05s
sys     0m 0.02s
/ # grep WARN /var/lib/dm-master/log/dm-master.log | tail
[2022/09/27 17:48:12.060 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:47:41.958 +00:00] ["time spent"=30.101792535s] [remote=[127.0.0.1:40152](http://127.0.0.1:40152/)] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:49:28.632 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:44:18.790 +00:00] ["time spent"=5m9.842540218s] [remote=[127.0.0.1:38842](http://127.0.0.1:38842/)] ["response type"=/pb.Master/OperateTask] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:49:29.223 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:48:59.121 +00:00] ["time spent"=30.102463024s] [remote=[127.0.0.1:40665](http://127.0.0.1:40665/)] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:50:45.723 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:45:36.547 +00:00] ["time spent"=5m9.175809748s] [remote=[127.0.0.1:39335](http://127.0.0.1:39335/)] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:52:05.587 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:46:56.189 +00:00] ["time spent"=5m9.397952776s] [remote=[127.0.0.1:39842](http://127.0.0.1:39842/)] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:53:21.104 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:48:11.972 +00:00] ["time spent"=5m9.132926971s] [remote=[127.0.0.1:40346](http://127.0.0.1:40346/)] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:54:38.529 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:49:29.125 +00:00] ["time spent"=5m9.404252609s] [remote=[127.0.0.1:40868](http://127.0.0.1:40868/)] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:25:27.517 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:24:57.413 +00:00] ["time spent"=30.103021161s] [remote=[127.0.0.1:59562](http://127.0.0.1:59562/)] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:27:45.295 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:27:15.185 +00:00] ["time spent"=30.109207704s] [remote=[127.0.0.1:60382](http://127.0.0.1:60382/)] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:29:11.210 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:28:41.107 +00:00] ["time spent"=30.10232034s] [remote=[127.0.0.1:60906](http://127.0.0.1:60906/)] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]

Versions of the cluster

master

current status of DM cluster (execute query-status <task-name> in dmctl)

/ # /dm-master -V
Release Version: v6.3.0-alpha
Git Commit Hash: 2c12e2e60d786d61419fc19aa28691dbdf525b55
Git Branch: heads/refs/tags/v6.3.0-alpha
UTC Build Time: 2022-09-27 11:01:23
Go Version: go version go1.19.1 linux/amd64
Failpoint Build: false
@fubinzh fubinzh added area/dm Issues or PRs related to DM. type/bug The issue is confirmed as a bug. labels Sep 28, 2022
@fubinzh
Copy link
Author

fubinzh commented Sep 28, 2022

dm-worker logs
work1.dm-worker.log

@lance6716
Copy link
Contributor

When enter DataValidator.stopInner by user action (stop-task, validator stop, etc) or data validating error, stopInner will hold the lock and wait other validator goroutine exit. other validator goroutine may send error at

func (vw *validateWorker) validateTableChange() {
var err error
defer func() {
if err != nil && !isRetryableValidateError(err) {
vw.validator.sendError(terror.ErrValidatorValidateChange.Delegate(err))
or other place. But since the lock is hold by DataValidator.stopInner, reading from error channel is blocked at
func (v *DataValidator) errorProcessRoutine() {
defer v.errProcessWg.Done()
for err := range v.errChan {
v.fillResult(err, true)
When error channel is full (validate worker number > 10), it will enter deadlock.

And when this deadlock is caused by user action, it will further block DM-worker's so query-status can't be responsed.

worker1.goroutine.txt

@lance6716
Copy link
Contributor

And I remember uber's code guide channel size is 0 or 1. If we follow it this bug will be easier to be exposed

@fubinzh
Copy link
Author

fubinzh commented Sep 29, 2022

/severity Major

@fubinzh
Copy link
Author

fubinzh commented Sep 30, 2022

/found automation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.1 affects-6.2 affects-6.3 area/dm Issues or PRs related to DM. found/automation Bugs found by automation cases severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants