
Lotus' api ChainNotify sometimes keeps blocking #6883

Closed
freek99 opened this issue Jul 27, 2021 · 21 comments
Labels
area/api kind/enhancement Kind: Enhancement P2 P2: Should be resolved

Comments

@freek99

freek99 commented Jul 27, 2021

Lotus' API ChainNotify is called in WindowPoStScheduler.Run. Sometimes the call blocks indefinitely, which prevents WindowPoSt from running normally.

We added some log lines to track whether ChainNotify blocks:

func (s *WindowPoStScheduler) Run(...) {
	...
	// log immediately before and after the call to see whether it returns
	log.Info("s.api.ChainNotify(ctx) start")
	notifs, err = s.api.ChainNotify(ctx)
	log.Info("s.api.ChainNotify(ctx) end")
	...
}

Normally it prints both lines:

s.api.ChainNotify(ctx) start
s.api.ChainNotify(ctx) end

But sometimes it prints only:

s.api.ChainNotify(ctx) start

When this happens, WindowPoSt cannot work.

Here is the relevant log output:

{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.160.50:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.22:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.36:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.44:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.170.39:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.26:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.170.49:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.40:3456/rpc/v0\": context deadline exceeded"}
{"level":"error","ts":"2021-07-27T13:06:26.580+0800","logger":"rpc","caller":"go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/websocket.go:498","msg":"sending ping message: write tcp 127.0.0.1:56796->127.0.0.1:1234: use of closed network connection"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.170.42:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:31.585+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.41:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.854+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"warn","ts":"2021-07-27T13:06:42.854+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:113","msg":"window post scheduler notifs channel closed"}
{"level":"info","ts":"2021-07-27T13:06:42.854+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:94","msg":"s.api.ChainNotify(ctx) start"}
{"level":"warn","ts":"2021-07-27T13:06:42.854+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"error","ts":"2021-07-27T13:06:42.856+0800","logger":"miner","caller":"miner/miner.go:256","msg":"failed to get best mining candidate: handler: websocket connection closed"}
{"level":"warn","ts":"2021-07-27T13:06:42.856+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.203.13:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.864+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.27:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.864+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.53:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.864+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.170.45:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.866+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.160.18:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.866+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.11:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.866+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.150.13:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.866+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.26:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.866+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.150.20:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.170.50:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.160.38:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.867+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.150.50:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:42.869+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post \"http://192.168.140.47:3456/rpc/v0\": context deadline exceeded"}

System info:

OS version: Ubuntu 18.04.1
Lotus version: 1.10.1
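
One further way we could confirm whether the call itself hangs is a small watchdog around it; the helper below is only an illustration (the name, timeout, and stand-in body are made up for the example):

package main

import (
	"log"
	"time"
)

// watchCall runs fn and logs a warning if it has not returned within
// timeout. Purely a diagnostic aid; names and values are illustrative.
func watchCall(name string, timeout time.Duration, fn func()) {
	done := make(chan struct{})
	go func() {
		select {
		case <-done:
		case <-time.After(timeout):
			log.Printf("%s has not returned after %s", name, timeout)
		}
	}()
	fn()
	close(done)
}

func main() {
	watchCall("s.api.ChainNotify(ctx)", time.Second, func() {
		time.Sleep(2 * time.Second) // stand-in for a blocked RPC call
	})
}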
@yhio
Contributor

yhio commented Jul 27, 2021

We often hit the same problem when batching and aggregating messages.

@hunjixin
Contributor

@freek99 Check the lotus log. Is there anything like "head change sub is slow, has %d buffered entries"?

@freek99
Author

freek99 commented Jul 28, 2021

@freek99 Check the lotus log. Is there anything like "head change sub is slow, has %d buffered entries"?

@hunjixin There is no such log message.

@Derek-zd

Worker log:

2021-07-30T00:05:30.688 INFO filecoin_proofs::api::seal > seal_commit_phase1:start: SectorId(197018)
2021-07-30T00:05:31.422 INFO filecoin_proofs::api::seal > seal_commit_phase1:finish: SectorId(197018)

Miner log:

{"level":"info","ts":"2021-07-30T00:11:40.229+0800","logger":"advmgr","caller":"sector-storage/worker_tracked.go:57","msg":"track time 197018-C1 took: 6m9.561621024s WorkerID: f7290381-7996-4251-9674-47e5a409294e Hostname: *"}

@freek99
Author

freek99 commented Jul 30, 2021

@Stebalien
There is a scenario where, with the aggregation feature turned on, a large number of C1 jobs arrive in a short time and each C1 takes far longer than usual: several minutes instead of the typical ten-odd seconds. At that point ChainNotify is blocked.

@Stebalien
Member

@freek99 A stack trace taken while this is happening would be really helpful. The (hidden) lotus pprof goroutines command will give it to you.

@freek99
Author

freek99 commented Jul 30, 2021

@freek99 A stack trace taken while this is happening would be really helpful. The (hidden) lotus pprof goroutines command will give it to you.

@Stebalien
We have collected some dumps (with lotus-miner pprof goroutines rather than lotus pprof goroutines); here are some key excerpts:

(The complete output file is fairly large, more than 50 MB. If you need it, how can I get it to you?)

goroutine 1128 [select, 5 minutes]:
github.com/filecoin-project/go-jsonrpc.websocketClient.func2(0x3dd1d80, 0xc02b8ef0e0, 0x38df93c, 0x3, 0xc35e17bd70, 0xc11536c1f8, 0x14, 0x57cc770, 0x0, 0x0, ...)
	/root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:233 +0x1fe
github.com/filecoin-project/go-jsonrpc.(*client).sendRequest(0xc00119b140, 0x3dd1d80, 0xc02b8ef0e0, 0x38df93c, 0x3, 0xc35e17bd70, 0xc11536c1f8, 0x14, 0x57cc770, 0x0, ...)
	/root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:414 +0x196
github.com/filecoin-project/go-jsonrpc.(*rpcFunc).handleRpcCall(0xc0011acd70, 0xc1b4721d88, 0x1, 0x1, 0x0, 0x0, 0x0)
	/root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:526 +0x5ff
github.com/filecoin-project/lotus/api.(*FullNodeStruct).ChainNotify(0xc00121a000, 0x3dd1cd8, 0xc0109ae5c0, 0x0, 0xc030be1f60, 0x1)
	/root/workspace/lotus/api/proxy_gen.go:1220 +0x42
github.com/filecoin-project/lotus/storage.(*WindowPoStScheduler).Run(0xc003803980, 0x3dd1cd8, 0xc0109ae5c0)
	/root/workspace/lotus/storage/wdpost_sched.go:96 +0x14a
created by github.com/filecoin-project/lotus/node/modules.StorageMiner.func1.1
	/root/workspace/lotus/node/modules/storageminer.go:247 +0x67

@Stebalien
Member

Please run both and post the full logs. That single stack trace shows me that your call to ChainNotify isn't getting any results, but it doesn't tell me where it's stuck on the other side.

@Stebalien Stebalien reopened this Jul 30, 2021
@Stebalien
Member

(sorry, wrong button)

@beck-8
Contributor

beck-8 commented Jul 31, 2021

@Stebalien

I have the same problem. It occurs frequently; sometimes it recovers on its own, but sometimes it does not, and only restarting the miner helps.

{"level":"warn","ts":"2021-07-31T15:02:20.464+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":27,"forRound":980285,"baseEpoch":980284,"baseDeltaSeconds":6
,"nullRounds":0,"lateStart":true,"beaconEpoch":1076129,"lookbackEpochs":900,"networkPowerAtLookback":"9575549025580548096","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"erro
r":null}
{"level":"warn","ts":"2021-07-31T15:02:50.235+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":6,"forRound":980286,"baseEpoch":980285,"baseDeltaSeconds":6
,"nullRounds":0,"lateStart":true,"beaconEpoch":1076130,"lookbackEpochs":900,"networkPowerAtLookback":"9575549025580548096","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"error
":null}
{"level":"warn","ts":"2021-07-31T15:03:20.105+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":5,"forRound":980287,"baseEpoch":980286,"baseDeltaSeconds":6
,"nullRounds":0,"lateStart":true,"beaconEpoch":1076131,"lookbackEpochs":900,"networkPowerAtLookback":"9575606578142314496","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"error
":null}
{"level":"error","ts":"2021-07-31T15:03:55.971+0800","logger":"miner","caller":"miner/miner.go:256","msg":"failed to get best mining candidate: handler: websocket connection closed"}
{"level":"warn","ts":"2021-07-31T15:03:55.971+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:113","msg":"window post scheduler notifs channel closed"}
{"level":"warn","ts":"2021-07-31T15:03:55.971+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"warn","ts":"2021-07-31T15:03:55.971+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"error","ts":"2021-07-31T15:03:55.972+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:101","msg":"ChainNotify error: handler: websocket connection closed"}
{"level":"info","ts":"2021-07-31T15:03:56.972+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-07-31T15:03:56.973+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-07-31T15:03:56.972+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-07-31T15:03:56.973+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-07-31T15:03:57.973+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"info","ts":"2021-07-31T15:03:57.973+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"warn","ts":"2021-07-31T15:04:20.065+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":5,"forRound":980289,"baseEpoch":980288,"baseDeltaSeconds":6
,"nullRounds":0,"lateStart":true,"beaconEpoch":1076133,"lookbackEpochs":900,"networkPowerAtLookback":"9575766694523109376","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"error
":null}

@Stebalien
Member

Stebalien commented Jul 31, 2021 via email

@beck-8
Contributor

beck-8 commented Aug 1, 2021

I still need the pprof profiles taken when this is happening.

@Stebalien

If I capture it, I will send it to you as soon as possible.
Just now the connection was closed again; the scene is as follows 😭

I want to try resetting the lotus datastore to work around this problem. I know that is a long shot.

root@miner:/media/nvme# lotus chain export --skip-old-msgs --recent-stateroots 900 chain.car
2021-08-01T13:12:23.568+0800    ERROR   rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/websocket.go:498     sending ping message: write tcp 10.10.20.109:49080->10.10.20.109:1234: use of closed network connection
2021-08-01T13:12:23.568+0800    ERROR   rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/websocket.go:667     Connection timeout      {"remote": "10.10.20.109:1234"}
2021-08-01T13:12:23.568+0800    WARN    rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/websocket.go:678     failed to write close message: write tcp 10.10.20.109:49080->10.10.20.109:1234: use of
closed network connection
2021-08-01T13:12:23.568+0800    WARN    rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/websocket.go:681     websocket close error   {"error": "close tcp 10.10.20.109:49080->10.10.20.109:1234: use of closed network connection"}
ERROR: incomplete export (remote connection lost?)
{"level":"warn","ts":"2021-08-01T13:11:20.053+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":6,"forRound":982943,"baseEpoch":982942,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":true,"beaconEpoch":1078787,"lookbackEpochs":900,"networkPowerAtLookback":"9640644729113444352","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"error":null}
{"level":"warn","ts":"2021-08-01T13:11:50.234+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":6,"forRound":982944,"baseEpoch":982943,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":true,"beaconEpoch":1078788,"lookbackEpochs":900,"networkPowerAtLookback":"9640644729113444352","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"error":null}
{"level":"warn","ts":"2021-08-01T13:12:27.568+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:113","msg":"window post scheduler notifs channel closed"}
{"level":"error","ts":"2021-08-01T13:12:27.568+0800","logger":"storageminer","caller":"storage/wdpost_sched.go:101","msg":"ChainNotify error: handler: websocket connection closed"}
{"level":"warn","ts":"2021-08-01T13:12:27.568+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"warn","ts":"2021-08-01T13:12:27.568+0800","logger":"events","caller":"events/events.go:106","msg":"listenHeadChanges quit"}
{"level":"error","ts":"2021-08-01T13:12:27.568+0800","logger":"miner","caller":"miner/miner.go:256","msg":"failed to get best mining candidate: handler: websocket connection closed"}
{"level":"info","ts":"2021-08-01T13:12:28.569+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:28.569+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:28.569+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:28.569+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:29.569+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"info","ts":"2021-08-01T13:12:29.569+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:29.570+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"error","ts":"2021-08-01T13:12:29.570+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:30.570+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:30.570+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:30.570+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:30.570+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:31.571+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:31.571+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:31.571+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:31.571+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"error","ts":"2021-08-01T13:12:32.569+0800","logger":"miner","caller":"miner/miner.go:256","msg":"failed to get best mining candidate: handler: websocket connection closed"}
{"level":"info","ts":"2021-08-01T13:12:32.572+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:32.572+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:32.572+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:32.572+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:33.572+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:33.572+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:33.572+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:33.572+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:34.573+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}                                                       
{"level":"error","ts":"2021-08-01T13:12:34.573+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:34.573+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:34.573+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:35.573+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:35.573+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:35.573+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:35.573+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:36.574+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"info","ts":"2021-08-01T13:12:36.574+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"error","ts":"2021-08-01T13:12:36.574+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"error","ts":"2021-08-01T13:12:36.574+0800","logger":"events","caller":"events/events.go:104","msg":"listen head changes errored: listenHeadChanges ChainNotify call failed: handler: websocket con
nection closed"}
{"level":"info","ts":"2021-08-01T13:12:37.574+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"info","ts":"2021-08-01T13:12:37.574+0800","logger":"events","caller":"events/events.go:115","msg":"restarting listenHeadChanges"}
{"level":"warn","ts":"2021-08-01T13:12:37.576+0800","logger":"events","caller":"events/events.go:151","msg":"tsc.add: adding current tipset failed: tipSetCache.add: expected new tipset height to be at lea
st 982944, was 982943"}
{"level":"warn","ts":"2021-08-01T13:12:37.577+0800","logger":"events","caller":"events/events.go:151","msg":"tsc.add: adding current tipset failed: tipSetCache.add: expected new tipset height to be at lea
st 982944, was 982943"}
{"level":"warn","ts":"2021-08-01T13:12:50.703+0800","logger":"miner","caller":"miner/miner.go:507","msg":"completed mineOne","tookMilliseconds":19,"forRound":982946,"baseEpoch":982945,"baseDeltaSeconds":6
,"nullRounds":0,"lateStart":true,"beaconEpoch":1078790,"lookbackEpochs":900,"networkPowerAtLookback":"9640694000978264064","minerPowerAtLookback":"82944408420352","isEligible":true,"isWinner":false,"erro
r":null}

@jennijuju jennijuju added area/api kind/bug Kind: Bug kind/enhancement Kind: Enhancement and removed kind/bug Kind: Bug labels Aug 2, 2021
@jennijuju jennijuju added this to the v1.11.2 milestone Aug 2, 2021
@freek99
Author

freek99 commented Aug 10, 2021

Please run both and post the full logs. That single stack trace shows me that your call to ChainNotify isn't getting any results, but it doesn't tell me where it's stuck on the other side.

@Stebalien

We are waiting to collect goroutine information the next time the problem reappears.
When the blocking occurs, the lotus daemon side may not actually be blocked; the RPC request does not seem to reach the lotus side at all. On the miner side, not only ChainNotify but other RPC API calls are blocked as well. One point of interest may be this place:

/root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:233

Many RPC API calls spend a long time here.

goroutine 20126738 [select, 1 minutes]:
github.com/filecoin-project/go-jsonrpc.websocketClient.func2(0x3dd1d80, 0xc031fe9ce0, 0x38df93c, 0x3, 0xc1b51fe1d8, 0xc001123d58, 0x15, 0xc119fbe000, 0x4, 0x4, ...)
        /root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:233 +0x1fe
github.com/filecoin-project/go-jsonrpc.(*client).sendRequest(0xc00119b140, 0x3dd1d80, 0xc031fe9ce0, 0x38df93c, 0x3, 0xc1b51fe1d8, 0xc001123d58, 0x15, 0xc119fbe000, 0x4, ...)
        /root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:414 +0x196
github.com/filecoin-project/go-jsonrpc.(*rpcFunc).handleRpcCall(0xc0011b1630, 0xc02ece0a00, 0x5, 0x5, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/filecoin-project/go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/client.go:526 +0x5ff
github.com/filecoin-project/lotus/api.(*FullNodeStruct).StateWaitMsg(0xc00121a000, 0x3dd1d10, 0xc000058c28, 0xc1b51b32c0, 0x26, 0x5, 0xffffffffffffffff, 0x1, 0x38742c0, 0x38742c0, ...)
        /root/workspace/lotus/api/proxy_gen.go:2260 +0x82
github.com/filecoin-project/lotus/storage.SealingAPIAdapter.StateWaitMsg(0x7fc1f2f61970, 0xc00121a000, 0x3dd1d10, 0xc000058c28, 0xc1b51b32c0, 0x26, 0x0, 0x0, 0x0, 0x0, ...)
        /root/workspace/lotus/storage/adapter_storage_miner.go:107 +0xc6
github.com/filecoin-project/lotus/extern/storage-sealing.(*Sealing).handleCommitWait(0xc003883200, 0x3dd1d10, 0xc000058c28, 0xc0a2e87620, 0x39028d2, 0xa, 0x2f997, 0x8, 0x0, 0xc330591fa0, ...)
        /root/workspace/lotus/extern/storage-sealing/states_sealing.go:754 +0xc4
github.com/filecoin-project/lotus/extern/storage-sealing.(*Sealing).Plan.func1(0x3dd1d10, 0xc000058c28, 0xc0a2e87620, 0x39028d2, 0xa, 0x2f997, 0x8, 0x0, 0xc330591fa0, 0x1, ...)
        /root/workspace/lotus/extern/storage-sealing/fsm.go:26 +0x95
reflect.Value.call(0x3524240, 0xc0a2e87600, 0x13, 0x38e1966, 0x4, 0xc03338cf70, 0x2, 0x2, 0x6743df, 0x3652de0, ...)
        /usr/local/go/src/reflect/value.go:476 +0x8e7
reflect.Value.Call(0x3524240, 0xc0a2e87600, 0x13, 0xc028b83f70, 0x2, 0x2, 0x6a441c, 0xc028b83ed0, 0x6d3100)
        /usr/local/go/src/reflect/value.go:337 +0xb9
github.com/filecoin-project/go-statemachine.(*StateMachine).run.func3(0xc0925dfa60, 0x3dd1d10, 0xc000058c28, 0xc0a2e87620, 0xc0925dfa70, 0xc172457ce0)
        /root/go/pkg/mod/github.com/filecoin-project/go-statemachine@v0.0.0-20200925024713-05bd7c71fbfe/machine.go:102 +0x2c9
created by github.com/filecoin-project/go-statemachine.(*StateMachine).run
        /root/go/pkg/mod/github.com/filecoin-project/go-statemachine@v0.0.0-20200925024713-05bd7c71fbfe/machine.go:100 +0x372
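
The "[select, N minutes]" frames above match the general shape of a websocket RPC client: the caller writes a request, then selects on a response channel and ctx.Done(). If the connection handler never delivers a response and the context has no deadline, the call blocks indefinitely. A simplified illustration of that pattern (not the actual go-jsonrpc code):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type response struct{ result string }

// call waits for the response to a request that has already been sent.
// With a context that has no deadline, a response that never arrives means
// the caller blocks in this select forever, as in the goroutine dumps above.
func call(ctx context.Context, respCh <-chan response) (string, error) {
	select {
	case r := <-respCh:
		return r.result, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	respCh := make(chan response) // nothing will ever be written to it
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	if _, err := call(ctx, respCh); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("call timed out waiting for a response that never arrived")
	}
}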

@jennijuju jennijuju added the P2 P2: Should be resolved label Aug 17, 2021
@jennijuju jennijuju modified the milestones: v1.11.2, v1.11.3 Aug 17, 2021
@Stebalien
Member

Fixed in #7000. That won't fix the slow subscribers, but it will prevent them from blocking everything else.
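
For anyone curious, the general technique for keeping a slow subscriber from stalling everything else is to give each subscriber its own buffered channel and publish with a non-blocking send, dropping or disconnecting a laggard instead of blocking the hub. A minimal sketch of that idea (the actual change in #7000 may differ):

package main

import "fmt"

type event struct{ height int }

type hub struct {
	subs []chan event
}

// publish fans an event out to every subscriber. A subscriber whose buffer
// is full simply misses the event; it never blocks the hub or its peers.
func (h *hub) publish(ev event) {
	for i, sub := range h.subs {
		select {
		case sub <- ev:
		default:
			fmt.Printf("subscriber %d is slow, dropping event at height %d\n", i, ev.height)
		}
	}
}

func main() {
	h := &hub{subs: []chan event{make(chan event, 1), make(chan event, 1)}}
	h.publish(event{height: 1}) // fills each one-slot buffer
	h.publish(event{height: 2}) // buffers are full: events are dropped, publish still returns
}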

@psweiweimao

I am hitting the same issue. Why does it happen, and how can I solve it? The network is fine, with no dropped packets.
{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.160.50:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:25.392+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.203.22:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.203.36:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.203.44:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.170.39:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.203.26:3456/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2021-07-27T13:06:26.580+0800","logger":"advmgr","caller":"sector-storage/sched_worker.go:354","msg":"failed to check worker session","error":"RPC client error: sendRequest failed: Post "http://192.168.170.49:3456/rpc/v0\": context deadline exceeded"}

@Stebalien
Member

@psweiweimao no, that looks unrelated.

@psweiweimao

Could you tell me why it happened?

@psweiweimao

@psweiweimao no, that looks unrelated.

Could you tell me why it happened? Thanks.

@Stebalien
Member

@psweiweimao I have no idea, please ask on Slack.

@hxg8202

hxg8202 commented Sep 28, 2021

I have the same problem. Lotus version: 1.10.2. How can I solve it?

@jennijuju jennijuju removed this from the v1.11.3 milestone Oct 19, 2021
@Stebalien Stebalien removed their assignment Oct 19, 2021
@rjan90
Contributor

rjan90 commented Apr 21, 2022

Hey! Just wanted to update everybody in this thread that this issue will be tracked in #8362 going forward. It's an issue that is on our radar and one we really want to fix, but it's also one where we will certainly need additional help from everybody.

  1. Ideally we are looking for a detailed and reliable way to reproduce this issue quickly and easily, so if anybody here has insight into how to reproduce it, please post the steps in wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load #8362.
  2. Goroutine dumps from both sides (lotus daemon & lotus-miner) could be helpful, but ideally we are looking for 1.

@rjan90 rjan90 closed this as completed Apr 21, 2022