wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load #8362
Hi @phoenixzhua! Thanks for the report. I will add labels and assign it to the right team for analysis.
Adding some additional information here based on the Slack thread:
If a lot of these Finalize/GET tasks are running when doing windowPoSt, I can see that computing the windowPoSt for the current deadline has finished. Some more information:
This tends to occur under heavy load. @phoenixzhua do you have swap configured, and some statistics on this machine (load metrics) just before and after the error occurs? (Grafana can give you nice info if you have that running.) CPU IO and Wasted are good metrics; you can find these with the top command in Linux. Do you have NFS mounts configured on the machine? If so, do you see anything useful in dmesg about NFS timeout errors?
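For capturing those load metrics right around the failure, top is interactive; a tiny sampler like the sketch below (Linux /proc paths, purely illustrative and not part of lotus) can log the load average and memory headroom every few seconds, so the values just before and after the error can be compared afterwards.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Sample /proc/loadavg and the top of /proc/meminfo every 10 seconds.
	// Leave it running on the miner box and correlate the timestamps with
	// the "websocket connection closed" error in the miner log.
	for {
		load, _ := os.ReadFile("/proc/loadavg")
		mem, _ := os.ReadFile("/proc/meminfo")
		if len(mem) > 200 {
			// First few lines cover MemTotal, MemFree, MemAvailable, Buffers, Cached.
			mem = mem[:200]
		}
		fmt.Printf("%s loadavg: %s%s\n\n", time.Now().Format(time.RFC3339), load, mem)
		time.Sleep(10 * time.Second)
	}
}
```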
Hey! So an update to this is that the devs are hopeful that this issue will be fixed by the upcoming PoSt workers and improved scheduling logic, which will be included in the v1.15.2 release (coming as a release candidate later today). The PoSt workers will enable you to read the sealed sectors from local storage, or from any remote worker, which should reduce the load on the lotus-miner.
Following up: running on 1.15.2-rc1 with a WinningPoSt worker and a WindowPoSt worker attached, the problem persists. 2022-04-20T07:44:41.398Z ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed
@cryptowhizzard Can you clarify a bit whether these PoSt workers used remote storage access to the sealed sectors, or if they had local paths to the sealed sectors on the servers? https://lotus.filecoin.io/storage-providers/seal-workers/post-workers/#remote-storage-access
@rjan90 Yes, the workers are configured with dedicated access to the storage. This issue causes the wdPoSt scheduler not to start at all, so the PoSt workers never get involved.
The reason why I'm asking is that we need a very detailed, and hopefully easy, way to get a good repro (reliable/easy/fast) on this issue. Just posting logs does not necessarily help us move forward. So all the extra information everybody can provide about what their setup looks like, how many sectors are being finalized when this issue happens, and so forth is very much needed for us to get to the bottom of it. I have closed the duplicate issues, and will be tracking everything in this thread:
#5614 - maybe look at this and check why the implemented fix doesn't work, as the websockets obviously do not reconnect. @rjan90, you know as well as all of us what the best bet is in my eyes: 8 cores, 256 GB RAM, run the chain and miner plus a PC2 worker, and make sure it does as much work as possible (some APs, maybe a markets import in addition, some transfers). At some point this will occur.
@cryptowhizzard reported this in v0.7.0 and it seems to still be around! #3830 might just be the same symptoms and a different disease.
@f8-ptrk It is suspected that it's either a go-jsonrpc bug or something in Lotus that uses it wrong! Meaning it could be really hard to uncover, and therefore it's unfortunately not as easy as just going through an old PR and seeing why that did not fix the actual underlying issue, which might be in a dependency.
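For context on where the "notifs channel" in the issue title comes from: go-jsonrpc lets an API method return a Go channel, which the library keeps fed over the websocket and closes when the connection is lost. The sketch below is illustrative only, using a hypothetical Subscribe method rather than the actual lotus scheduler code, but it shows the pattern the wdPoSt scheduler relies on.

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"github.com/filecoin-project/go-jsonrpc"
)

// Hypothetical API stub: go-jsonrpc fills in the function field on connect.
// A method returning a channel behaves as a subscription over the websocket.
type NotifyAPI struct {
	Subscribe func(ctx context.Context) (<-chan string, error)
}

func main() {
	var api NotifyAPI
	closer, err := jsonrpc.NewClient(context.Background(),
		"ws://127.0.0.1:2345/rpc/v0", "Filecoin", &api, http.Header{})
	if err != nil {
		panic(err)
	}
	defer closer()

	ch, err := api.Subscribe(context.Background())
	if err != nil {
		panic(err)
	}
	// The range ends when go-jsonrpc closes the channel, for example because
	// the underlying websocket dropped while the node was under heavy load.
	for msg := range ch {
		fmt.Println("notification:", msg)
	}
	fmt.Println("notifs channel closed")
}
```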
Giving this a second thought, I got a hint from someone else. The problem only occurs (here) when we import a lot of data (straight from the CLI with lotus import). Looking closer at my machines, I had a few where scripts ran lotus-miner storage-deals import commands in parallel. Looking in the log I see some notices about resources not being available (FDs) -> ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed. This begs the question: can parallel imports cause resource constraints for the miner?
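Not a fix, but one quick way to sanity-check the FD-exhaustion theory is to compare open descriptors against the process rlimit. A minimal, Linux-only sketch (the /proc path handling is illustrative, not part of lotus); run it with the miner's PID substituted for os.Getpid() just before the error shows up.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Soft/hard limit on open file descriptors for this process.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	// Count currently open descriptors via /proc (Linux only).
	fds, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", os.Getpid()))
	if err != nil {
		panic(err)
	}
	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n", len(fds), lim.Cur, lim.Max)
}
```

If parallel imports push the open-descriptor count close to the soft limit around the time the websocket error appears, that would support the theory.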
Have the same problem, the window post scheduler keeps restarting... I modified the code to let the lotus-worker finalize the sector to the storage path, but I still get this error.
@hasanalaref @Aloxaf ... are you guys running with a separate market node, or is this lotus-miner running as a monolith?
@cryptowhizzard In monolith. I finally solved it by adding a separate WindowPoSt worker.
@rjan90 can you reproduce it by running that query on your JSON-RPC?
Oh my!!!! @RobQuistNL 🥇. Reproed by:
I doubt this is a problem with the list query, tbh; you most likely will be able to repro this with a bunch of queries.
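For anyone trying to build the reliable repro asked for above, something along these lines approximates "a bunch of queries" against the miner's JSON-RPC endpoint. The endpoint address, the token handling, and the choice of Filecoin.Version as a placeholder method are assumptions; the repro reported in this thread used a heavier list-style call that returns a large response.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	// Assumed local miner API endpoint; adjust host/port for your setup.
	url := "http://127.0.0.1:2345/rpc/v0"
	// Filecoin.Version is only a placeholder for a heavier query.
	body := []byte(`{"jsonrpc":"2.0","method":"Filecoin.Version","params":[],"id":1}`)

	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			req, err := http.NewRequest("POST", url, bytes.NewReader(body))
			if err != nil {
				fmt.Printf("req %d: %v\n", n, err)
				return
			}
			req.Header.Set("Content-Type", "application/json")
			// req.Header.Set("Authorization", "Bearer "+token) // if your API requires it
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				fmt.Printf("req %d failed: %v\n", n, err)
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body)
		}(i)
	}
	wg.Wait()
}
```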
That seems to point at an issue in go-jsonrpc, possibly on the server side - maybe in the websocket library something is single-threaded / blocking on large messages?
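To make the "single-threaded / blocking on large messages" idea concrete: websocket libraries such as gorilla/websocket (which go-jsonrpc builds on) allow only one reader per connection, and control frames like pongs are only processed while a read call is in progress. The sketch below is a generic illustration of that pattern, not go-jsonrpc's actual code; if handling one huge frame blocks the loop, keepalives are starved and the peer eventually treats the connection as dead.

```go
package wsdemo

import (
	"time"

	"github.com/gorilla/websocket"
)

// readLoop illustrates the single-reader constraint: only one goroutine
// may call ReadMessage, and pongs are only handled while a read is active.
func readLoop(conn *websocket.Conn, handle func([]byte)) error {
	conn.SetPongHandler(func(string) error {
		// Extend the deadline each time the peer answers a ping.
		return conn.SetReadDeadline(time.Now().Add(30 * time.Second))
	})
	if err := conn.SetReadDeadline(time.Now().Add(30 * time.Second)); err != nil {
		return err
	}
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			// Deadline exceeded or connection dropped: callers treat this
			// as a dead connection and close any subscription channels.
			return err
		}
		// If handling one very large message takes longer than the ping
		// interval, no pongs are read meanwhile and the deadline fires.
		handle(msg)
	}
}
```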
Sent you rpc debug logs from both sides, as well as goroutine dumps, on Slack!
There were a good number of improvements made to connection robustness in go-jsonrpc v0.2.0, and lotus v0.20.0 / master is now at go-jsonrpc v0.2.1, so it would be good to see if this issue has improved. If it still happens, the new go-jsonrpc also adds better debugging logs that would be very helpful in nailing down what the issue is here:
I re-ran the repro of calling that query, and that makes me cautiously optimistic that it is fixed, but I will await some further confirmations 👀
Will add these to the knowledge base, so we can point users there in case they run into any jsonrpc issues in the future.
Should be closed after #10395 lands
The channel itself is not getting closed, but wdpost workers don't work in 1.20.3. The workers do the work, and when they are done, they fail in some random way (context canceled is not really... descriptive). Then the miner itself tries to do it, but is obviously too late. 17:09:30.147Z wdpost/wdpost_run.go:260","msg":"starting PoSt cycle","cycle":"2023-03-11T17:09:30.147Z","manual":false,"ts":"[bafy2bzacebmdvyya4fqadp76p7bo75qdx53sh35qaboazrl6ncge3x73y2le6]","deadline":25}
17:09:30.147Z wdpost/wdpost_run.go:412","msg":"computing window post","cycle":"2023-03-11T16:44:00.329Z","batch":0,"elapsed":603.633691775,"skip":0,"err":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","errVerbose":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_run.go:414","msg":"error generating window post: partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","cycle":"2023-03-11T16:44:00.329Z"}
17:09:30.147Z wdpost/wdpost_run.go:262","msg":"post cycle done","cycle":"2023-03-11T16:44:00.329Z","took":1529.817936669}
17:09:30.147Z wdpost/wdpost_run.go:98","msg":"runPoStCycle failed: running window post failed:\n github.com/filecoin-project/lotus/storage/wdpost.(*WindowPoStScheduler).runPoStCycle\n /home/filecoinnew/lotus/storage/wdpost/wdpost_run.go:475\n - partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_changehandler.go:254","msg":"Aborted window post Proving (Deadline: &{CurrentEpoch:2674888 PeriodStart:2673439 Index:24 Open:2674879 Close:2674939 Challenge:2674859 FaultCutoff:2674809 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})"}
The WDPost worker explodes with this error: {"level":"error","ts":"2023-03-11T18:39:29.811Z","logger":"stores","caller":"paths/local.go:806","msg":"failed to generate valilla PoSt proof before context cancellation","err":"context canceled","duration":158.067953517,"cache-id":"xxx-storage2","sealed-id":"xxx-storage2","cache":"/mnt/xxx/storage2/cache/s-t0xxx-176645","sealed":"/mnt/xxx/storage2/sealed/s-t0xxx-176645"} and {"level":"error","ts":"2023-03-11T18:39:29.813Z","logger":"advmgr","caller":"sealer/worker_local.go:689","msg":"reading PoSt challenge for sector 176703, vlen:0, err: failed to generate vanilla proof before context cancellation: context canceled"}
@RobQuistNL The fixes for the channel closed issues are not in v1.20.3, but are coming in v1.21.0, which updates go-jsonrpc to v0.2.2. The v1.21.0 release will also include quite a few fixes for the wdPoSt worker assignment logic.
Rjan90 is not on the same page here. There are new things introduced in lotus-miner. Your window post breaks because the default timeout for vanilla proofs is 600 secs now. That is way too low. After setting it higher here, the problem was resolved; I set it to 22 minutes. Cheers! [Proving] #DisableWDPoStPreChecks = false
Actually, you know what resolved it? Removing the long-term storage from my PC1 worker machines. Makes no sense, or does it? #10454 |
@RobQuistNL [Proving] DisableBuiltinWindowPoSt = true and wdpost workers run with --post-parallel-reads=66
I think the problem is that somehow workers request where a sector is (despite it never moving) through the miner, which is then scanning through thousands of listed items, and then continues to fetch through the miner, which is slowing things down. Lotus' storage management is probably one of the major culprits. E.g. #8783 |
I haven't changed any of those settings, and it simply boils down to this: Remove the long term storage from the PC1 workers -> all works OK (except that sealed sectors are now sent over HTTP through FETCH jobs instead of just moving quickly over NFS shares) |
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
Before the network version 15 upgrade, my miner and daemon were on the same box (only doing the Finalize to get the final ~32GB of data to long-term storage).
Everything went well for 30+ days. We can seal about 500 sectors a day.
After the network version 15 upgrade, when I start sealing, the connection between the miner and daemon will time out and close, and window post will fail.
Then I separated the daemon and miner onto different boxes. When there are no sealing jobs, window post goes well for days.
But when I run sealing jobs for about 16-18 hours, the connection between the miner and daemon will time out and close, and window post will fail.
It seems the miner box is not overloaded, except for the heavy buff/cache memory usage, more than 600GiB.
Logging Information
Repo Steps