wdPoSt scheduler notifs channel closes when lotus-miner is under heavy load #8362
Hi @phoenixzhua! Thanks for the report. I will add labels and assign it to the right team for analysis.
Adding some additional information here based on the Slack thread:
If a lot of these Finalize/GET tasks are running when doing windowPoSt, I can see that computing the windowPoSt for the current deadline has finished. Some more information:
This tends to occur under heavy load. @phoenixzhua do you have swap configured, and some statistics on this machine (load metrics) just before and after the error occurs? (Grafana can give you nice info if you have that running.) CPU IO and Wasted are good metrics; you can find these with the top command in Linux. Do you have NFS mounts configured on the machine? If so, do you see anything useful in dmesg about NFS timeout errors?
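For capturing those load metrics right around the failure, top is interactive; a tiny sampler like the sketch below (Linux /proc paths, purely illustrative and not part of lotus) can log the load average and memory headroom every few seconds, so the values just before and after the error can be compared afterwards.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Sample /proc/loadavg and the top of /proc/meminfo every 10 seconds.
	// Leave it running on the miner box and correlate the timestamps with
	// the "websocket connection closed" error in the miner log.
	for {
		load, _ := os.ReadFile("/proc/loadavg")
		mem, _ := os.ReadFile("/proc/meminfo")
		if len(mem) > 200 {
			// First few lines cover MemTotal, MemFree, MemAvailable, Buffers, Cached.
			mem = mem[:200]
		}
		fmt.Printf("%s loadavg: %s%s\n\n", time.Now().Format(time.RFC3339), load, mem)
		time.Sleep(10 * time.Second)
	}
}
```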
Hey! So an update to this is that the devs are hopeful that this issue will be fixed by the upcoming PoSt workers and improved scheduling logic, which will be included in the v1.15.2 release (coming as a release candidate later today). The PoSt workers will enable you to read the sealed sectors from local storage, or from any remote worker, which should reduce the load on the lotus-miner.
Following up: running on 1.15.2-rc1 with a WinningPoSt worker and a WindowPoSt worker attached, the problem persists. 2022-04-20T07:44:41.398Z ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed
@cryptowhizzard Can you clarify a bit whether these PoSt workers used remote storage access to the sealed sectors, or if they had local paths to the sealed sectors on the servers? https://lotus.filecoin.io/storage-providers/seal-workers/post-workers/#remote-storage-access
@rjan90 Yes, the workers are configured with dedicated access to the storage. This issue causes the wdPoSt scheduler not to start at all, so the PoSt workers never get involved.
The reason why I'm asking is that we need a very detailed, and hopefully easy, way to get a good repro (reliable/easy/fast) on this issue. Just posting logs does not necessarily help us move forward. So all the extra information everybody can provide about what their setup looks like, how many sectors are being finalized when this issue happens, and so forth is very much needed for us to get to the bottom of it. I have closed the duplicate issues, and will be tracking everything in this thread:
#5614 - maybe look at this and check why the implemented fix doesn't work, as the websockets obviously do not reconnect. @rjan90, you know as well as all of us what the best bet is in my eyes: 8 cores, 256 GB RAM, run the chain and miner plus a PC2 worker, and make sure it does as much work as possible (some APs, maybe a markets import in addition, some transfers). At some point this will occur.
@cryptowhizzard reported this in v0.7.0 and it seems to still be around! #3830 might just be the same symptoms and a different disease.
@f8-ptrk It is suspected that it's either a go-jsonrpc bug or something in Lotus that uses it wrong! Meaning it could be really hard to uncover, and therefore it's unfortunately not as easy as just going through an old PR and seeing why that did not fix the actual underlying issue, which might be in a dependency.
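For context on where the "notifs channel" in the issue title comes from: go-jsonrpc lets an API method return a Go channel, which the library keeps fed over the websocket and closes when the connection is lost. The sketch below is illustrative only, using a hypothetical Subscribe method rather than the actual lotus scheduler code, but it shows the pattern the wdPoSt scheduler relies on.

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"github.com/filecoin-project/go-jsonrpc"
)

// Hypothetical API stub: go-jsonrpc fills in the function field on connect.
// A method returning a channel behaves as a subscription over the websocket.
type NotifyAPI struct {
	Subscribe func(ctx context.Context) (<-chan string, error)
}

func main() {
	var api NotifyAPI
	closer, err := jsonrpc.NewClient(context.Background(),
		"ws://127.0.0.1:2345/rpc/v0", "Filecoin", &api, http.Header{})
	if err != nil {
		panic(err)
	}
	defer closer()

	ch, err := api.Subscribe(context.Background())
	if err != nil {
		panic(err)
	}
	// The range ends when go-jsonrpc closes the channel, for example because
	// the underlying websocket dropped while the node was under heavy load.
	for msg := range ch {
		fmt.Println("notification:", msg)
	}
	fmt.Println("notifs channel closed")
}
```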
Giving this a second thought, I got a hint from someone else. The problem only occurs (here) when we import a lot of data (straight from the CLI with lotus import). Looking closer at my machines, I had a few where scripts ran lotus-miner storage-deals import commands in parallel. Looking in the log I see some notices about resources not being available (FDs) -> ERROR events events/events_called.go:352 event diff fn failed: handler: websocket connection closed. This begs the question: can parallel imports cause resource constraints for the miner?
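Not a fix, but one quick way to sanity-check the FD-exhaustion theory is to compare open descriptors against the process rlimit. A minimal, Linux-only sketch (the /proc path handling is illustrative, not part of lotus); run it with the miner's PID substituted for os.Getpid() just before the error shows up.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Soft/hard limit on open file descriptors for this process.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	// Count currently open descriptors via /proc (Linux only).
	fds, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", os.Getpid()))
	if err != nil {
		panic(err)
	}
	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n", len(fds), lim.Cur, lim.Max)
}
```

If parallel imports push the open-descriptor count close to the soft limit around the time the websocket error appears, that would support the theory.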
Have the same problem, the window post scheduler keeps restarting... I modified the code to let the lotus-worker finalize the sector to the storage path, but I still get this error.
@hasanalaref @Aloxaf ... are you guys running with a separate market node, or is this lotus-miner running as a monolith?
@cryptowhizzard In monolith. I finally solved it by adding a separate WindowPoSt worker.
@rjan90 can you reproduce it by running that query on your JSON-RPC?
Oh my!!!! @RobQuistNL 🥇. Reproed by:
I doubt this is a problem with the list query, tbh; you most likely will be able to repro this with a bunch of queries.
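For anyone trying to build the reliable repro asked for above, something along these lines approximates "a bunch of queries" against the miner's JSON-RPC endpoint. The endpoint address, the token handling, and the choice of Filecoin.Version as a placeholder method are assumptions; the repro reported in this thread used a heavier list-style call that returns a large response.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	// Assumed local miner API endpoint; adjust host/port for your setup.
	url := "http://127.0.0.1:2345/rpc/v0"
	// Filecoin.Version is only a placeholder for a heavier query.
	body := []byte(`{"jsonrpc":"2.0","method":"Filecoin.Version","params":[],"id":1}`)

	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			req, err := http.NewRequest("POST", url, bytes.NewReader(body))
			if err != nil {
				fmt.Printf("req %d: %v\n", n, err)
				return
			}
			req.Header.Set("Content-Type", "application/json")
			// req.Header.Set("Authorization", "Bearer "+token) // if your API requires it
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				fmt.Printf("req %d failed: %v\n", n, err)
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body)
		}(i)
	}
	wg.Wait()
}
```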
That seems to point at an issue in go-jsonrpc, possibly on the server side - maybe in the websocket library something is single-threaded / blocking on large messages?
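To make the "single-threaded / blocking on large messages" idea concrete: websocket libraries such as gorilla/websocket (which go-jsonrpc builds on) allow only one reader per connection, and control frames like pongs are only processed while a read call is in progress. The sketch below is a generic illustration of that pattern, not go-jsonrpc's actual code; if handling one huge frame blocks the loop, keepalives are starved and the peer eventually treats the connection as dead.

```go
package wsdemo

import (
	"time"

	"github.com/gorilla/websocket"
)

// readLoop illustrates the single-reader constraint: only one goroutine
// may call ReadMessage, and pongs are only handled while a read is active.
func readLoop(conn *websocket.Conn, handle func([]byte)) error {
	conn.SetPongHandler(func(string) error {
		// Extend the deadline each time the peer answers a ping.
		return conn.SetReadDeadline(time.Now().Add(30 * time.Second))
	})
	if err := conn.SetReadDeadline(time.Now().Add(30 * time.Second)); err != nil {
		return err
	}
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			// Deadline exceeded or connection dropped: callers treat this
			// as a dead connection and close any subscription channels.
			return err
		}
		// If handling one very large message takes longer than the ping
		// interval, no pongs are read meanwhile and the deadline fires.
		handle(msg)
	}
}
```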
Sent you rpc debug logs from both sides, as well as goroutine dumps, on Slack!
There were a good number of improvements made to connection robustness in go-jsonrpc v0.2.0, and lotus v0.20.0 / master is now at go-jsonrpc v0.2.1, so it would be good to see if this issue has improved. If it still happens, the new go-jsonrpc also adds better debugging logs that would be very helpful in nailing down what the issue is here:
I re-ran the repro of calling that query, and that makes me cautiously optimistic that it is fixed, but I will await some further confirmations 👀
Will add these to the knowledge base, so we can point users there in case they run into any jsonrpc issues in the future.
Should be closed after #10395 lands
The channel itself is not getting closed, but wdpost workers don't work in 1.20.3. The workers do the work, and when they are done, they fail in some random way (context canceled is not really... descriptive). Then the miner itself tries to do it, but is obviously too late. 17:09:30.147Z wdpost/wdpost_run.go:260","msg":"starting PoSt cycle","cycle":"2023-03-11T17:09:30.147Z","manual":false,"ts":"[bafy2bzacebmdvyya4fqadp76p7bo75qdx53sh35qaboazrl6ncge3x73y2le6]","deadline":25}
17:09:30.147Z wdpost/wdpost_run.go:412","msg":"computing window post","cycle":"2023-03-11T16:44:00.329Z","batch":0,"elapsed":603.633691775,"skip":0,"err":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","errVerbose":"partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_run.go:414","msg":"error generating window post: partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled","cycle":"2023-03-11T16:44:00.329Z"}
17:09:30.147Z wdpost/wdpost_run.go:262","msg":"post cycle done","cycle":"2023-03-11T16:44:00.329Z","took":1529.817936669}
17:09:30.147Z wdpost/wdpost_run.go:98","msg":"runPoStCycle failed: running window post failed:\n github.com/filecoin-project/lotus/storage/wdpost.(*WindowPoStScheduler).runPoStCycle\n /home/filecoinnew/lotus/storage/wdpost/wdpost_run.go:475\n - partitionCount:1 err:RPC client error: sendRequest failed: Post \"http://172.16.2.10:45810/rpc/v0\": context canceled:\n github.com/filecoin-project/lotus/storage/sealer.(*Manager).generateWindowPoSt.func2\n /home/filecoinnew/lotus/storage/sealer/manager_post.go:193"}
17:09:30.147Z wdpost/wdpost_changehandler.go:254","msg":"Aborted window post Proving (Deadline: &{CurrentEpoch:2674888 PeriodStart:2673439 Index:24 Open:2674879 Close:2674939 Challenge:2674859 FaultCutoff:2674809 WPoStPeriodDeadlines:48 WPoStProvingPeriod:2880 WPoStChallengeWindow:60 WPoStChallengeLookback:20 FaultDeclarationCutoff:70})"}
The WDPost worker explodes with this error: {"level":"error","ts":"2023-03-11T18:39:29.811Z","logger":"stores","caller":"paths/local.go:806","msg":"failed to generate valilla PoSt proof before context cancellation","err":"context canceled","duration":158.067953517,"cache-id":"xxx-storage2","sealed-id":"xxx-storage2","cache":"/mnt/xxx/storage2/cache/s-t0xxx-176645","sealed":"/mnt/xxx/storage2/sealed/s-t0xxx-176645"} and {"level":"error","ts":"2023-03-11T18:39:29.813Z","logger":"advmgr","caller":"sealer/worker_local.go:689","msg":"reading PoSt challenge for sector 176703, vlen:0, err: failed to generate vanilla proof before context cancellation: context canceled"}
@RobQuistNL The fixes for the channel closed issues are not in v1.20.3, but are coming in v1.21.0, which updates go-jsonrpc to v0.2.2. The v1.21.0 release will also include quite a few fixes for the wdPoSt worker assignment logic.
Rjan90 is not on the same page here. There are new things introduced in lotus-miner. Your window post breaks because the default timeout for vanilla proofs is 600 secs now. That is way too low. After setting it higher here, the problem was resolved; I set it to 22 minutes. Cheers! [Proving] #DisableWDPoStPreChecks = false
Actually, you know what resolved it? Removing the long-term storage from my PC1 worker machines. Makes no sense, or does it? #10454 |
@RobQuistNL [Proving] DisableBuiltinWindowPoSt = true and wdpost workers run with --post-parallel-reads=66
I think the problem is that somehow workers request where a sector is (despite it never moving) through the miner, which is then scanning through thousands of listed items, and then continues to fetch through the miner, which is slowing things down. Lotus' storage management is probably one of the major culprits. E.g. #8783 |
I haven't changed any of those settings, and it simply boils down to this: Remove the long term storage from the PC1 workers -> all works OK (except that sealed sectors are now sent over HTTP through FETCH jobs instead of just moving quickly over NFS shares) |
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
Before the network version 15 upgrade, my miner and daemon were on the same box (only doing the Finalize to get the final ~32GB of data to long-term storage).
Everything went well for 30+ days. We can seal about 500 sectors a day.
After the network version 15 upgrade, when I start sealing, the connection between the miner and daemon will time out and close, and window post will fail.
Then I separated the daemon and miner onto different boxes. When there are no sealing jobs, window post goes well for days.
But when I run sealing jobs for about 16-18 hours, the connection between the miner and daemon will time out and close, and window post will fail.
It seems the miner box is not overloaded, except for the heavy buff/cache memory usage, more than 600GiB.
Logging Information
Repo Steps