fix: wdpost: disabled post worker handling #10394

Merged: arajasek merged 5 commits into master from fix/disabled-post-worker-handling on Mar 6, 2023

magik6k (Contributor) commented Mar 6, 2023

Related Issues

Fixes #9854

Proposed Changes

  • Add an itest reproducing #9854
  • Don't schedule PoSt on disabled workers
  • If a call to a worker fails with an RPC connection error, retry on another worker instead of failing the whole PoSt

Additional Info

This could use testing on a real setup.

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

magik6k requested a review from a team as a code owner on March 6, 2023 13:48
magik6k changed the title from "Fix/disabled post worker handling" to "fix: wdpost: disabled post worker handling" on Mar 6, 2023
rjan90 added this to the v1.21.0 milestone on Mar 6, 2023
rjan90 (Contributor) commented Mar 6, 2023

Can confirm that it works. I spun up two windowPoSt workers and disabled one of them with SIGINT.

The output shows that it is disabled:

lotus-miner proving workers
Worker 5f1a2e86-e110-4f71-a65d-da24eea95cbc, host Ubuntu-2004-focal-amd64-base (disabled)
        CPU:  [                                                                ] 0/12 core(s) in use
        RAM:  [|||                                                             ] 5% 3.204 GiB/62.78 GiB
        VMEM: [|||                                                             ] 5% 3.204 GiB/62.78 GiB
Worker bdd50d1b-ead3-4a40-aa90-7b57c896e278, host Ubuntu-2004-focal-amd64-base
        CPU:  [                                                                ] 0/12 core(s) in use
        RAM:  [|||                                                             ] 5% 3.265 GiB/62.78 GiB
        VMEM: [|||                                                             ] 5% 3.265 GiB/62.78 GiB

Run a windowPoSt, and it gets computed successfully:

lotus-miner proving compute windowed-post 2
Took 1.940918446s
[{"Deadline":2,"Partitions":[{"Index":0,"Skipped":[0]}],"Proofs":[{"PoStProof":7,"ProofBytes":"kYPhINy9KaeeylnSqQedF3x5CiSqQO5RipuZE8nX9Fq+OTxdc8XTShsiBrSuhQ4QlevZlMZGJCRCnQgcJNLWbq7Q801yL0Ybu4l2cz95KDjn58Prhc5ADhogdxij6GugCFAcoTWcefGBiGum6rx8y/jl+dD6TK37B7tP3DAY4EuFwX6CvCEVJPMg/yalWS8ji6vo7q6FFmxSvlX0KAho5gWy6fKBIxkPeuIxT+eKzIVyhYJ4FiWy82ikZPdEiXum"}],"ChainCommitEpoch":0,"ChainCommitRand":null}]
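
For anyone who wants to check such a result in a script, here is a minimal Go sketch that decodes JSON of the shape shown above. The struct and function names (`postResult`, `summarize`) are hypothetical, only the fields visible in the output are modeled, and `ProofBytes` is replaced by a short placeholder. The `Skipped` field is kept as raw JSON, since its encoding is not described here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// partition mirrors one entry of "Partitions" in the output above.
type partition struct {
	Index   uint64
	Skipped json.RawMessage // left undecoded on purpose
}

// proof mirrors one entry of "Proofs"; ProofBytes is base64 in the JSON.
type proof struct {
	PoStProof  int64
	ProofBytes []byte
}

// postResult mirrors one element of the top-level array.
type postResult struct {
	Deadline   uint64
	Partitions []partition
	Proofs     []proof
}

// summarize decodes the output and reports counts for the first result.
func summarize(raw string) (string, error) {
	var results []postResult
	if err := json.Unmarshal([]byte(raw), &results); err != nil {
		return "", err
	}
	if len(results) == 0 {
		return "", errors.New("empty result")
	}
	r := results[0]
	return fmt.Sprintf("deadline=%d partitions=%d proofs=%d",
		r.Deadline, len(r.Partitions), len(r.Proofs)), nil
}

func main() {
	// Same shape as the output above; "AAEC" is a placeholder, not a real proof.
	raw := `[{"Deadline":2,"Partitions":[{"Index":0,"Skipped":[0]}],` +
		`"Proofs":[{"PoStProof":7,"ProofBytes":"AAEC"}],` +
		`"ChainCommitEpoch":0,"ChainCommitRand":null}]`
	s, err := summarize(raw)
	fmt.Println(s, err)
}
```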

In the logs, you can see that it was not scheduled on the worker that was disconnected/disabled:

2023-03-06T09:00:15.800-0500    WARN    advmgr  sealer/sched_post.go:207        failed to check worker session  {"error": "RPC client error: sendRequest failed: Post \"http://127.0.0.1:3456/rpc/v0\": dial tcp 127.0.0.1:3456: connect: connection refused"}
2023-03-06T09:00:23.791-0500    INFO    wdpost  wdpost/wdpost_run.go:260        starting PoSt cycle     {"cycle": "2023-03-06T09:00:23.791-0500", "manual": true, "ts": "[bafy2bzacebpknp4ilwjeyodym7gkobekq7t3mnea747xnxl5mt5raqwadxlkg bafy2bzaceaxllka4s2ykvvs66tos6umoav767pwfizjydtyaxwbr6maeqzt2q bafy2bzacea544o47shrtmhjz7vnrdn2oqsbapdt3muuqkv326ivzygei2eekc bafy2bzaceanynwfjqvvw2g6t2rpa3tzb7mdctrrjf4mg2yxl3xejj4s64yalk bafy2bzacedhhwwsg7mzfkn7owbv623fo6nlumhx54kig37gakdoahssaubavw]", "deadline": 2}
2023-03-06T09:00:23.920-0500    WARN    wdpost  wdpost/wdpost_run.go:235        Checked sectors {"checked": 1, "good": 1}
2023-03-06T09:00:23.921-0500    INFO    wdpost  wdpost/wdpost_run.go:397        running window post     {"cycle": "2023-03-06T09:00:23.791-0500", "chain-random": "/pFw59MIXEi/y25WAMOZ24rFPb6vQY2QmKcAiqwP5/g=", "deadline": {"CurrentEpoch":142080,"PeriodStart":139480,"Index":2,"Open":142060,"Close":142120,"Challenge":139580,"FaultCutoff":141990,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "142080", "skipped": 0}
2023-03-06T09:00:23.921-0500    INFO    advmgr  sealer/manager_post.go:136      generateWindowPoSt maxPartitionSize:2 partitionCount:1
2023-03-06T09:00:23.921-0500    INFO    advmgr  sealer/manager_post.go:218      generateWindowPost      {"index": 0}
2023-03-06T09:00:23.921-0500    DEBUG   advmgr  sealer/sched_post.go:147        sched: not scheduling on PoSt-worker 5f1a2e86-e110-4f71-a65d-da24eea95cbc, worker disabled
2023-03-06T09:00:25.714-0500    WARN    advmgr  sealer/manager_post.go:231      generateWindowPost partition:0, get skip count:0
2023-03-06T09:00:25.714-0500    INFO    wdpost  wdpost/wdpost_run.go:412        computing window post   {"cycle": "2023-03-06T09:00:23.791-0500", "batch": 0, "elapsed": 1.793520635, "skip": 0, "err": null}
2023-03-06T09:00:25.728-0500    INFO    wdpost  wdpost/wdpost_run.go:262        post cycle done {"cycle": "2023-03-06T09:00:23.791-0500", "took": 1.937607964}

magik6k force-pushed the fix/disabled-post-worker-handling branch from 6868e5c to b0ebdb6 on March 6, 2023 14:07
rjan90 linked an issue on Mar 6, 2023 that may be closed by this pull request
arajasek merged commit fea7dfd into master on Mar 6, 2023
arajasek deleted the fix/disabled-post-worker-handling branch on March 6, 2023 15:22