
Miner task scheduling is unaware that ReplicaUpdate uses a GPU resource #10767

Closed
jcrowthe opened this issue Apr 26, 2023 · 2 comments · Fixed by #10806
Labels
area/ux Area: UX · kind/enhancement Kind: Enhancement · need/team-input Hint: Needs Team Input · P2 P2: Should be resolved

Comments

@jcrowthe
Contributor

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

# lotus version
Daemon:  1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus version 1.23.0+mainnet+git.d1d4b35
# lotus-miner version
Daemon:  1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus-miner version 1.23.0+mainnet+git.d1d4b35

Repro Steps

  1. Seal a CC sector
  2. Create a deal and attempt to snap-up the CC sector with the deal.
  3. Watch lotus-miner sealing workers as the sector progresses through the sealing pipeline to the RU state:
Worker d5cde4c2-b6c1-4dee-a825-f145c1679f1b, host worker6
        TASK: RU(1/1)
        CPU:  [|                                                               ] 1/64 core(s) in use
        RAM:  [||||||                                                          ] 8% 46.2 GiB/503.8 GiB
        VMEM: [||||||                                                          ] 8% 42.2 GiB/503.8 GiB
        GPU:  [                                                                ] 0% 0.00/1 gpu(s) in use
        GPU: NVIDIA GeForce RTX 3090, not used
  4. See that the GPU is considered "not used" from the lotus-miner's perspective.

Describe the Bug

The lotus-miner does not consider the ReplicaUpdate (RU) task as using any GPU. This can be acceptable in some cases, but in highly optimized sealing pipelines it creates a scheduling problem: too many GPU-bound tasks get scheduled to the same worker at once.

As a case in point, I’ve been testing snapping up deals, and last night one of my workers hit errors: 2 x PRU2 and 2 x RU tasks were scheduled and executing simultaneously on the same 3090 GPU, which drove the GPU out of memory (see logs below).

Had the lotus-miner been aware that the RU task requires access to a GPU, it would have scheduled only 2 of the 4 tasks described above to the worker node, and the GPU would not have been overloaded.
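For illustration, here is a minimal sketch of the kind of per-task resource declaration that would let the scheduler count RU against the GPU the same way PC2 and C2 are counted. The type names, task-type strings, and values below are hypothetical stand-ins, not the actual lotus API:

```go
package main

import "fmt"

// Hypothetical stand-ins for the miner's per-task resource table;
// these are not the real lotus types or values.
type TaskType string

type Resources struct {
	MinMemory uint64  // bytes required to start the task
	GPUNeeded float64 // fraction of a GPU the task occupies while running
}

const (
	TTPreCommit2          TaskType = "seal/v0/precommit/2"
	TTCommit2             TaskType = "seal/v0/commit/2"
	TTReplicaUpdate       TaskType = "seal/v0/replicaupdate"
	TTProveReplicaUpdate2 TaskType = "seal/v0/provereplicaupdate/2"
)

// resourceTable is what a scheduler would consult before assigning work.
// The reported bug is equivalent to GPUNeeded being 0 for TTReplicaUpdate.
var resourceTable = map[TaskType]Resources{
	TTPreCommit2:          {MinMemory: 30 << 30, GPUNeeded: 1},
	TTCommit2:             {MinMemory: 30 << 30, GPUNeeded: 1},
	TTProveReplicaUpdate2: {MinMemory: 30 << 30, GPUNeeded: 1},
	TTReplicaUpdate:       {MinMemory: 30 << 30, GPUNeeded: 1}, // the fix: RU counts against the GPU
}

func main() {
	for tt, res := range resourceTable {
		fmt.Printf("%-30s gpu=%.0f\n", tt, res.GPUNeeded)
	}
}
```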

As is standard for GPUs with lots of VRAM, in this setup I am running two lotus-worker processes. Each process is configured with the following environment variable limits:

PRU2_32G_MAX_CONCURRENT="1"
RU_32G_MAX_CONCURRENT="1"

This configuration works well for the similarly situated PC2 and C2 tasks, because lotus-miner recognizes both PC2 and C2 as using the GPU and therefore serializes their execution. This bug report notes that the same serialization does not happen for RU and PRU2 because of this lotus-miner bug.
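Building on the hypothetical resourceTable sketched above, a rough, simplified admission check (not lotus's actual scheduler code) shows why declaring RU as a GPU consumer would serialize it alongside PRU2 on a single-GPU worker:

```go
// canSchedule is a simplified, hypothetical admission check (reusing the
// TaskType and resourceTable sketch above): a task is admitted only if the
// worker's remaining GPU capacity covers its declared GPU need.
func canSchedule(task TaskType, running []TaskType, workerGPUs float64) bool {
	used := 0.0
	for _, t := range running {
		used += resourceTable[t].GPUNeeded
	}
	return used+resourceTable[task].GPUNeeded <= workerGPUs
}

// With GPUNeeded=1 declared for RU, a single-GPU worker already running a
// PRU2 would reject an incoming RU instead of overcommitting the GPU:
//
//	canSchedule(TTReplicaUpdate, []TaskType{TTProveReplicaUpdate2}, 1) // false
//	canSchedule(TTReplicaUpdate, nil, 1)                               // true
```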

Logging Information

{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:34.511+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"info","ts":"2023-04-26T06:08:10.119+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1023","msg":"generating tree r last using the GPU"}
{"level":"info","ts":"2023-04-26T06:09:33.360+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 1/8"}
{"level":"info","ts":"2023-04-26T06:10:54.977+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 2/8"}
{"level":"warn","ts":"2023-04-26T06:13:47.614+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"info","ts":"2023-04-26T06:14:36.531+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 3/8"}
thread 'worker-thread-13' panicked at 'failed to add final leaves: GpuError("GPU tools error: Cuda Error: \"out of memory\"")', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1115:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'worker-thread-2' panicked at 'failed to send prepared data: SendError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1126:22
thread '<unnamed>' panicked at 'failed to receive tree_data for tree_r_last: RecvError', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1073:30
thread '<unnamed>' panicked at 'Worker Pool was poisoned', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/yastl-0.1.2/src/wait.rs:50:13
stack backtrace:
   0:          0x386cead - std::backtrace_rs::backtrace::libunwind::trace::h02bdfeed412ba77b
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:          0x386cead - std::backtrace_rs::backtrace::trace_unsynchronized::h5721e9ec9537655b
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x386cead - std::sys_common::backtrace::_print_fmt::h10137ddb502bbd3d
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:66:5
   3:          0x386cead - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1437c86ead09a95e
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:45:22
   4:          0x38c8f2c - core::fmt::write::hdf7e5ac637575708
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/core/src/fmt/mod.rs:1196:17
   5:          0x385e591 - std::io::Write::write_fmt::h00341f121451a31a
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/io/mod.rs:1655:15
   6:          0x386fbc5 - std::sys_common::backtrace::_print::he4ececdb06ab9d22
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:48:5
   7:          0x386fbc5 - std::sys_common::backtrace::print::h55b32276835ed858
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:35:9
   8:          0x386fbc5 - std::panicking::default_hook::{{closure}}::h6999a27c7f7e27cf
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:295:22
   9:          0x386f839 - std::panicking::default_hook::hcc406adc7605c4f5
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:314:9
  10:          0x38702e8 - std::panicking::rust_panic_with_hook::haf8e1f62f460c64d
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:698:17
  11:          0x370de2b - std::panicking::begin_panic::{{closure}}::haaa95e9003bcdd32
  12:          0x370dd54 - std::sys_common::backtrace::__rust_end_short_backtrace::hf6ab743acabea4d4
  13:           0x76a90a - std::panicking::begin_panic::h3fdf381357f6c6e3
  14:          0x370af86 - yastl::wait::WaitGroup::join::hddbf361de6491d12
  15:          0x212c76b - yastl::scope::Scope::zoom::h965c8eb520d64278
  16:          0x25eaf3e - yastl::Pool::scoped::hcc2ab44bba4b79d2
  17:          0x3914752 - storage_proofs_porep::stacked::vanilla::proof::StackedDrg<Tree,G>::generate_tree_r_last::h593b7031f56a4538
  18:          0x239194f - storage_proofs_update::vanilla::EmptySectorUpdate<TreeR>::encode_into::h9356d6b25f3068a8
  19:          0x22fbe46 - filecoin_proofs::api::update::encode_into::he2a5e4585a2ad708
  20:          0x25a9e29 - filecoin_proofs_api::update::empty_sector_update_encode_into_inner::h23f85df5376f2f7b
  21:          0x25a7642 - filecoin_proofs_api::update::empty_sector_update_encode_into::hca473cfdd73b761f
  22:          0x216ff5c - std::panicking::try::h9efe7514770cfff9
  23:          0x256ad02 - filcrypto::util::types::catch_panic_response::h9180b29a50b96637
  24:          0x20aaba4 - empty_sector_update_encode_into
  25:          0x200d272 - _cgo_be609e58ba65_Cfunc_empty_sector_update_encode_into
                               at /tmp/go-build/cgo-gcc-prolog:124:11
  26:           0x805d84 - runtime.asmcgocall
                               at /usr/local/go/src/runtime/asm_amd64.s:848
thread panicked while panicking. aborting.
SIGABRT: abort
PC=0x7f0662abba7c m=17 sigcode=18446744073709551610
signal arrived during cgo execution
@jcrowthe
Contributor Author

jcrowthe commented Apr 26, 2023

PR to fix submitted #10770

@TippyFlitsUK added kind/enhancement, need/team-input, area/ux and removed need/triage, kind/bug labels on Apr 27, 2023
@TippyFlitsUK
Contributor

Many thanks @jcrowthe!!
