
Miner task scheduling is unaware that ReplicaUpdate uses a GPU resource #10767

Closed
jcrowthe opened this issue Apr 26, 2023 · 2 comments · Fixed by #10806
Labels
area/ux Area: UX · kind/enhancement Kind: Enhancement · need/team-input Hint: Needs Team Input · P2 P2: Should be resolved

Comments

@jcrowthe
Contributor

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

# lotus version
Daemon:  1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus version 1.23.0+mainnet+git.d1d4b35
# lotus-miner version
Daemon:  1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus-miner version 1.23.0+mainnet+git.d1d4b35

Repro Steps

  1. Seal a CC sector
  2. Create a deal and attempt to snap-up the CC sector with the deal.
  3. Watch lotus-miner sealing workers as the sector progresses through the sealing pipeline to the RU state:
Worker d5cde4c2-b6c1-4dee-a825-f145c1679f1b, host worker6
        TASK: RU(1/1)
        CPU:  [|                                                               ] 1/64 core(s) in use
        RAM:  [||||||                                                          ] 8% 46.2 GiB/503.8 GiB
        VMEM: [||||||                                                          ] 8% 42.2 GiB/503.8 GiB
        GPU:  [                                                                ] 0% 0.00/1 gpu(s) in use
        GPU: NVIDIA GeForce RTX 3090, not used
  4. See that the GPU is considered "not used" from the lotus-miner's perspective.

Describe the Bug

The lotus-miner does not consider the ReplicaUpdate (RU) task as using any GPU. This can be acceptable in some cases, but in highly optimized sealing pipelines it creates a scheduling problem: too many GPU-bound tasks get scheduled to the same worker at once.

As a case in point, I’ve been testing snapping up deals, and last night one of my workers hit errors: 2 x PRU2 and 2 x RU tasks were scheduled and executing simultaneously on the same 3090 GPU, which drove the GPU out of memory (see logs below).

Had the lotus-miner been aware that the RU task requires access to a GPU, it would have scheduled only 2 of the 4 tasks described above to the worker node, and the GPU would not have been overloaded.
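For illustration, here is a minimal sketch of the kind of per-task resource declaration that would let the scheduler count RU against the GPU the same way PC2 and C2 are counted. The type names, task-type strings, and values below are hypothetical stand-ins, not the actual lotus API:

```go
package main

import "fmt"

// Hypothetical stand-ins for the miner's per-task resource table;
// these are not the real lotus types or values.
type TaskType string

type Resources struct {
	MinMemory uint64  // bytes required to start the task
	GPUNeeded float64 // fraction of a GPU the task occupies while running
}

const (
	TTPreCommit2          TaskType = "seal/v0/precommit/2"
	TTCommit2             TaskType = "seal/v0/commit/2"
	TTReplicaUpdate       TaskType = "seal/v0/replicaupdate"
	TTProveReplicaUpdate2 TaskType = "seal/v0/provereplicaupdate/2"
)

// resourceTable is what a scheduler would consult before assigning work.
// The reported bug is equivalent to GPUNeeded being 0 for TTReplicaUpdate.
var resourceTable = map[TaskType]Resources{
	TTPreCommit2:          {MinMemory: 30 << 30, GPUNeeded: 1},
	TTCommit2:             {MinMemory: 30 << 30, GPUNeeded: 1},
	TTProveReplicaUpdate2: {MinMemory: 30 << 30, GPUNeeded: 1},
	TTReplicaUpdate:       {MinMemory: 30 << 30, GPUNeeded: 1}, // the fix: RU counts against the GPU
}

func main() {
	for tt, res := range resourceTable {
		fmt.Printf("%-30s gpu=%.0f\n", tt, res.GPUNeeded)
	}
}
```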

As is standard for GPUs with lots of VRAM, in this setup I am running two lotus-worker processes. Each process is configured with the following environment variable limits:

PRU2_32G_MAX_CONCURRENT="1"
RU_32G_MAX_CONCURRENT="1"

This configuration works well for the similarly situated PC2 and C2 tasks, because lotus-miner recognizes both PC2 and C2 as using the GPU and therefore serializes their execution. This bug report notes that the same serialization does not happen for RU and PRU2 because of this lotus-miner bug.
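Building on the hypothetical resourceTable sketched above, a rough, simplified admission check (not lotus's actual scheduler code) shows why declaring RU as a GPU consumer would serialize it alongside PRU2 on a single-GPU worker:

```go
// canSchedule is a simplified, hypothetical admission check (reusing the
// TaskType and resourceTable sketch above): a task is admitted only if the
// worker's remaining GPU capacity covers its declared GPU need.
func canSchedule(task TaskType, running []TaskType, workerGPUs float64) bool {
	used := 0.0
	for _, t := range running {
		used += resourceTable[t].GPUNeeded
	}
	return used+resourceTable[task].GPUNeeded <= workerGPUs
}

// With GPUNeeded=1 declared for RU, a single-GPU worker already running a
// PRU2 would reject an incoming RU instead of overcommitting the GPU:
//
//	canSchedule(TTReplicaUpdate, []TaskType{TTProveReplicaUpdate2}, 1) // false
//	canSchedule(TTReplicaUpdate, nil, 1)                               // true
```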

Logging Information

{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:34.511+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"info","ts":"2023-04-26T06:08:10.119+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1023","msg":"generating tree r last using the GPU"}
{"level":"info","ts":"2023-04-26T06:09:33.360+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 1/8"}
{"level":"info","ts":"2023-04-26T06:10:54.977+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 2/8"}
{"level":"warn","ts":"2023-04-26T06:13:47.614+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"info","ts":"2023-04-26T06:14:36.531+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 3/8"}
thread 'worker-thread-13' panicked at 'failed to add final leaves: GpuError("GPU tools error: Cuda Error: \"out of memory\"")', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1115:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'worker-thread-2' panicked at 'failed to send prepared data: SendError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1126:22
thread '<unnamed>' panicked at 'failed to receive tree_data for tree_r_last: RecvError', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1073:30
thread '<unnamed>' panicked at 'Worker Pool was poisoned', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/yastl-0.1.2/src/wait.rs:50:13
stack backtrace:
   0:          0x386cead - std::backtrace_rs::backtrace::libunwind::trace::h02bdfeed412ba77b
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:          0x386cead - std::backtrace_rs::backtrace::trace_unsynchronized::h5721e9ec9537655b
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x386cead - std::sys_common::backtrace::_print_fmt::h10137ddb502bbd3d
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:66:5
   3:          0x386cead - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1437c86ead09a95e
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:45:22
   4:          0x38c8f2c - core::fmt::write::hdf7e5ac637575708
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/core/src/fmt/mod.rs:1196:17
   5:          0x385e591 - std::io::Write::write_fmt::h00341f121451a31a
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/io/mod.rs:1655:15
   6:          0x386fbc5 - std::sys_common::backtrace::_print::he4ececdb06ab9d22
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:48:5
   7:          0x386fbc5 - std::sys_common::backtrace::print::h55b32276835ed858
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:35:9
   8:          0x386fbc5 - std::panicking::default_hook::{{closure}}::h6999a27c7f7e27cf
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:295:22
   9:          0x386f839 - std::panicking::default_hook::hcc406adc7605c4f5
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:314:9
  10:          0x38702e8 - std::panicking::rust_panic_with_hook::haf8e1f62f460c64d
                               at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:698:17
  11:          0x370de2b - std::panicking::begin_panic::{{closure}}::haaa95e9003bcdd32
  12:          0x370dd54 - std::sys_common::backtrace::__rust_end_short_backtrace::hf6ab743acabea4d4
  13:           0x76a90a - std::panicking::begin_panic::h3fdf381357f6c6e3
  14:          0x370af86 - yastl::wait::WaitGroup::join::hddbf361de6491d12
  15:          0x212c76b - yastl::scope::Scope::zoom::h965c8eb520d64278
  16:          0x25eaf3e - yastl::Pool::scoped::hcc2ab44bba4b79d2
  17:          0x3914752 - storage_proofs_porep::stacked::vanilla::proof::StackedDrg<Tree,G>::generate_tree_r_last::h593b7031f56a4538
  18:          0x239194f - storage_proofs_update::vanilla::EmptySectorUpdate<TreeR>::encode_into::h9356d6b25f3068a8
  19:          0x22fbe46 - filecoin_proofs::api::update::encode_into::he2a5e4585a2ad708
  20:          0x25a9e29 - filecoin_proofs_api::update::empty_sector_update_encode_into_inner::h23f85df5376f2f7b
  21:          0x25a7642 - filecoin_proofs_api::update::empty_sector_update_encode_into::hca473cfdd73b761f
  22:          0x216ff5c - std::panicking::try::h9efe7514770cfff9
  23:          0x256ad02 - filcrypto::util::types::catch_panic_response::h9180b29a50b96637
  24:          0x20aaba4 - empty_sector_update_encode_into
  25:          0x200d272 - _cgo_be609e58ba65_Cfunc_empty_sector_update_encode_into
                               at /tmp/go-build/cgo-gcc-prolog:124:11
  26:           0x805d84 - runtime.asmcgocall
                               at /usr/local/go/src/runtime/asm_amd64.s:848
thread panicked while panicking. aborting.
SIGABRT: abort
PC=0x7f0662abba7c m=17 sigcode=18446744073709551610
signal arrived during cgo execution
@jcrowthe
Contributor Author

jcrowthe commented Apr 26, 2023

PR to fix submitted #10770

@TippyFlitsUK added kind/enhancement, need/team-input, area/ux and removed need/triage, kind/bug labels on Apr 27, 2023
@TippyFlitsUK
Contributor

Many thanks @jcrowthe!!
