Checklist
This is not a security-related bug/issue. If it is, please follow the security policy.
I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
I did not make any code changes to lotus.
Lotus component
lotus daemon - chain sync
lotus fvm/fevm - Lotus FVM and FEVM interactions
lotus miner/worker - sealing
lotus miner - proving(WindowPoSt/WinningPoSt)
lotus JSON-RPC API
lotus message management (mpool)
Other
Lotus Version
# lotus version
Daemon: 1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus version 1.23.0+mainnet+git.d1d4b35
# lotus-miner version
Daemon: 1.23.0+mainnet+git.d1d4b35+api1.5.0
Local: lotus-miner version 1.23.0+mainnet+git.d1d4b35
Repro Steps
1. Seal a CC sector.
2. Create a deal and attempt to snap up the CC sector with the deal.
3. Watch lotus-miner sealing workers as the sector progresses through the sealing pipeline to the RU state:
Worker d5cde4c2-b6c1-4dee-a825-f145c1679f1b, host worker6
TASK: RU(1/1)
CPU: [| ] 1/64 core(s) in use
RAM: [|||||| ] 8% 46.2 GiB/503.8 GiB
VMEM: [|||||| ] 8% 42.2 GiB/503.8 GiB
GPU: [ ] 0% 0.00/1 gpu(s) in use
GPU: NVIDIA GeForce RTX 3090, not used
See that the GPU is considered "not used" from the lotus-miner's perspective.
Describe the Bug
The lotus-miner does not consider the ReplicaUpdate (RU) task as using any GPU. This may be acceptable in some setups, but in highly optimized sealing pipelines it creates a scheduling problem where too many GPU-bound tasks are scheduled to a worker at the same time.
As a case in point, I have been testing snapping up deals, and last night one of my workers hit errors: 2 x PRU2 and 2 x RU tasks were all scheduled and executing simultaneously on the same 3090 GPU, causing the GPU to run out of memory (see logs below).
Had the lotus-miner been aware that the RU task requires access to a GPU, it would have scheduled only 2 of the 4 tasks described above to the worker node, and the GPU would not have been overloaded.
As is common for GPUs with lots of VRAM, in this setup I am running two lotus-worker processes. Each process is configured with environment variable limits along the following lines:
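(Illustrative only: the exact variable names and values below are assumptions following lotus's documented [TASK]_[SECTOR_SIZE]_[LIMIT] resource-override scheme, assuming 32 GiB sectors, and are not the precise configuration from this setup.)

PC2_32G_MAX_CONCURRENT=1
C2_32G_MAX_CONCURRENT=1
RU_32G_MAX_CONCURRENT=1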
This configuration works well for the similarly situated PC2 and C2 tasks, because lotus-miner recognizes both PC2 and C2 as using the GPU and therefore serializes their execution. This bug report is to note that the same does not happen for RU and PRU2, due to this lotus-miner bug.
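For context, my understanding (an assumption on my part, not verified against this exact release) is that the scheduler takes per-task GPU requirements from the resource table in storage/sealer/storiface/resources.go, and that the TTReplicaUpdate entries there currently declare no GPU utilization. A rough sketch of the kind of entry I would expect to fix this (the memory figures are placeholders, not proposed values):

// Hypothetical excerpt from the ResourceTable in
// storage/sealer/storiface/resources.go; the shape follows the existing
// PC2/C2 entries, and the numbers are assumptions.
sealtasks.TTReplicaUpdate: {
	abi.RegisteredSealProof_StackedDrg32GiBV1: Resources{
		MaxMemory:      30 << 30, // placeholder memory figures
		MinMemory:      30 << 30,
		BaseMinMemory:  1 << 30,
		MaxParallelism: -1,
		GPUUtilization: 1.0, // presumably 0 today; declaring GPU use would let the scheduler serialize RU with PC2/C2/PRU2
	},
},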
Logging Information
{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:20.011Z","logger":"fsutil","caller":"fsutil/filesize_unix.go:43","msg":"very slow file size check","took":20.35671601,"path":"/mnt/worker2/update/s-t02028544-9620"}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:05:54.626Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":11.1010481}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:16.141Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":20.280543506}
{"level":"warn","ts":"2023-04-26T06:06:34.511+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"warn","ts":"2023-04-26T06:06:37.303Z","logger":"stores","caller":"paths/localstorage_cached.go:123","msg":"getting usage is slow, falling back to previous usage","path":"/mnt/worker2/update/s-t02028544-9620","fallback":34359746560,"age":19.820480956}
{"level":"info","ts":"2023-04-26T06:08:10.119+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1023","msg":"generating tree r last using the GPU"}
{"level":"info","ts":"2023-04-26T06:09:33.360+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 1/8"}
{"level":"info","ts":"2023-04-26T06:10:54.977+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 2/8"}
{"level":"warn","ts":"2023-04-26T06:13:47.614+0000","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.24.1/src/gpu/locks.rs:259","msg":"GPU Multiexp failed! Falling back to CPU... Error: EC GPU error: GPU tools error: Cuda Error: \"out of memory\""}
{"level":"info","ts":"2023-04-26T06:14:36.531+0000","logger":"storage_proofs_porep::stacked::vanilla::proof","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1108","msg":"building base tree_r_last with GPU 3/8"}
thread 'worker-thread-13' panicked at 'failed to add final leaves: GpuError("GPU tools error: Cuda Error: \"out of memory\"")', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1115:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'failed to receive tree_data for tree_r_last: RecvError', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1073:30
thread 'worker-thread-2' panicked at 'failed to send prepared data: SendError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-porep-14.0.0/src/stacked/vanilla/proof.rs:1126:22
thread '<unnamed>' panicked at 'Worker Pool was poisoned', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/yastl-0.1.2/src/wait.rs:50:13
stack backtrace:
0: 0x386cead - std::backtrace_rs::backtrace::libunwind::trace::h02bdfeed412ba77b
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x386cead - std::backtrace_rs::backtrace::trace_unsynchronized::h5721e9ec9537655b
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x386cead - std::sys_common::backtrace::_print_fmt::h10137ddb502bbd3d
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:66:5
3: 0x386cead - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1437c86ead09a95e
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:45:22
4: 0x38c8f2c - core::fmt::write::hdf7e5ac637575708
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/core/src/fmt/mod.rs:1196:17
5: 0x385e591 - std::io::Write::write_fmt::h00341f121451a31a
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/io/mod.rs:1655:15
6: 0x386fbc5 - std::sys_common::backtrace::_print::he4ececdb06ab9d22
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:48:5
7: 0x386fbc5 - std::sys_common::backtrace::print::h55b32276835ed858
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys_common/backtrace.rs:35:9
8: 0x386fbc5 - std::panicking::default_hook::{{closure}}::h6999a27c7f7e27cf
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:295:22
9: 0x386f839 - std::panicking::default_hook::hcc406adc7605c4f5
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:314:9
10: 0x38702e8 - std::panicking::rust_panic_with_hook::haf8e1f62f460c64d
at /rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/panicking.rs:698:17
11: 0x370de2b - std::panicking::begin_panic::{{closure}}::haaa95e9003bcdd32
12: 0x370dd54 - std::sys_common::backtrace::__rust_end_short_backtrace::hf6ab743acabea4d4
13: 0x76a90a - std::panicking::begin_panic::h3fdf381357f6c6e3
14: 0x370af86 - yastl::wait::WaitGroup::join::hddbf361de6491d12
15: 0x212c76b - yastl::scope::Scope::zoom::h965c8eb520d64278
16: 0x25eaf3e - yastl::Pool::scoped::hcc2ab44bba4b79d2
17: 0x3914752 - storage_proofs_porep::stacked::vanilla::proof::StackedDrg<Tree,G>::generate_tree_r_last::h593b7031f56a4538
18: 0x239194f - storage_proofs_update::vanilla::EmptySectorUpdate<TreeR>::encode_into::h9356d6b25f3068a8
19: 0x22fbe46 - filecoin_proofs::api::update::encode_into::he2a5e4585a2ad708
20: 0x25a9e29 - filecoin_proofs_api::update::empty_sector_update_encode_into_inner::h23f85df5376f2f7b
21: 0x25a7642 - filecoin_proofs_api::update::empty_sector_update_encode_into::hca473cfdd73b761f
22: 0x216ff5c - std::panicking::try::h9efe7514770cfff9
23: 0x256ad02 - filcrypto::util::types::catch_panic_response::h9180b29a50b96637
24: 0x20aaba4 - empty_sector_update_encode_into
25: 0x200d272 - _cgo_be609e58ba65_Cfunc_empty_sector_update_encode_into
at /tmp/go-build/cgo-gcc-prolog:124:11
26: 0x805d84 - runtime.asmcgocall
at /usr/local/go/src/runtime/asm_amd64.s:848
thread panicked while panicking. aborting.
SIGABRT: abort
PC=0x7f0662abba7c m=17 sigcode=18446744073709551610
signal arrived during cgo execution