
Stuck at FinalizeSector after upgrading to v1.23.0 #10775

Closed
fishjar opened this issue Apr 27, 2023 · 9 comments
Labels
area/sealing kind/bug Kind: Bug need/analysis Hint: Needs Analysis P1 P1: Must be resolved

Comments

@fishjar

fishjar commented Apr 27, 2023

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

lotus-miner version 1.23.0+mainnet+git.d1d4b35ad.dirty

Repro Steps

No response

Describe the Bug

When a sealing job reaches the last step, GET, the file has actually been copied to long-term storage, but the miner does not seem to know it: the job stays in the assigned(2) state and is never rescheduled.

When running lotus-miner sectors list, these sectors are shown as stuck in FinalizeSector.

We found that this bug only occurs when the environment variable GET_32G_MAX_CONCURRENT=1 is set. But when I remove this environment variable, all sectors run GET at the same time, which is not what I want.
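For context, this is how the limit is wired up on our side. A minimal sketch, assuming the variable is exported in the environment of the process that executes the GET (fetch) tasks before it starts; service files and paths will differ per deployment:

  # limit concurrent GET (fetch) tasks for 32GiB sectors on this process
  export GET_32G_MAX_CONCURRENT=1
  # the variable must be present in the worker's environment at startup
  lotus-worker run

With the variable removed, the scheduler no longer caps GET tasks, so every finalizing sector starts fetching at once.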

Slack has two more detailed discussions:
https://filecoinproject.slack.com/archives/CPFTWMY7N/p1681916543426419
https://filecoinproject.slack.com/archives/CPFTWMY7N/p1682462279289049

Logging Information

I didn't find any exception logs about this bug.
@TippyFlitsUK
Contributor

Sincere thanks, Gabe!!

@TippyFlitsUK TippyFlitsUK added P1 P1: Must be resolved need/analysis Hint: Needs Analysis area/sealing and removed need/triage labels Apr 27, 2023
@jcace

jcace commented Apr 27, 2023

+1, same behavior here. I had GET_32G_MAX_CONCURRENT="8" set on my node.

Downgrading to 1.22.1 fixed the issue as well.

@donkabat

Same here:
GET_32G_MAX_CONCURRENT="2"

@hdusten

hdusten commented Apr 28, 2023

Same issue here; please see the following:

  • Initially we had the variable GET_32G_MAX_CONCURRENT commented out while trying to diagnose Finalize oddities we were having prior to moving to 1.23.0+mainnet+git.d1d4b35ad.
  • After upgrading to 1.23.0 we pledged 9 sectors for testing. They all got stuck in Finalize.
  • We brought the variable back by uncommenting it and then restarted all workers and the miner.
  • After this, GET jobs were assigned but did not run, and the sectors were still stuck in Finalize.
  • We then commented the variable back out and restarted everything again (see the sketch after this list).
  • The GET jobs ran and every sector went to Available.
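For anyone hitting the same thing, the workaround above boils down to this. A minimal sketch, assuming the variable is exported in a shell-started worker rather than via a systemd unit or env file (those would need the equivalent change):

  # drop the concurrency cap so the stuck GET jobs can be scheduled
  unset GET_32G_MAX_CONCURRENT
  # restart the worker (and the miner) so the scheduler picks up the change
  lotus-worker run

The trade-off is the one noted in the original report: without the cap, all pending GET tasks run at the same time.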

However, all of these sectors currently throw the following error when we run lotus-miner proving check --slow:

0 0 458 bad (generating vanilla proof: generate_single_vanilla_proof: vanilla_proof failed: SectorId(458)

Caused by:
0: failed to read 16384 bytes from file at offset 3035791360
1: failed to fill whole buffer

Stack backtrace:
0: core::ops::function::FnOnce::call_once
1: <merkletree::store::level_cache::LevelCacheStore<E,R> as merkletree::store::Store>::read_range_into
2: merkletree::merkle::MerkleTree<E,A,S,BaseTreeArity,SubTreeArity,TopTreeArity>::gen_cached_proof
3: merkletree::merkle::MerkleTree<E,A,S,BaseTreeArity,SubTreeArity,TopTreeArity>::gen_cached_proof
4: <storage_proofs_core::merkle::tree::MerkleTreeWrapper<H,S,U,V,W> as storage_proofs_core::merkle::tree::MerkleTreeTrait>::gen_cached_proof
5: core::ops::function::impls::<impl core::ops::function::FnOnce for &mut F>::call_once
6: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
7: rayon::iter::plumbing::Producer::fold_with
8: rayon::iter::plumbing::bridge_producer_consumer::helper
9: std::panicking::try
10: rayon_core::join::join_context::{{closure}}
11: rayon_core::registry::in_worker
12: rayon::iter::plumbing::bridge_producer_consumer::helper
13: std::panicking::try
14: rayon_core::join::join_context::{{closure}}
15: rayon_core::registry::in_worker
16: rayon::iter::plumbing::bridge_producer_consumer::helper
17: std::panicking::try
18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
19: rayon_core::registry::WorkerThread::wait_until_cold
20: rayon_core::join::join_context::{{closure}}
21: rayon_core::registry::in_worker
22: rayon::iter::plumbing::bridge_producer_consumer::helper
23: rayon_core::job::StackJob<L,F,R>::run_inline
24: rayon_core::join::join_context::{{closure}}
25: rayon_core::registry::in_worker
26: rayon::iter::plumbing::bridge_producer_consumer::helper
27: std::panicking::try
28: rayon_core::join::join_context::{{closure}}
29: rayon_core::registry::in_worker
30: rayon::iter::plumbing::bridge_producer_consumer::helper
31: std::panicking::try
32: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
33: rayon_core::registry::WorkerThread::wait_until_cold
34: rayon_core::registry::ThreadBuilder::run
35: std::sys_common::backtrace::__rust_begin_short_backtrace
36: core::ops::function::FnOnce::call_once{{vtable.shim}}
37: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at ./rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/alloc/src/boxed.rs:1872:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
at ./rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/alloc/src/boxed.rs:1872:9
std::sys::unix::thread::Thread::new::thread_start
at ./rustc/cb121987158d69bb894ba1bcc21dc45d1e0a488f/library/std/src/sys/unix/thread.rs:108:17

Thank you!

@steffengy
Contributor

GET_32G_MAX_CONCURRENT being set and leading to stuck GET tasks sounds like the regression that was already introduced in 1.21: #10633

@benjaminh83

I'm seeing the same issue. Also running GET_32G_MAX_CONCURRENT=1
I'm on v1.23.1 as well.

@benjaminh83

Just learned that a fix is in the making (not to be used yet): #10633
It might hit a lotus RC release next week.

@donkabat

The same problem occurs with the long-term storage lotus-worker.

@rjan90
Contributor

rjan90 commented Sep 22, 2023

Closing this issue, as it has been fixed by #10850.

@rjan90 rjan90 closed this as completed Sep 22, 2023