
Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler #10633

Closed

Conversation

steffengy (Contributor) commented Apr 6, 2023

!!!! Superseded by PR #10850 !!!!

With 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs only shows assigned GET tasks for the miner, and the scheduler is stuck.
I believe this is a regression from 1.21 unifying the counters: for GETs, PrepType and TaskType are both seal/v0/fetch, so tasks end up blocked because they are already counted towards the limit.

So while #9407 is now enforced, the overall state is broken.

This is more of a draft PR that I've confirmed appears to work; if you have a better way to fix it, let me know.
It works by not counting the same SchedId twice towards the limit.
@magik6k @rjan90

As seen in the discussion below, the patch isn't 100% correct yet, but it is closer to the previous behavior and at least doesn't leave the scheduler completely stuck.
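For context, a minimal, self-contained sketch of the counting idea described above (type and method names are illustrative, not the actual lotus code): a request's SchedId occupies at most one slot per seal task type, so moving from preparing to executing does not consume a second slot.

```go
package main

import (
	"fmt"
	"sync"
)

type schedID string
type sealTaskType string

// taskCounter tracks which SchedIds currently hold a slot per seal task type.
type taskCounter struct {
	lk     sync.Mutex
	byType map[sealTaskType]map[schedID]struct{}
}

func newTaskCounter() *taskCounter {
	return &taskCounter{byType: map[sealTaskType]map[schedID]struct{}{}}
}

func (tc *taskCounter) Add(tt sealTaskType, id schedID) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	if tc.byType[tt] == nil {
		tc.byType[tt] = map[schedID]struct{}{}
	}
	tc.byType[tt][id] = struct{}{} // adding the same SchedId again is a no-op
}

func (tc *taskCounter) Free(tt sealTaskType, id schedID) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	delete(tc.byType[tt], id)
}

func (tc *taskCounter) Count(tt sealTaskType) int {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	return len(tc.byType[tt])
}

func main() {
	tc := newTaskCounter()
	tc.Add("seal/v0/fetch", "sched-1")     // counted while preparing
	tc.Add("seal/v0/fetch", "sched-1")     // same request moves to executing: still one slot
	fmt.Println(tc.Count("seal/v0/fetch")) // 1, so GET_32G_MAX_CONCURRENT=1 is not exhausted twice
}
```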

@steffengy steffengy requested a review from a team as a code owner April 6, 2023 17:08
@steffengy steffengy force-pushed the fix/1.21-regr-get-32g branch 3 times, most recently from f7a5409 to 7040db0 on April 6, 2023 17:32

magik6k (Contributor) left a comment:
Great catch!

I believe this is not counting quite correctly, but the fix looks good otherwise.

  tc.lk.Lock()
  defer tc.lk.Unlock()
- tc.taskCounters[tt]--
+ delete(tc.getUnlocked(tt), schedID)
magik6k (Contributor) commented:

So I believe this has a bug which makes it not work as we'd expect:

  • First we call .Add for preparing
  • Then we call .Add on active in .withResources
  • In .withResources we call .Free on preparing
  • Now the task runs - and we don't count the resources
  • Then we call .Free on preparing

This can be fixed by swapping map[sealtasks.SealTaskType]map[uuid.UUID]bool for map[sealtasks.SealTaskType]map[uuid.UUID]int and counting how many times Add/Free was called.
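For illustration, a rough, self-contained simulation of the call sequence listed above (names are placeholders, not the lotus types): with a shared SchedId set, the Free on preparing stops counting the task while it is still running, whereas per-SchedId Add/Free counts keep it counted until the final Free.

```go
package main

import "fmt"

// counts maps a SchedId to the number of outstanding Add calls.
type counts map[string]int

func (c counts) add(id string) { c[id]++ }

func (c counts) free(id string) {
	if c[id] <= 1 {
		delete(c, id) // drop zeroed entries so len() only reflects live requests
	} else {
		c[id]--
	}
}

func main() {
	set := map[string]bool{} // set-based approach: a plain set of SchedIds
	cnt := counts{}          // suggested approach: count Add/Free per SchedId
	id := "sched-1"

	// 1) Add for preparing, 2) Add on active (in withResources)
	set[id] = true
	cnt.add(id)
	set[id] = true
	cnt.add(id)

	// 3) withResources calls Free on preparing
	delete(set, id)
	cnt.free(id)

	// 4) the task is now running
	fmt.Println(len(set), len(cnt)) // 0 1 -- the set no longer counts the running task

	// 5) final Free once the task is done
	delete(set, id)
	cnt.free(id)
	fmt.Println(len(set), len(cnt)) // 0 0
}
```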

steffengy (Contributor, Author) commented:

Yeah, it's allowing more than it should as-is.
Trying that, and counting +1/-1 for each Add/Free invocation, leads to wrong stuck behavior again:

ID        Sector  Worker    Hostname    Task  State        Time
00000000  23      4154622e  x  GET   assigned(1)  21.2s
00000000  24      4154622e  x  GET   assigned(2)  21.2s

That's interesting, since the only reasonable idea I have is that there's an invocation missing somewhere.

steffengy (Contributor, Author) commented:

@magik6k
Apparently the issue is that the last Free for a SchedId happens after the last WINDOW assignPreparingWork (which in turn says "not scheduling on worker" for startPreparing).
A quick test of returning update=true in schedWorker::waitForUpdates, to force a regular re-run of that, leads to the GET backlog getting fully resolved over time as well.

steffengy (Contributor, Author) commented:

@magik6k Any update on this? This is now broken in 1.21, 1.22 and 1.23.

magik6k (Contributor) commented:

> only reasonable idea I have is that there's an invocation missing somewhere.

How did you implement Free with the counter? Did you delete from the map when the counter reached zero? If not, the tasks := a.taskCounters.Get(tt) [...] if len(tasks) >= needRes.MaxConcurrent check below would not work correctly, as it would still count the zero entries.
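To illustrate the point (with placeholder names, not the actual CanHandleRequest code): if Free only decrements and never deletes the zeroed entry, a len()-based limit check keeps blocking even though nothing is running.

```go
package main

import "fmt"

const maxConcurrent = 1 // stand-in for needRes.MaxConcurrent

// canHandle mirrors the shape of the check quoted above:
// block when len(tasks) >= MaxConcurrent.
func canHandle(tasks map[string]int) bool {
	return len(tasks) < maxConcurrent
}

func main() {
	tasks := map[string]int{}

	tasks["sched-1"]++ // Add
	tasks["sched-1"]-- // a Free that decrements but forgets to delete at zero

	fmt.Println(canHandle(tasks)) // false: the stale zero-valued entry still counts

	delete(tasks, "sched-1") // deleting the entry once it reaches zero fixes the check
	fmt.Println(canHandle(tasks)) // true
}
```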

steffengy (Contributor, Author) commented:

@magik6k Yeah, I've accounted for that:
steffengy@a3b7ec2
(I think otherwise it also wouldn't work when entering the scheduler more often by returning update=true in schedWorker::waitForUpdates, since that just evaluates the same state at a later time. So it's a timing/ordering issue: the last Free happens after the next scheduler entry for startPreparing, and it stays stuck until the next scheduling run.)

magik6k (Contributor) commented:

I've spent some time playing with it (in #10850), and I think I found the issue with the integer counters - 240fc6d.

It would be great if you could check whether it works for you.
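For readers following along: the commit list further down describes the eventual fix as "worker sched: Send taskDone notifs after tasks are done". Purely as an illustration of that ordering idea (this is not the actual change in 240fc6d, and all names below are made up): release the request's counter before waking the scheduler, so the next scheduling pass sees an up-to-date count.

```go
package main

import "fmt"

func main() {
	taskDone := make(chan struct{}, 1)      // hypothetical "wake the scheduler" channel
	counted := map[string]int{"sched-1": 1} // hypothetical per-SchedId counter

	runTask := func(id string) {
		// ... the task body would run here ...
		delete(counted, id)    // free the counter first
		taskDone <- struct{}{} // only then notify the scheduler
	}

	runTask("sched-1")
	<-taskDone
	fmt.Println(len(counted)) // 0: the scheduler re-evaluates against an up-to-date count
}
```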

steffengy (Contributor, Author) commented:

Great! Yeah, in a quick test that looks like it's working, and fix-wise it seems plausible that that was the issue.
Closing this PR in favor of yours.

@@ -416,7 +416,7 @@ assignLoop:
}

needRes := worker.Info.Resources.ResourceSpec(todo.Sector.ProofType, todo.TaskType)
- if worker.active.CanHandleRequest(todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+ if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
magik6k (Contributor) commented:

Not related to this PR, but this should be:

Suggested change:
- if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+ if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startReady", worker.Info) {

@TippyFlitsUK (Contributor) commented:
The Lotus Team is actively working on this issue and will provide an update ASAP! 🙏

magik6k added a commit that referenced this pull request May 10, 2023
@steffengy steffengy closed this May 10, 2023
shrenujbansal pushed a commit that referenced this pull request May 10, 2023
* Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler

If you have 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs will only show assigned tasks for GET of the miner
and is stuck.
I believe this to be a regression of 1.21 unifying the counters, in the case of GETs where PrepType and TaskType
both being seal/v0/fetch leading to a state where tasks are blocked since already counted towards the limit.

* itests: Repro issue from PR #10633

* make counters int (non-working)

* fix: worker sched: Send taskDone notifs after tasks are done

* itests: Make TestPledgeMaxConcurrentGet actually reproduce the issue

* make the linter happy

---------

Co-authored-by: Steffen Butzer <steffen.butzer@outlook.com>
shrenujbansal pushed a commit that referenced this pull request May 10, 2023