
Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler #10633

Closed

Conversation

steffengy (Contributor) commented Apr 6, 2023

!!!! Superseded by PR #10850 !!!!

With 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs only shows assigned GET tasks for the miner, and the scheduler is stuck.
I believe this is a regression from 1.21 unifying the counters: for GETs, PrepType and TaskType are both seal/v0/fetch, so tasks end up blocked because they are already counted towards the limit.

So while #9407 is now enforced, the overall state is broken.

This is more of a draft PR that I've confirmed appears to work; if you have a better way to fix it, let me know.
It works by not counting the same SchedId twice towards the limit.
@magik6k @rjan90

As seen in the discussion below, the patch isn't 100% correct yet, but it is closer to the previous behavior and at least doesn't leave the scheduler completely stuck.
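For context, a minimal, self-contained sketch of the counting idea described above (type and method names are illustrative, not the actual lotus code): a request's SchedId occupies at most one slot per seal task type, so moving from preparing to executing does not consume a second slot.

```go
package main

import (
	"fmt"
	"sync"
)

type schedID string
type sealTaskType string

// taskCounter tracks which SchedIds currently hold a slot per seal task type.
type taskCounter struct {
	lk     sync.Mutex
	byType map[sealTaskType]map[schedID]struct{}
}

func newTaskCounter() *taskCounter {
	return &taskCounter{byType: map[sealTaskType]map[schedID]struct{}{}}
}

func (tc *taskCounter) Add(tt sealTaskType, id schedID) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	if tc.byType[tt] == nil {
		tc.byType[tt] = map[schedID]struct{}{}
	}
	tc.byType[tt][id] = struct{}{} // adding the same SchedId again is a no-op
}

func (tc *taskCounter) Free(tt sealTaskType, id schedID) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	delete(tc.byType[tt], id)
}

func (tc *taskCounter) Count(tt sealTaskType) int {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	return len(tc.byType[tt])
}

func main() {
	tc := newTaskCounter()
	tc.Add("seal/v0/fetch", "sched-1")     // counted while preparing
	tc.Add("seal/v0/fetch", "sched-1")     // same request moves to executing: still one slot
	fmt.Println(tc.Count("seal/v0/fetch")) // 1, so GET_32G_MAX_CONCURRENT=1 is not exhausted twice
}
```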

@steffengy steffengy requested a review from a team as a code owner April 6, 2023 17:08
@steffengy steffengy force-pushed the fix/1.21-regr-get-32g branch 3 times, most recently from f7a5409 to 7040db0 on April 6, 2023 17:32

magik6k (Contributor) left a comment:
Great catch!

I believe this is not counting quite correctly, but the fix looks good otherwise.

  tc.lk.Lock()
  defer tc.lk.Unlock()
- tc.taskCounters[tt]--
+ delete(tc.getUnlocked(tt), schedID)
magik6k (Contributor) commented:

So I believe this has a bug which makes it not work as we'd expect:

  • First we call .Add for preparing
  • Then we call .Add on active in .withResources
  • In .withResources we call .Free on preparing
  • Now the task runs - and we don't count the resources
  • Then we call .Free on preparing

This can be fixed by swapping map[sealtasks.SealTaskType]map[uuid.UUID]bool for map[sealtasks.SealTaskType]map[uuid.UUID]int and counting how many times Add/Free was called.
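For illustration, a rough, self-contained simulation of the call sequence listed above (names are placeholders, not the lotus types): with a shared SchedId set, the Free on preparing stops counting the task while it is still running, whereas per-SchedId Add/Free counts keep it counted until the final Free.

```go
package main

import "fmt"

// counts maps a SchedId to the number of outstanding Add calls.
type counts map[string]int

func (c counts) add(id string) { c[id]++ }

func (c counts) free(id string) {
	if c[id] <= 1 {
		delete(c, id) // drop zeroed entries so len() only reflects live requests
	} else {
		c[id]--
	}
}

func main() {
	set := map[string]bool{} // set-based approach: a plain set of SchedIds
	cnt := counts{}          // suggested approach: count Add/Free per SchedId
	id := "sched-1"

	// 1) Add for preparing, 2) Add on active (in withResources)
	set[id] = true
	cnt.add(id)
	set[id] = true
	cnt.add(id)

	// 3) withResources calls Free on preparing
	delete(set, id)
	cnt.free(id)

	// 4) the task is now running
	fmt.Println(len(set), len(cnt)) // 0 1 -- the set no longer counts the running task

	// 5) final Free once the task is done
	delete(set, id)
	cnt.free(id)
	fmt.Println(len(set), len(cnt)) // 0 0
}
```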

steffengy (Contributor, Author) commented:

Yeah, it's allowing more than it should as-is.
Trying that, and counting +1/-1 for each Add/Free invocation, leads to wrong stuck behavior again:

ID        Sector  Worker    Hostname    Task  State        Time
00000000  23      4154622e  x  GET   assigned(1)  21.2s
00000000  24      4154622e  x  GET   assigned(2)  21.2s

That's interesting, since the only reasonable idea I have is that there's an invocation missing somewhere.

steffengy (Contributor, Author) commented:

@magik6k
Apparently the issue is that the last Free for a SchedId happens after the last WINDOW assignPreparingWork (which in turn says "not scheduling on worker" for startPreparing).
A quick test of returning update=true in schedWorker::waitForUpdates, to force a regular re-run of that, leads to the GET backlog getting fully resolved over time as well.

steffengy (Contributor, Author) commented:

@magik6k Any update on this? This is now broken in 1.21, 1.22 and 1.23.

magik6k (Contributor) commented:

> only reasonable idea I have is that there's an invocation missing somewhere.

How did you implement Free with the counter? Did you delete from the map when the counter reached zero? If not, the tasks := a.taskCounters.Get(tt) [...] if len(tasks) >= needRes.MaxConcurrent check below would not work correctly, as it would still count the zero entries.
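To illustrate the point (with placeholder names, not the actual CanHandleRequest code): if Free only decrements and never deletes the zeroed entry, a len()-based limit check keeps blocking even though nothing is running.

```go
package main

import "fmt"

const maxConcurrent = 1 // stand-in for needRes.MaxConcurrent

// canHandle mirrors the shape of the check quoted above:
// block when len(tasks) >= MaxConcurrent.
func canHandle(tasks map[string]int) bool {
	return len(tasks) < maxConcurrent
}

func main() {
	tasks := map[string]int{}

	tasks["sched-1"]++ // Add
	tasks["sched-1"]-- // a Free that decrements but forgets to delete at zero

	fmt.Println(canHandle(tasks)) // false: the stale zero-valued entry still counts

	delete(tasks, "sched-1") // deleting the entry once it reaches zero fixes the check
	fmt.Println(canHandle(tasks)) // true
}
```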

steffengy (Contributor, Author) commented:

@magik6k Yeah, I've accounted for that:
steffengy@a3b7ec2
(I think otherwise it also wouldn't work when entering the scheduler more often by returning update=true in schedWorker::waitForUpdates, since that just evaluates the same state at a later time. So it's a timing/ordering issue: the last Free happens after the next scheduler entry for startPreparing, and it stays stuck until the next scheduling run.)

magik6k (Contributor) commented:

I've spent some time playing with it (in #10850), and I think I found the issue with the integer counters - 240fc6d.

It would be great if you could check whether it works for you.
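For readers following along: the commit list further down describes the eventual fix as "worker sched: Send taskDone notifs after tasks are done". Purely as an illustration of that ordering idea (this is not the actual change in 240fc6d, and all names below are made up): release the request's counter before waking the scheduler, so the next scheduling pass sees an up-to-date count.

```go
package main

import "fmt"

func main() {
	taskDone := make(chan struct{}, 1)      // hypothetical "wake the scheduler" channel
	counted := map[string]int{"sched-1": 1} // hypothetical per-SchedId counter

	runTask := func(id string) {
		// ... the task body would run here ...
		delete(counted, id)    // free the counter first
		taskDone <- struct{}{} // only then notify the scheduler
	}

	runTask("sched-1")
	<-taskDone
	fmt.Println(len(counted)) // 0: the scheduler re-evaluates against an up-to-date count
}
```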

steffengy (Contributor, Author) commented:

Great! Yeah, in a quick test that looks like it's working, and fix-wise it seems plausible that that was the issue.
Closing this PR in favor of yours.

@@ -416,7 +416,7 @@ assignLoop:
}

needRes := worker.Info.Resources.ResourceSpec(todo.Sector.ProofType, todo.TaskType)
- if worker.active.CanHandleRequest(todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+ if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
magik6k (Contributor) commented:

Not related to this PR, but this should be:

Suggested change:
- if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+ if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startReady", worker.Info) {

@TippyFlitsUK (Contributor) commented:
The Lotus Team is actively working on this issue and will provide an update ASAP! 🙏

magik6k added a commit that referenced this pull request May 10, 2023
@steffengy steffengy closed this May 10, 2023
shrenujbansal pushed a commit that referenced this pull request May 10, 2023
* Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler

If you have 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs will only show assigned tasks for GET of the miner
and is stuck.
I believe this to be a regression of 1.21 unifying the counters, in the case of GETs where PrepType and TaskType
both being seal/v0/fetch leading to a state where tasks are blocked since already counted towards the limit.

* itests: Repro issue from PR #10633

* make counters int (non-working)

* fix: worker sched: Send taskDone notifs after tasks are done

* itests: Make TestPledgeMaxConcurrentGet actually reproduce the issue

* make the linter happy

---------

Co-authored-by: Steffen Butzer <steffen.butzer@outlook.com>
shrenujbansal pushed a commit that referenced this pull request May 10, 2023