Worker can no longer detect faulty dependencies causing infinite loop #4439
What happened:

For cases where there are broken dependencies, e.g. deserialization isn't working, the workers no longer have any way to detect this failure and retry indefinitely instead of raising an exception.

What you expected to happen:

The computation terminates with an appropriate exception.

Minimal Complete Verifiable Example:

import pytest

from distributed import Nanny
from distributed.utils_test import gen_cluster

@gen_cluster(client=True, Worker=Nanny, timeout=4600)
async def test_get_data_faulty_dep(c, s, a, b):
    """This test creates a broken dependency and forces serialization by
    requiring it to be submitted to another worker. The computation should
    eventually finish by flagging the dep as bad and raising an appropriate
    exception."""

    class BrokenDeserialization:
        def __setstate__(self, *state):
            raise AttributeError()

        def __getstate__(self, *args):
            return ""

    def create():
        return BrokenDeserialization()

    def collect(*args):
        return args

    fut1 = c.submit(create, workers=[a.name])
    fut2 = c.submit(collect, fut1, workers=[b.name])

    with pytest.raises(RuntimeError, match="Could not find dependencies for collect-"):
        await fut2.result()
Anything else we need to know?:

I tried addressing this issue in #4360, but I believe it is difficult to resolve without a more fundamental rework of the way task states are managed.

I explained the problem behind this infinite loop in #4360 (comment): it is a race condition between the worker-internal missing-dependency handler and the scheduler escalation. The scheduler escalation triggers a release and the worker forgets the dependency. TaskState.suspicious_count is therefore reset, and the exception is never triggered.
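
To make the race concrete, here is a minimal, self-contained sketch of the interleaving. The names and the threshold are illustrative stand-ins, not distributed's actual worker code:

MAX_SUSPICIOUS = 5  # hypothetical give-up threshold

class TaskState:
    """Worker-side bookkeeping for one task (simplified)."""
    def __init__(self, key):
        self.key = key
        self.suspicious_count = 0

tasks = {}

def handle_missing_dep(key):
    # Worker-internal handler: count failed fetch attempts and give up
    # once a task has looked suspicious too many times.
    ts = tasks.setdefault(key, TaskState(key))
    ts.suspicious_count += 1
    if ts.suspicious_count > MAX_SUSPICIOUS:
        raise RuntimeError(f"Could not find dependencies for {key}")

def handle_release(key):
    # Scheduler escalation: the worker forgets the task entirely,
    # and the suspicious count is lost with it.
    tasks.pop(key, None)

# Interleaving that loops forever: the scheduler releases the key before
# the worker accumulates enough failures, so the counter restarts from
# zero every round and the RuntimeError above is never raised.
for _ in range(1000):
    for _ in range(3):  # fewer than MAX_SUSPICIOUS failed fetches
        handle_missing_dep("collect-abc123")
    handle_release("collect-abc123")
print("finished without raising; the real loop would continue forever")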

I believe resolving this requires introducing additional task states on the worker side (xref #4413; similar to released/forgotten on the scheduler), where the meta information of a task is only forgotten once the scheduler allows this to happen. Alternatively, the scheduler must keep track of these suspicious tasks, relieving the worker of the responsibility to track them.
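
For the first alternative, a purely illustrative sketch of what such worker-side states could look like; the state names are assumptions, modeled on the scheduler's released/forgotten split:

from enum import Enum, auto

class WorkerTaskStateName(Enum):
    # Existing-style lifecycle states, simplified.
    WAITING = auto()
    FETCHING = auto()
    MEMORY = auto()
    # Proposed split: a released task keeps its meta information
    # (including suspicious_count) until the scheduler confirms, and
    # only a forgotten task actually drops it.
    RELEASED = auto()
    FORGOTTEN = auto()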

cc @mrocklin @gforsyth

Comments

Matthew Rocklin: Thank you for the reproducible error @fjetter. I'm going to start prioritizing this more highly on my end; I'll take a longer look later this afternoon.

Matthew Rocklin: I'm curious, is there a reason why these need to be Nanny objects?

Florian Jetter: No, the Nanny is not necessary; it is a relic of my debugging sessions. When using a debugger it was often quite helpful, since I could pause individual workers and observe the system in a more natural state, and/or provoke timing-sensitive race conditions by pausing one of the processes. (Disclaimer: my brain almost exploded while doing so, but it was really helpful for tracking down some of the issues over the past weeks :)) The same goes for the timeout: a 4600-second timeout is otherwise not really sensible and should be removed.
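
Since both were debugging aids, a minimal sketch of the cleaned-up test signature (assuming the gen_cluster defaults from distributed.utils_test suffice; the body stays as in the Minimal Complete Verifiable Example above):

from distributed.utils_test import gen_cluster

@gen_cluster(client=True)  # default in-process Workers, default timeout
async def test_get_data_faulty_dep(c, s, a, b):
    ...  # body unchanged from the example in the issue description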