
setting distributed.scheduler.allowed-failures to 0 does not always work #6078

Closed
WillWang-MindBridge opened this issue Apr 6, 2022 · 10 comments

Comments

@WillWang-MindBridge

What happened:
Set the configuration distributed.scheduler.allowed-failures to 0 and trigger a worker restart by filling up its memory. Sometimes (you may need to run the sample code several times to reproduce it), Dask seems to ignore that config and retries the delayed function several times.

What you expected to happen:
When the property is set to 0, there should be no retries.

Minimal Complete Verifiable Example:

from time import sleep

import dask
import numpy as np
import pandas as pd
from dask import delayed
from distributed import Client, futures_of, LocalCluster


@delayed
def f1():
    print("running f1")
    sleep(5)
    df = pd.DataFrame(dict(row_id=np.zeros(10000000)))
    return df


def main():
    with dask.config.set({'distributed.scheduler.allowed-failures': 0, "distributed.logging.distributed": "DEBUG"}):
        with LocalCluster(n_workers=1, threads_per_worker=1, memory_limit="180MiB") as cluster, Client(cluster) as client:
            d = f1()
            future = futures_of(client.persist(d))[0]
            future.result()


if __name__ == "__main__":
    main()

Anything else we need to know?:
Logs showing that f1 is executed 3 times:

running f1
distributed.worker - WARNING - Worker is at 85% memory usage. Pausing worker.  Process memory: 153.12 MiB -- Worker memory limit: 180.00 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 153.12 MiB -- Worker memory limit: 180.00 MiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.scheduler - ERROR - Couldn't gather keys {'f1-433077ca-b354-4bc5-987a-53cfa64a0d0a': ['tcp://127.0.0.1:55016']} state: ['no-worker'] workers: ['tcp://127.0.0.1:55016']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:55016'], f1-433077ca-b354-4bc5-987a-53cfa64a0d0a
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'f1-433077ca-b354-4bc5-987a-53cfa64a0d0a': ('tcp://127.0.0.1:55016',)}
distributed.nanny - WARNING - Restarting worker
running f1
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.scheduler - ERROR - Couldn't gather keys {'f1-433077ca-b354-4bc5-987a-53cfa64a0d0a': ['tcp://127.0.0.1:55023']} state: ['no-worker'] workers: ['tcp://127.0.0.1:55023']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:55023'], f1-433077ca-b354-4bc5-987a-53cfa64a0d0a
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'f1-433077ca-b354-4bc5-987a-53cfa64a0d0a': ('tcp://127.0.0.1:55023',)}
distributed.nanny - WARNING - Restarting worker
running f1
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
  File "/Users/my_user/git/my_project/toy12.py", line 28, in <module>
    main()
  File "/Users/my_user/git/my_project/toy12.py", line 23, in main
    future.result()
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/client.py", line 235, in result
    result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/utils.py", line 310, in sync
    return sync(
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/utils.py", line 364, in sync
    raise exc.with_traceback(tb)
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
    result[0] = yield future
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/client.py", line 260, in _result
    result = await self.client._gather([self])
  File "/Users/my_user/virtual_env/my_project/lib/python3.8/site-packages/distributed/client.py", line 1811, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('f1-433077ca-b354-4bc5-987a-53cfa64a0d0a', <WorkerState 'tcp://127.0.0.1:55032', name: 0, status: closed, memory: 0, processing: 1>)

Process finished with exit code 1

Environment:

  • Dask version: 2021.12.0
  • Python version: Python 3.8.9
  • Operating System: MacOS bigsur 11.2.2
  • Install method (conda, pip, source): pip
Cluster Dump State:
@WillWang-MindBridge WillWang-MindBridge changed the title the behaviour of setting distributed.scheduler.allowed-failures to 0 does not always work setting distributed.scheduler.allowed-failures to 0 does not always work Apr 6, 2022
@WillWang-MindBridge
Author

I wrapped code similar to the above into a unit test. It works fine locally (the number of reruns is nondeterministic, but it finishes). But when it runs on CircleCI, it seems to get stuck retrying (way more retries) and ends with a different exception.

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f474f6a88b0>>, <Task finished name='Task-13' coro=<Worker.close() done, defined at /home/********/.local/lib/python3.8/site-packages/distributed/worker.py:1504> exception=CommClosedError('ConnectionPool not running. Status: Status.closed')>)
Traceback (most recent call last):
  File "/home/********/.local/lib/python3.8/site-packages/distributed/core.py", line 1055, in connect
    comm = await fut
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/********/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/********/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/********/.local/lib/python3.8/site-packages/distributed/worker.py", line 1528, in close
    await r.close_gracefully()
  File "/home/********/.local/lib/python3.8/site-packages/distributed/core.py", line 883, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/home/********/.local/lib/python3.8/site-packages/distributed/core.py", line 1066, in connect
    raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed

@fjetter
Member

fjetter commented Apr 7, 2022

It appears that your function finishes before the worker can be killed. The scheduler is notified of the finished function and tries to gather the result (this is why you see the "Couldn't gather keys" messages). Only once it is gathering does the worker die, and the scheduler needs to reschedule everything.

Since the task was already successful, the scheduler does not correlate the worker failure with the execution of the task but thinks something else must've killed it. If you change your example to

@delayed
def f1():
    print("running f1")
    df = pd.DataFrame(dict(row_id=np.zeros(10000000)))
    sleep(5)
    print("done running f1")
    return df

you should see it terminate deterministically.

The CommClosedError is a different issue. We're seeing similar failures on CI lately and are investigating.

@WillWang-MindBridge
Author

WillWang-MindBridge commented Apr 7, 2022

hi @fjetter, thanks for the help! I don't get the part where you say "your function finishes already before the worker can be killed". This line, df = pd.DataFrame(dict(row_id=np.zeros(10000000))), should trigger an exception immediately (maybe not, if the creation of the big array is asynchronous).

When an exception is thrown from a delayed function, the function "exits"; it is not "finishing" successfully. Why does it notify the scheduler that it has a result ready for gathering? Sending a success notice when an error actually happened sounds like a bug to me.

I see the change you made moves the df = pd.DataFrame(dict(row_id=np.zeros(10000000))) line farther from the return. I think the sleep() plays an important role in getting a deterministic result, which is itself ironic, because we know we should not rely on sleep().

@WillWang-MindBridge
Author

For the time being, the temporary workaround is to add something like sleep(10) to the end of all my delayed functions.
I'm just a little worried about the collection APIs: since there's no sleep() there, what if memory explodes near the return?
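The workaround above could be factored into a decorator. A minimal sketch, assuming a small trailing pause is enough; `with_trailing_sleep` is a made-up helper, not part of Dask, and in real use you would apply it underneath @delayed:

```python
import functools
from time import sleep

def with_trailing_sleep(seconds):
    # Hypothetical helper (not part of Dask): pause before returning so the
    # worker's memory monitor has a chance to act while the task is still
    # counted as running, rather than after it has "succeeded".
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            sleep(seconds)
            return result
        return wrapper
    return decorator

@with_trailing_sleep(0.01)
def compute():
    return 42

print(compute())  # 42, after a short trailing pause
```

The decorator keeps the pause out of each function body, but it is still a timing heuristic, not a fix.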

@WillWang-MindBridge
Author

More on the CI side: I pushed the workaround, but it just keeps restarting the worker constantly.

@fjetter
Member

fjetter commented Apr 7, 2022

will trigger an exception immediately (maybe not the case if the creation of the big array is asynchronous).

It does not trigger any exception for me but finishes successfully. The memory monitor then kicks in a bit later and kills the worker. The only way to ensure this raises directly is to allocate so much memory at once that the kernel kills your process immediately but from what I can see in your logs, this is clearly not the case.

@WillWang-MindBridge
Author

So I think you are right, @fjetter, and I was right that the creation of the big array is asynchronous. I changed it to

df = pd.DataFrame(dict(row_id=np.zeros(10000000000)))

but still no direct exception is thrown from inside f1().

I think this is still a valid scenario for Dask to handle more gracefully:

  1. how to prevent a function from returning before a related async error is handled
  2. if we cannot prevent the function from returning, could we be smarter or more skeptical when deciding a task's state (e.g., the fact that a function returns does not necessarily mean it succeeded; Dask could check the memory)?

And let me describe the problem more generally:

  1. when picking up a task, the scheduler estimates how much memory the input data will consume. If the total memory consumption on a worker would exceed a certain threshold, the task is not assigned to that worker
  2. however, the input data can be small enough for the task to be accepted by a worker, yet while the task runs, memory consumption can inflate further
  3. when memory consumption reaches the 95% (configurable) threshold but is not large enough to get a direct exception from the OS, the function returns successfully
  4. now 2 things happen in parallel:
    4.1 the memory monitor tries to kill the worker in one thread
    4.2 the "result gatherer" tries to grab the result of the "successful" function run in another thread

If 4.1 happens before 4.2, that's the case I'm seeing. Even if 4.2 happens before 4.1, I don't know how much further it could go.
The heuristic is that we probably should do more checking before gathering in 4.2
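The race in step 4 can be sketched with plain threads. This is a toy model of the timing, not Dask internals; the names `memory_monitor` and `result_gatherer` are made up, and the sleeps stand in for real scheduling delays:

```python
import threading
import time

state = {"worker_alive": True, "outcome": None}
lock = threading.Lock()

def memory_monitor():
    # 4.1: the monitor notices memory over budget and kills the worker.
    time.sleep(0.01)
    with lock:
        state["worker_alive"] = False

def result_gatherer():
    # 4.2: the gatherer tries to fetch the result of the "successful" task.
    time.sleep(0.05)
    with lock:
        state["outcome"] = "gathered" if state["worker_alive"] else "reschedule"

t1 = threading.Thread(target=memory_monitor)
t2 = threading.Thread(target=result_gatherer)
t1.start(); t2.start()
t1.join(); t2.join()
print(state["outcome"])  # "reschedule": 4.1 won the race, as in this issue
```

Flipping the two sleep durations models the other ordering, where the gather succeeds before the kill.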

@WillWang-MindBridge
Author

More on the CircleCI front. I realized from the logs that memory usage didn't rise to 95% while inside f1, but it got there after exiting the function, which caused the worker restart. I increased the size 100 times (I guess memory allocation for NumPy arrays differs between Linux on CircleCI and macOS) and now it fails in time. This is good news for me; in the meantime, it shows another example of how the memory monitoring/restart mechanism can be broken.

logs for the previously failed case

start f1
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 130.10 MiB -- Worker memory limit: 180.00 MiB
distributed.worker - WARNING - Worker is at 92% memory usage. Pausing worker.  Process memory: 165.98 MiB -- Worker memory limit: 180.00 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 165.98 MiB -- Worker memory limit: 180.00 MiB
[the same "Unmanaged memory use is high" warning repeats dozens of times as unmanaged memory climbs from 165.98 MiB to 166.75 MiB]
end f1
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.scheduler - ERROR - Couldn't gather keys {'f1-9596b906-beca-4c03-ae14-ecced87213b4': ['tcp://127.0.0.1:36691']} state: ['memory'] workers: ['tcp://127.0.0.1:36691']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:36691'], f1-9596b906-beca-4c03-ae14-ecced87213b4
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'f1-9596b906-beca-4c03-ae14-ecced87213b4': ('tcp://127.0.0.1:36691',)}
distributed.nanny - WARNING - Restarting worker

@gjoseph92
Collaborator

how to prevent a function from returning before a related async error is handled

This sounds like another one that might be solved by #6177. I believe with an OS memory limit, trying to allocate the large array would just fail.
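The effect of an OS memory limit can be illustrated with a hedged sketch using the stdlib resource module. This is a toy model, not Dask's implementation; `try_alloc_under_limit` is a made-up helper, and RLIMIT_AS enforcement is Linux behavior (macOS largely ignores it):

```python
import multiprocessing as mp
import resource

def try_alloc_under_limit(limit_bytes, alloc_bytes, queue):
    # Cap this process's virtual address space, then attempt one big
    # allocation. With a hard OS limit, the allocation fails immediately
    # with MemoryError instead of being noticed later by a memory monitor.
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
    try:
        buf = bytearray(alloc_bytes)
        queue.put("allocated")
    except MemoryError:
        queue.put("memory-error")

if __name__ == "__main__":
    q = mp.Queue()
    # 1 GiB address-space cap, 2 GiB allocation attempt (run in a child
    # process so the cap does not affect the parent interpreter).
    p = mp.Process(target=try_alloc_under_limit, args=(1 * 2**30, 2 * 2**30, q))
    p.start()
    p.join()
    print(q.get())
```

With a limit like this in place, the task would raise inside the function, so the scheduler would see a failed task rather than a "successful" one on a dying worker.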

@jacobtomlinson
Member

Given that there has been no follow-up for a few years, I think it's safe to assume #6177 fixed this.

@jacobtomlinson jacobtomlinson marked this as a duplicate of #6074 Jan 21, 2025