Incorrect handling of MemoryError #2297

Open
louisabraham opened this issue Oct 7, 2018 · 4 comments


louisabraham commented Oct 7, 2018

Following the discussion in dask/dask-jobqueue#169, I produced an SSCCE (short, self-contained, correct example). Thank you @guillaumeeb and @mrocklin!

Description

The issue is that a MemoryError cannot be observed by the client (except when raised explicitly) because the worker is restarted before the error can propagate.

Motivations

It seems trivial, but there are situations where you want to catch those errors.
The bug originates from EpistasisLab/tpot#779, where an evolutionary algorithm tests scikit-learn pipelines. Some pipelines produce a MemoryError (for example, PolynomialFeatures applied to a large number of columns).

Code

Note that the same behaviour occurs with the delayed interface (a sketch follows the output below).

from dask.distributed import Client

import numpy as np


def f(n):
    a = np.arange(n)
    return sum(a)


def errf(n):
    raise MemoryError


def sep():
    print()
    print('-' * 42)
    print()


if __name__ == '__main__':
    client = Client(ncores=1, memory_limit='200M')

    sep()

    # Case 1: small computation, succeeds normally.
    try:
        r = client.submit(f, 42)
        print(r.result())
    except MemoryError as e:
        print('Just triggered a MemoryError', e)

    sep()

    # Case 2: explicitly raised MemoryError, propagates back to the client.
    try:
        r = client.submit(errf, 42)
        print(r.result())
    except MemoryError as e:
        print('Just triggered a MemoryError', e)

    sep()

    # Case 3: real memory exhaustion, the worker is killed and restarted
    # instead of the task raising a MemoryError.
    try:
        r = client.submit(f, 10**8)
        print(r.result())
    except MemoryError as e:
        print('Just triggered a MemoryError', e)

Output


------------------------------------------

861

------------------------------------------

distributed.worker - WARNING -  Compute Failed
Function:  errf
args:      (42)
kwargs:    {}
Exception: MemoryError()

distributed.scheduler - ERROR - error from worker tcp://127.0.0.1:57166:
Just triggered a MemoryError

------------------------------------------

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 22433 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 22432 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 22430 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 22431 was killed by unknown signal
Traceback (most recent call last):
  File "dask_version.py", line 45, in <module>
    print(r.result())
  File "/Users/louisabraham/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 171, in result
    six.reraise(*result)
  File "/Users/louisabraham/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
distributed.scheduler.KilledWorker: ('f-53758efeacb65d99d2d76fb863336f1b', 'tcp://127.0.0.1:57177')
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-7, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-6, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-5, started daemon)>
distributed.nanny - WARNING - Worker process 22436 was killed by unknown signal
distributed.nanny - WARNING - Worker process 22435 was killed by unknown signal
distributed.nanny - WARNING - Worker process 22434 was killed by unknown signal

Also, the "killed by unknown signal" warnings are odd: distributed.nanny should know what killed the workers, since it killed them itself.
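
For completeness, here is a sketch of the delayed variant mentioned above (same f and client settings as the reproduction); it ends in the same KilledWorker instead of a MemoryError:

import numpy as np
from dask import delayed
from dask.distributed import Client


@delayed
def f(n):
    a = np.arange(n)
    return sum(a)


if __name__ == '__main__':
    client = Client(ncores=1, memory_limit='200M')
    try:
        print(f(10**8).compute())
    except MemoryError as e:
        print('Just triggered a MemoryError', e)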

Discussion

Since I only started looking at the internals of Dask yesterday, my opinion is very subjective, but I see two options:

  1. When a task's memory usage is too high and it causes more than n worker restarts, raise a MemoryError for that task.
  2. Wrap the worker's allocations (or set OS-level limits) so that it believes it only has 95% of its memory budget, and a MemoryError is produced naturally.

The default behavior is to try executing a task 3 times before giving up. If a worker hits a MemoryError because of a temporary problem on its node, Dask should still retry the task.

I don't know whether option 1 is possible, but if it is, it offers the most flexibility: it would just be a variation on the default behavior, raising a MemoryError when a task has failed 3 times because the worker it ran on was restarted. That threshold could be made configurable specifically for failures caused by worker restarts.
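
As a hedged illustration of what option 1 would look like from the client's point of view, one can catch the KilledWorker that the scheduler currently raises (see the traceback above) and translate it into a MemoryError. A minimal sketch, reusing f and the client settings from the reproduction above:

from dask.distributed import Client
from distributed.scheduler import KilledWorker

import numpy as np


def f(n):
    a = np.arange(n)
    return sum(a)


if __name__ == '__main__':
    client = Client(ncores=1, memory_limit='200M')
    try:
        print(client.submit(f, 10**8).result())
    except KilledWorker as exc:
        # Assume that a task whose worker was killed repeatedly ran out of
        # memory; this loses the distinction with other causes of death.
        raise MemoryError('task killed after repeated worker restarts') from exc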

Option 2 could be system-dependent and would probably need a mechanism to retry the tasks. That mechanism would be simple to handle from application code, but I really think it should be provided by Dask as well.
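
A rough, Unix-only illustration of the idea behind option 2, done from application code with the standard resource module (the limit value is hypothetical and would need tuning, since RLIMIT_AS counts virtual rather than resident memory):

import resource

from dask.distributed import Client


def cap_address_space(limit_bytes=10**9):
    # Cap the worker process's virtual address space; allocations beyond
    # the limit then raise MemoryError inside the task instead of growing
    # until the nanny kills the worker. Unix-only, and the number needs
    # tuning because RLIMIT_AS counts virtual memory.
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))


if __name__ == '__main__':
    client = Client(ncores=1, memory_limit='200M')
    client.run(cap_address_space)  # apply the cap in every worker process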


mrocklin commented Oct 8, 2018

In principle replacing KilledWorker by MemoryError in these situations seems like an improvement. My only concern here is that I don't know how to do this easily in a way that will be easy to maintain in the future. Do you have any ideas or thoughts on how to accomplish this suggestion?


louisabraham commented Oct 8, 2018

In principle replacing KilledWorker by MemoryError in these situations seems like an improvement

I don't see how you would do it. The MemoryError should be raised by the function that is being executed, but the KilledWorker is raised in the scheduler / nanny.


darindf commented Oct 22, 2018

Looking this over, can one set the value

terminate: 0.95 # fraction at which we terminate the worker

to None, so that the nanny never terminates the worker process in low-memory situations? In my case tasks get lost: the client is not notified that the task is no longer running, the task stays assigned to that particular worker and is not restarted, and if the worker were restarted it would be stuck in an infinite loop of running out of memory and restarting.

Note this could happen anyway, as there is an opportunity for the worker process to run out of memory without the nanny noticing; I seem to have encountered this on occasion as well.
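
For reference, a hedged sketch of how one might try to disable that threshold, assuming the configuration key distributed.worker.memory.terminate (setting it to False is how the check is disabled in later releases; whether None behaves the same way would need to be verified):

import dask
from dask.distributed import Client

# Disable the nanny's terminate check before the cluster is created.
# The key name and the use of False (rather than None) are assumptions here.
dask.config.set({'distributed.worker.memory.terminate': False})

client = Client(ncores=1, memory_limit='200M')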

crusaderky commented

I think this would require the nanny to communicate to the scheduler (AFAIK it currently doesn't, at all) that it just killed the worker, and this information would have to reach the scheduler before it has a chance to raise KilledWorker.

Use case 1

Worker dies abruptly. There is a Nanny.
N->S: I just killed the worker because it passed the terminate threshold
S: raise MemoryError

Use case 2

Worker dies abruptly. There is a Nanny.
N->S: the worker just died on its own with this exit code
S: raise KilledWorker including the exit code

Use case 3

Worker dies abruptly. There is a Nanny.
S: timeout waiting for Nanny to speak up expires
S: log that the Nanny looks dead, so probably the whole host or network died
S: raise KilledWorker with no exit code

Use case 4

Worker dies abruptly. There is no Nanny.
S: raise KilledWorker with no exit code
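
A purely illustrative sketch of the branching those four cases imply on the scheduler side (none of these names exist in distributed; the only point is the MemoryError vs. KilledWorker choice):

def exception_for_dead_worker(nanny_report):
    """nanny_report is None when no nanny information arrived in time
    (use cases 3 and 4); otherwise it is whatever the nanny sent."""
    if nanny_report is None:
        return Exception('KilledWorker, no exit code available')
    if nanny_report.get('reason') == 'memory-terminate':
        # Use case 1: the nanny killed the worker over its memory budget.
        return MemoryError('worker exceeded its memory budget')
    # Use case 2: the worker died on its own; surface the exit code.
    return Exception('KilledWorker, exit code %s' % nanny_report.get('exit_code'))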
