Avoid KeyError on worker restart #138

Merged (2 commits) on Oct 2, 2018

Conversation

guillaumeeb
Member

Closes #117.

Should help diagnose problem in #122.

I chose not to raise any error, even if the job_id is not found in finished_jobs of the Scheduler plugin. A user may want to manually add jobs and workers to the cluster, even though this is not good practice.
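
To make the idea concrete, here is a minimal sketch of the kind of defensive lookup this PR is about, not the actual diff; the dictionary names, the naming convention for extracting the job_id, and the class itself are assumptions.

from distributed.diagnostics.plugin import SchedulerPlugin

class JobTrackingPlugin(SchedulerPlugin):
    """Sketch of a scheduler plugin that tolerates restarted workers."""

    def __init__(self):
        self.running_jobs = {}   # job_id -> {worker_address: worker_name}
        self.finished_jobs = {}  # job_id -> {worker_address: worker_name}

    def add_worker(self, scheduler, worker=None, name=None, **kwargs):
        w = scheduler.workers[worker]
        # Hypothetical convention: the worker name embeds the job_id.
        job_id = str(w.name).split("--")[-1]
        if job_id in self.running_jobs:
            job = self.running_jobs[job_id]
        elif job_id in self.finished_jobs:
            # The previous worker for this job died (e.g. out of memory) and
            # was restarted by the nanny: move the job back to running_jobs.
            job = self.finished_jobs.pop(job_id)
            self.running_jobs[job_id] = job
        else:
            # Unknown job, e.g. a worker added to the cluster by hand:
            # track it instead of raising a KeyError.
            job = self.running_jobs.setdefault(job_id, {})
        job[worker] = w.name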

@guillaumeeb changed the title from "Search for job_id in finished_jobs, or even create new one" to "Avoid KeyError on worker restart" on Aug 24, 2018
@guillaumeeb
Member Author

@raybellwaves do you still have your memory use case somewhere?

@wgustafson would you like to give that a go?

I don't see how to do a simple unit test for now.
But I could probably build a dask workflow that makes workers die from out-of-memory errors on my system if needed.
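
For reference, a minimal sketch of what such a workflow could look like; the cluster class, resource sizes, and scaling are assumptions, and the final gather is expected to fail once workers have been killed a few times.

import numpy as np
from dask.distributed import Client
from dask_jobqueue import PBSCluster  # any dask-jobqueue cluster class would do

# Deliberately small memory budget so the nanny kills and restarts workers.
cluster = PBSCluster(cores=1, memory="2GB")
cluster.scale(2)
client = Client(cluster)

def allocate(i):
    # Each task tries to hold far more than the per-worker memory limit
    # (~20 GB of float64), which should trigger an out-of-memory restart.
    return np.ones((50_000, 50_000))

futures = client.map(allocate, range(4))
client.gather(futures)  # eventually raises KilledWorker after repeated restarts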

@wgustafson
Contributor

I've been traveling, so apologies for the slow response this last week. I have a script that was throwing KeyErrors, but I need to clean things up and see if it still reacts the same way. I will try to do this early this coming week.

@raybellwaves
Member

raybellwaves commented Aug 25, 2018

Yeah my memory issue use case is https://gist.github.com/raybellwaves/f28777bf840cc40f4c76d88beca528c5. I've added the output to the file.

What's the best way to test your branch (this version)? I'm not familiar with cloning branches.

@guillaumeeb
Member Author

No worries @wgustafson! I was just browsing issues and updating things. I understand perfectly that you've got other things to do, no hurry.

@wgustafson
Contributor

Taking a look at this again this morning. My reproducible KeyError happens with the code I posted in #122 on August 14, 2018, so I just put the log output into that issue to keep things together. In my case, I can be certain the problem is not from a lack of memory, so I don't know if my problem is the same as @raybellwaves' or not.

@lesteve
Member

lesteve commented Aug 29, 2018

What's the best way to test your branch (this version)? I'm not familiar with cloning branches.

@raybellwaves here is something that should work for checking out this PR code:

# creates a branch pr/138 with the code from PR #138
# Run this from your dask-jobqueue git checkout
git fetch -fu upstream refs/pull/138/head:pr/138 && git checkout pr/138

Personally I have a git alias for this (there are other options for this as well if you google for it). Here is an excerpt from my .gitconfig:

[alias]
        pr  = "!f() { git checkout master && git fetch -fu ${2:-upstream} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

Basically I can do this to get the code of this PR (it also works for getting the latest version of the PR):

git pr 138

@raybellwaves
Member

@lesteve thanks. The alias looks very handy.

@guillaumeeb sorry i'll be on vacation until mid-September and won't have chance to try this until then.

@raybellwaves
Member

raybellwaves commented Sep 18, 2018

I can confirm the KeyError has gone away. It now says

Job 17752343 has got a restarting worker tcp://10.10.103.1:34877
Job 17752337 has got a restarting worker tcp://10.10.103.9:58938
...

and looking at the dask dashboard I can see some of the tasks erred:
[screenshot of the dask dashboard from Sep 18, 2018, showing several erred tasks]

My output is here: https://gist.github.com/raybellwaves/3033fe1679510ea4d5d64e0658225fc2

@guillaumeeb
Member Author

Thanks @raybellwaves. Looking at your code, since you're trying to persist the whole dataframe in memory, I assume you got some memory errors on your workers, which seems to be confirmed by your dashboard screenshot and the orange bars it shows.

If I could get some code reviews here, that would be nice.

@mrocklin
Member

If I could get some code reviews here, that would be nice.

According to git blame it looks like @jhamman probably has experience with this part of the codebase. It might be worth pinging him directly.

@raybellwaves
Member

Thanks @raybellwaves. Looking at your code, since you're trying to persist the whole dataframe in memory, I assume you got some memory errors on your workers, which seems to be confirmed by your dashboard screenshot and the orange bars it shows.

Yep. A one-terabyte dataframe is a bit too much for our HPC, as it's often hard to get multiple jobs running. The output Job N has got a restarting worker is an improvement on the KeyError.

Is there a way to surface information about memory issues in the output? If I didn't have the dashboard up and, as a novice user, just saw Job N has got a restarting worker, I wouldn't know what that means. I wonder if it could say something like Warning: potential memory issue. Job N has got a restarting worker? This may be an upstream issue, though, in dask.distributed or dask itself.
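
One way to dig into this without the dashboard, sketched below under the assumption of a connected Client named client and a distributed version that provides Client.get_worker_logs, is to pull the worker logs through the scheduler and look for the nanny's memory-limit messages.

# Assumes an existing dask.distributed Client named `client`.
logs = client.get_worker_logs()  # dict: worker address -> list of (level, message)
for worker, lines in logs.items():
    for level, message in lines:
        if "memory" in message.lower():
            print(worker, level, message)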

@guillaumeeb
Member Author

I wonder if it could say something like Warning: potential memory issue. Job N has got a restarting worker? This may be an upstream issue, though, in dask.distributed or dask itself.

This message currently lives in dask-jobqueue, so we can improve it there and give a hint that this may be due to a memory problem.

At the same time, there may be something to do upstream to get information about why the worker was restarted...
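
For example, the restart message could be extended with a hint along these lines; a sketch only, where the helper name is hypothetical and the job id and worker address are whatever the surrounding plugin code has in scope.

import logging

logger = logging.getLogger(__name__)

def warn_restarting_worker(job_id, worker_address):
    # Same message as today, plus a hint about the most likely cause.
    logger.warning(
        "Job %s has got a restarting worker %s. A common cause is the "
        "worker being killed for exceeding its memory limit; check the "
        "worker/nanny logs or the dashboard's memory plot.",
        job_id, worker_address,
    )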
