KeyError in plugin.add_worker #117

Closed
raybellwaves opened this issue Aug 6, 2018 · 7 comments

I finally got to kick the tires of the LSFCluster today.

I created https://github.com/raybellwaves/dask-jobqueue_test_lsf/blob/master/dask-jobqueue_test_lsf.ipynb which is adapted from https://www.youtube.com/watch?v=nH_AQo8WdKw

The general queue was heavily loaded today, so even though I set off 50 workers, only ~2/3 were running (the others remained pending).
[Screenshot: 2018-08-06, 11:36:38 AM]

When doing `df = df.persist()`, an error was raised:

distributed.scheduler - ERROR - '17061795'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 63, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '17061795'
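
For anyone trying to reproduce this without an LSF queue, here is a minimal sketch of how the statement on `core.py` line 63 can raise this `KeyError`. It assumes that more than one worker registers under the same job ID, which is only a guess and is not confirmed by anything in this report. `JobTrackingSketch`, the `strict` flag, and the worker names are hypothetical and are not part of dask-jobqueue; only the two statements marked in the comments are taken from the traceback.

```python
from collections import OrderedDict


class JobTrackingSketch:
    """Hypothetical stand-in for the plugin; not the dask-jobqueue class."""

    def __init__(self):
        self.pending_jobs = OrderedDict()  # job_id -> workers, job not yet started
        self.running_jobs = OrderedDict()  # job_id -> workers, job started

    def add_worker(self, job_id, name, strict=True):
        if strict:
            # Statement from core.py line 63 in the traceback: raises
            # KeyError(job_id) when job_id is no longer in pending_jobs.
            self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
        elif job_id in self.pending_jobs:
            self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
        else:
            # Assumed tolerant variant: keep the existing entry if the job
            # was already moved to running_jobs by an earlier worker.
            self.running_jobs.setdefault(job_id, OrderedDict())
        self.running_jobs[job_id][name] = name

    def remove_worker(self, job_id, name):
        # Statement from core.py line 76 in the traceback: raises
        # KeyError(job_id) when job_id never reached running_jobs.
        del self.running_jobs[job_id][name]


def second_registration_error():
    """Return the key of the KeyError raised by a second registration."""
    plugin = JobTrackingSketch()
    plugin.pending_jobs["17061795"] = OrderedDict()
    plugin.add_worker("17061795", "worker-a")  # moves the job to running_jobs
    try:
        plugin.add_worker("17061795", "worker-b")  # job is no longer pending
    except KeyError as err:
        return err.args[0]
    return None
```

Calling `second_registration_error()` returns the job ID that the second `pop` could not find. Passing `strict=False` to `add_worker` shows one way the lookup could be made tolerant, though whether that is the right fix depends on what is actually happening on the cluster.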

This error appeared multiple times:

```
distributed.scheduler - ERROR - '17061795'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 63, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '17061795'
distributed.scheduler - ERROR - '17061796'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 63, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '17061796'
distributed.scheduler - ERROR - '17061795'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 76, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '17061795'
distributed.scheduler - ERROR - '17061795'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 63, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '17061795'
distributed.scheduler - ERROR - '17061797'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 63, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '17061797'
distributed.scheduler - ERROR - '17061796'
Traceback (most recent call last):
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/nethome/rxb826/local/bin/miniconda3/envs/djq_lsf/lib/python3.6/site-packages/dask_jobqueue/core.py", line 76, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '17061796'
```

The same `add_worker` and `remove_worker` tracebacks then repeat many more times for job IDs '17061795', '17061796' and '17061797'.

screen shot 2018-08-06 at 11 45 06 am

You may be more familiar with this @jhamman

@guillaumeeb
Member

It looks like you've got a memory issue. Calling df.persist() tries to load your generated dataset into memory, but apparently there is not enough memory per worker. I assume some workers die and try to restart, which would explain why the same job ids keep appearing in your log.

We should probably handle this case better, though: do some error catching, or add an if somewhere around https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/core.py#L63.

@jhamman
Member

jhamman commented Aug 7, 2018

I agree. There is a good chance your persist call is running out of memory. However, the KeyError worries me.

If line 63 raises a KeyError, that means we were not tracking that key as a pending job. This sounds like a bug. It could be in the submit step's parsing or in the parsing of the worker name. This would probably be worth debugging.

def add_worker(self, scheduler, worker=None, name=None, **kwargs):
    ''' Run when a new worker enters the cluster'''
    logger.debug("adding worker %s" % worker)
    w = scheduler.workers[worker]
    job_id = _job_id_from_worker_name(w.name)
    logger.debug("job id for new worker: %s" % job_id)
    self.all_workers[worker] = (w.name, job_id)
    # if this is the first worker for this job, move job to running
    if job_id not in self.running_jobs:
        logger.debug("this is a new job")
        self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
    # add worker to dict of workers in this job
    self.running_jobs[job_id][w.name] = w

@raybellwaves
Member Author

raybellwaves commented Aug 7, 2018

Yes, I think it is a memory issue as well. I got it to work by making my test smaller (one year instead of ten) and making minor modifications such as adding the walltime and submitting to our bigmem queue, which can have 250GB per job: https://github.com/raybellwaves/dask-jobqueue_test_lsf/blob/master/dask-jobqueue_test_lsf.ipynb

I think it's going to be quite common for me to only have a few jobs running (the pegasus queue load varies week-to-week) and lots pending so I can see this popping up quite a bit for us at UM.

@guillaumeeb
Member

> If line 63 raises a KeyError, that means we were not tracking that key as a pending job. This sounds like a bug. It could be in the submit step's parsing or in the parsing of the worker name. This would probably be worth debugging.

Not sure if it is possible, but my assumption was that:

  1. Job id is in pending state, worker starts for the first time, job id leaves pending state.
  2. Worker runs out of memory and is shut down, but the callback remove_worker is still called.
  3. Worker is restarted, but there is now nothing in pending state.
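
The three steps above can be sketched with plain dictionaries. This is only a stand-in for the bookkeeping, not the dask_jobqueue plugin itself: the `finished_jobs` dict, the worker name `worker-a`, and the rule that a job leaves `running_jobs` once its last worker is removed are assumptions made for the illustration.

```python
# Minimal stand-in for the job bookkeeping; not the dask_jobqueue plugin itself.
pending_jobs = {"17061795": {}}   # job submitted, no worker has connected yet
running_jobs = {}
finished_jobs = {}                # assumption: emptied jobs are parked here


def add_worker(job_id, worker_name):
    # Mirrors the quoted add_worker: the first worker moves the job to running.
    if job_id not in running_jobs:
        running_jobs[job_id] = pending_jobs.pop(job_id)
    running_jobs[job_id][worker_name] = object()


def remove_worker(job_id, worker_name):
    del running_jobs[job_id][worker_name]
    # Assumption: once the last worker is gone, the job leaves running_jobs.
    if not running_jobs[job_id]:
        finished_jobs[job_id] = running_jobs.pop(job_id)


add_worker("17061795", "worker-a")       # step 1: pending -> running
remove_worker("17061795", "worker-a")    # step 2: worker killed, callback runs

try:
    add_worker("17061795", "worker-a")   # step 3: restart, nothing is pending
    error = None
except KeyError as exc:
    error = exc

print(type(error).__name__, error.args)
```

Under these assumptions the third call fails on `pending_jobs.pop(job_id)`, which is the same line that appears in the tracebacks.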

@jhamman
Member

jhamman commented Aug 8, 2018

okay, that makes me think we should add a restart method to the plugin class.
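
As a rough sketch of what tolerating restarts could look like, the class below folds the restart case into `add_worker` and `remove_worker` instead of adding a separate restart method, so it is one possible reading of the idea rather than the change proposed here. `JobTracker`, `submit`, and the worker name are hypothetical names for the illustration, and the class is a simplified stand-in rather than the plugin's actual code.

```python
class JobTracker:
    """Hypothetical stand-in for the plugin's bookkeeping (not dask_jobqueue code)."""

    def __init__(self):
        self.pending_jobs = {}
        self.running_jobs = {}
        self.finished_jobs = {}

    def submit(self, job_id):
        # A freshly submitted job has no workers yet.
        self.pending_jobs[job_id] = {}

    def add_worker(self, job_id, worker_name):
        if job_id not in self.running_jobs:
            if job_id in self.pending_jobs:
                workers = self.pending_jobs.pop(job_id)
            else:
                # Restart case: the job already ran once, or was never tracked.
                workers = self.finished_jobs.pop(job_id, {})
            self.running_jobs[job_id] = workers
        self.running_jobs[job_id][worker_name] = True

    def remove_worker(self, job_id, worker_name):
        workers = self.running_jobs.get(job_id)
        if workers is None:
            return  # unknown job: nothing to clean up
        workers.pop(worker_name, None)
        if not workers:
            # Last worker gone: move the job out of running_jobs.
            self.finished_jobs[job_id] = self.running_jobs.pop(job_id)


tracker = JobTracker()
tracker.submit("17061795")
tracker.add_worker("17061795", "worker-a")
tracker.remove_worker("17061795", "worker-a")
tracker.add_worker("17061795", "worker-a")   # restart no longer raises
```

The trade-off is that a worker whose job id was never tracked is accepted silently, which could hide a real parsing bug, so logging a warning in that branch would probably be wise.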

@raybellwaves
Member Author

raybellwaves commented Aug 8, 2018

FYI, an automatic e-mail came through today regarding the memory limit:

subject: Process killed - Compute node exceeded memory limit

rxb826: 18 process killing(s) noticed on pegasus2
Aug 7 15:50:40: python (mem=56452 MB, pid=21564) on node n087
Aug 7 15:50:42: python (mem=62749820kB, pid=32598) on node n091
Aug 7 15:50:43: python (mem=61939756kB, pid=11739) on node n086
Aug 7 15:50:44: python (mem=61939756kB, pid=11881) on node n086
Aug 7 15:50:43: python (mem=61737972kB, pid=2542) on node n094
Aug 7 15:50:45: python (mem=61939756kB, pid=11882) on node n086
Aug 7 15:50:46: python (mem=58244 MB, pid=29041) on node n090
Aug 7 15:50:46: python (mem=61939756kB, pid=11883) on node n086
Aug 7 15:51:22: python (mem=59914760kB, pid=21784) on node n087
Aug 7 15:51:23: python (mem=59914760kB, pid=21805) on node n087
Aug 7 15:51:24: python (mem=59914760kB, pid=21807) on node n087
Aug 7 15:51:26: python (mem=59914760kB, pid=21808) on node n087
Aug 7 15:51:26: python (mem=62716312kB, pid=1088) on node n091
Aug 7 15:51:27: python (mem=59914760kB, pid=21810) on node n087
Aug 7 15:51:27: python (mem=58595 MB, pid=2787) on node n094
Aug 7 15:51:28: python (mem=58739 MB, pid=11965) on node n086
Aug 7 15:51:29: python (mem=58263 MB, pid=29262) on node n090
Aug 7 15:51:30: python (mem=61669952kB, pid=29262) on node n090
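
The e-mail mixes MB and kB, which makes the lines hard to compare by eye. Here is a small sketch for putting them on one scale, using three lines copied from the list above. The regular expression is written against this e-mail's format only, and the factor of 1024 kB per MB is an assumption, since the notice does not say which convention it uses.

```python
import re
from collections import Counter

# Three lines copied from the e-mail above.
lines = [
    "Aug 7 15:50:40: python (mem=56452 MB, pid=21564) on node n087",
    "Aug 7 15:50:42: python (mem=62749820kB, pid=32598) on node n091",
    "Aug 7 15:50:43: python (mem=61939756kB, pid=11739) on node n086",
]

# Captures the memory value, its unit, the pid, and the node name.
pattern = re.compile(r"mem=(\d+)\s*(MB|kB), pid=(\d+)\) on node (\w+)")


def parse(line):
    value, unit, pid, node = pattern.search(line).groups()
    # Assumption: 1 MB = 1024 kB.
    mem_mb = int(value) if unit == "MB" else int(value) / 1024
    return node, int(pid), mem_mb


records = [parse(line) for line in lines]
kills_per_node = Counter(node for node, _, _ in records)

for node, pid, mem_mb in records:
    print(f"{node} pid={pid} mem={mem_mb:.0f} MB")
```

On that scale the killed processes in these three lines were each holding somewhere between about 56,000 and 61,000 MB.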

@raybellwaves
Member Author

Thanks!
