KeyError and Worker already exists #169

@louisabraham

I'm trying to set up dask with tpot.

My code looks like this:

from dask_jobqueue import LSFCluster
cluster = LSFCluster(cores=1, memory='3GB',
                     job_extra=['-R rusage[mem=2048,scratch=8000]'],
                     local_directory='$TMPDIR',
                     walltime='12:00')

from dask.distributed import Client
client = Client(cluster)
cluster.scale(10)

from tpot import TPOTRegressor

reg = TPOTRegressor(max_time_mins=30, generations=20, population_size=96,
                    cv=5,
                    scoring='r2',
                    memory='auto', random_state=42, verbosity=10, use_dask=True)
reg.fit(X, y)

and I keep getting these errors:

distributed.scheduler - ERROR - '74905774'
Traceback (most recent call last):
File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1306, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/dask_jobqueue/core.py", line 62, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '74905774'

distributed.utils - ERROR - Worker already exists tcp://10.205.103.50:35780
Traceback (most recent call last):
File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/utils.py", line 648, in log_errors
    yield
File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1261, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://10.205.103.50:35780
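
To make the first traceback easier to reason about, here is a minimal sketch in plain Python of how that `KeyError` can arise. `JobTracker` and `add_worker_tolerant` are hypothetical names made up for illustration; only the statement `self.running_jobs[job_id] = self.pending_jobs.pop(job_id)` is taken from the traceback. The sketch assumes the same job ID is registered twice, which is one possible cause and not a confirmed diagnosis.

```python
class JobTracker:
    """Hypothetical stand-in for the job bookkeeping seen in the traceback."""

    def __init__(self):
        self.pending_jobs = {}   # job_id -> {worker address: True}
        self.running_jobs = {}   # job_id -> {worker address: True}

    def submit(self, job_id):
        # A submitted job waits in pending_jobs until a worker reports in.
        self.pending_jobs[job_id] = {}

    def add_worker(self, job_id, worker):
        # Same statement as the failing line in the traceback: it raises
        # KeyError when job_id is no longer (or never was) in pending_jobs.
        self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
        self.running_jobs[job_id][worker] = True

    def add_worker_tolerant(self, job_id, worker):
        # Illustrative variant (not the library's fix): only move the job
        # when it is still pending, then record the worker either way.
        if job_id in self.pending_jobs:
            self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
        self.running_jobs.setdefault(job_id, {})[worker] = True


def second_registration_fails():
    """Register the same job twice and return the missing key, if any."""
    tracker = JobTracker()
    tracker.submit("74905774")
    tracker.add_worker("74905774", "tcp://10.205.103.50:35780")
    try:
        tracker.add_worker("74905774", "tcp://10.205.103.50:35780")
    except KeyError as exc:
        return exc.args[0]
    return None
```

In this toy model the first registration moves the job from pending to running, so a second registration for the same job ID finds nothing to pop. Whether that is what happens on the real cluster would need to be checked against the scheduler logs.
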

I think the problem might be in LSFCluster: it puts a lot of workers in cluster.finished_jobs even though they are still running according to both bjobs and the dask.distributed web interface.
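
The mismatch described above can be checked with a small helper. This is only a sketch: `misfiled_jobs` is a hypothetical function, and it assumes that `cluster.finished_jobs` is keyed by job ID (as `pending_jobs` appears to be in the traceback) and that the list of live job IDs has been collected separately, for example by reading the output of bjobs by hand.

```python
def misfiled_jobs(finished_jobs, live_job_ids):
    """Return job IDs recorded as finished that are still reported as live.

    finished_jobs: mapping (or iterable) of job IDs, assumed to look like
                   cluster.finished_jobs.
    live_job_ids:  iterable of job IDs the batch system still lists.
    """
    # Compare as strings so that 74905774 and '74905774' count as the same job.
    live = {str(job_id) for job_id in live_job_ids}
    return sorted(str(job_id) for job_id in finished_jobs if str(job_id) in live)


# Example with made-up bookkeeping: one job is marked finished but still live.
example = misfiled_jobs(
    {"74905774": {}, "74905775": {}},
    [74905774, 74905780],
)
```

A non-empty result would support the suspicion that jobs are being moved to the finished set too early.
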
