Closed
Description
I'm trying to set up dask with tpot.
My code looks like this:
from dask.distributed import Client
from dask_jobqueue import LSFCluster
from tpot import TPOTRegressor

# One core and 3 GB per LSF job, with scratch space requested through -R.
cluster = LSFCluster(cores=1, memory='3GB', job_extra=['-R rusage[mem=2048,scratch=8000]'],
                     local_directory='$TMPDIR',
                     walltime='12:00')
client = Client(cluster)
cluster.scale(10)

# X and y are the training features and targets, loaded elsewhere.
reg = TPOTRegressor(max_time_mins=30, generations=20, population_size=96,
                    cv=5,
                    scoring='r2',
                    memory='auto', random_state=42, verbosity=10, use_dask=True)
reg.fit(X, y)
and I keep getting these errors:
distributed.scheduler - ERROR - '74905774'
Traceback (most recent call last):
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1306, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/dask_jobqueue/core.py", line 62, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '74905774'
distributed.utils - ERROR - Worker already exists tcp://10.205.103.50:35780
Traceback (most recent call last):
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/utils.py", line 648, in log_errors
    yield
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1261, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://10.205.103.50:35780
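The first traceback boils down to a `dict.pop` on a key that is no longer in `pending_jobs`. The sketch below reproduces only that pattern and is not dask_jobqueue's code: `JobTracker` and its methods are hypothetical stand-ins, the second address is made up, and a repeated registration for the same job id is just one way the key could go missing, which may or may not be what happens here.

```python
class JobTracker:
    """Hypothetical stand-in for a plugin that tracks jobs by id."""

    def __init__(self):
        self.pending_jobs = {}
        self.running_jobs = {}

    def submit(self, job_id):
        self.pending_jobs[job_id] = {"workers": []}

    def add_worker_strict(self, job_id, address):
        # Same shape as the failing line: pop() without a default raises
        # KeyError when job_id is not (or no longer) in pending_jobs.
        self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
        self.running_jobs[job_id]["workers"].append(address)

    def add_worker_tolerant(self, job_id, address):
        # Tolerant variant: reuse the running entry if the job already moved,
        # and create a fresh entry if the id was never recorded as pending.
        job = self.running_jobs.get(job_id)
        if job is None:
            job = self.pending_jobs.pop(job_id, {"workers": []})
            self.running_jobs[job_id] = job
        job["workers"].append(address)


tracker = JobTracker()
tracker.submit("74905774")
tracker.add_worker_strict("74905774", "tcp://10.205.103.50:35780")

strict_error = None
try:
    # A second registration for the same job id finds nothing left to pop.
    tracker.add_worker_strict("74905774", "tcp://10.205.103.50:35781")
except KeyError as err:
    strict_error = err.args[0]

# The tolerant variant accepts the same registration without raising.
tracker.add_worker_tolerant("74905774", "tcp://10.205.103.50:35781")
```

If the real cause is something like this, the fix would belong in the plugin's bookkeeping rather than in my script.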
I think there might be a problem with LSFCluster, because it puts a lot of workers in cluster.finished_jobs that are still running according to bjobs and even according to the dask.distributed web interface.
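To make that mismatch concrete, a small helper can list the job ids that are filed as finished but that the batch system still reports as running. This is only a sketch: `misfiled_jobs` is a hypothetical name, the ids below are example values, and it assumes `cluster.finished_jobs` is keyed by job id and that the running ids have been collected from `bjobs` separately.

```python
def misfiled_jobs(finished_jobs, scheduler_running_ids):
    """Return job ids marked finished that are still reported as running."""
    # Normalise to strings so ids read from a dict and ids parsed from
    # command output compare equal.
    finished = {str(job_id) for job_id in finished_jobs}
    running = {str(job_id) for job_id in scheduler_running_ids}
    return sorted(finished & running)


# Hypothetical values; in a live session the first argument could be
# cluster.finished_jobs and the second the job ids listed by bjobs.
finished = {"74905774": {}, "74905775": {}}
still_running = ["74905774", "74905780"]
suspicious = misfiled_jobs(finished, still_running)
```

Any id that ends up in `suspicious` is a job the cluster object considers done while LSF does not.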