
more adaptive scaling fixes #97

Merged: 20 commits into dask:master on Oct 7, 2018

Conversation

@jhamman (Member) commented Jul 17, 2018

@lesteve submitted a review on #63 after we merged. The changes he requested were mostly small so I'm just addressing them here.

closes #112

@mrocklin (Member):

Some interesting failures here

@jhamman (Member, Author) commented Jul 18, 2018

Some interesting failures here

I agree. I plan to dig into them further tomorrow. I think the switch to using scale() has triggered some additional race conditions that we'll want to chase down.

@jhamman (Member, Author) commented Aug 7, 2018

@lesteve - may I ask for a review here?

If anyone has any ideas on how to move forward here, I would certainly appreciate it. It seems that my change to scale() instead of stop_workers() in our tests has exposed a bug somewhere.

        self.start_workers(n - active_and_pending)

    def scale_down(self, workers):
        ''' Close the workers with the given addresses '''
-       logger.debug("Scaling down. Workers: %s" % workers)
+       logger.debug("Scaling down. Workers: %s", workers)
        worker_states = []
        for w in workers:
@jhamman (Member, Author):

@mrocklin - I've been working on debugging the failures we're seeing and I may have stumbled onto a real bug. I'll lay out what I can understand.

  • when we call scale(0), we correctly end up in the scale_down method
  • however, workers is of type _asyncio.Future rather than a list of workers
  • workers.result() yields an empty list

From here, I've tried a bunch of things including:

  • removing the loop.add_callback calls in scale
  • tweaking the remove/close_workers in retire_workers

But alas, I'm pretty lost. Any pointers would be appreciated.

@mrocklin (Member):

Yup, lost with reason. That looks like a pretty clear bug. I'm not sure how that ever would have worked.

For context: Scheduler.retire_workers is a coroutine:

    @gen.coroutine
    def retire_workers(self, comm=None, workers=None, remove=True,
                       close_workers=False, **kwargs):

This means that it is meant to be called within other coroutines, with a yield statement:

    @gen.coroutine
    def f():
        yield scheduler.retire_workers(...)

This is true of any internal dask scheduler method that communicates with other things or does anything that might take any non-trivial amount of time.

In this way, coroutines are a bit viral, which in our case is bad, because we don't want scale to be a coroutine, because users use it, and users get confused by coroutines.

Another approach is to call coroutines with add_callback, which says "run this whenever you have a moment, but let's not deal with it now, because it's a bit messy".

So probably the thing to do here is: in distributed/deploy/cluster.py::Cluster.scale we want to call self.scheduler.workers_to_close, which is thankfully just a normal method, then add self.scheduler.retire_workers as a callback with those workers to be run in just a moment, and then call scale_down as we do currently (also as a callback, just in case the cluster object implements it as a coroutine as well).

I can do this, but I'd also be very happy to guide someone else through it in the interests of spreading some of this knowledge around.
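
For reference, here is a minimal sketch of the approach described above, not the actual patch in this PR. It assumes a Cluster subclass that already provides scale_up, scale_down, and a self.scheduler, and that Scheduler.workers_to_close accepts an n= keyword:

    from distributed.deploy.cluster import Cluster

    class MyJobQueueCluster(Cluster):

        def scale(self, n):
            """Scale to n workers without making scale() itself a coroutine."""
            active = len(self.scheduler.workers)
            if n >= active:
                return self.scale_up(n)

            # workers_to_close() is a plain synchronous Scheduler method,
            # so it is safe to call directly here.
            to_close = self.scheduler.workers_to_close(n=active - n)

            # retire_workers() is a coroutine: schedule it on the event loop
            # rather than calling it directly from this synchronous method.
            self.scheduler.loop.add_callback(
                self.scheduler.retire_workers, workers=to_close)

            # scale_down() may itself be implemented as a coroutine by a
            # subclass, so schedule it as a callback as well.
            self.scheduler.loop.add_callback(self.scale_down, to_close)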

@mrocklin (Member):

To be clear, the problem here is that we were calling it as a normal function

    def scale(...):
        ...

        to_close = self.scheduler.retire_workers(...)

without a yield, within a function that was not a coroutine. In this case it returns an opaque asyncio/tornado Future object. That is never a good thing to see and in our case almost always signifies a bug.
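
A standalone illustration of that failure mode, as a sketch using Tornado's gen.coroutine with a made-up worker address:

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def retire_workers():
        # stand-in for Scheduler.retire_workers
        raise gen.Return(['tcp://10.0.0.1:40331'])

    @gen.coroutine
    def demo():
        workers = retire_workers()        # no ``yield``: an opaque Future, not a list
        print(type(workers))
        workers = yield retire_workers()  # with ``yield``: the actual result
        print(workers)                    # ['tcp://10.0.0.1:40331']

    IOLoop.current().run_sync(demo)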

@jhamman (Member, Author):

@mrocklin - I'm happy to push this forward if you can provide some hand-holding. https://github.com/jhamman/distributed/commit/ad0256006bc7d5018277da0d6c686224ffaaa8fe is a first attempt based on my reading of what is above. This does cause some tests to fail, but I think it is worth sharing in its current form for initial feedback.

@guillaumeeb (Member) left a comment:

So you chose to overload the scale method in LocalCluster. Is this really simpler than handling active_and_pending jobs in scale_up?

            jobs = list(self.pending_jobs.keys())[to_kill:]
            self.stop_jobs(jobs)
        else:
            # we need to retire some workers (and maybe pending jobs)
@guillaumeeb (Member):

I think that here we need to remove all pending jobs, so the comment seems wrong, or is there something I don't understand?

@jhamman (Member, Author):

Yes, this comment is a bit misleading. What I meant to say was that all pending jobs will be killed too.

        jid = out.split('<')[1].split('>')[0].strip()
        if not jid:
            raise ValueError('Unable to parse jobid from output of %s' % out)
        return jid
@guillaumeeb (Member):

Shouldn't we wrap the job-id inside core.py, when we call _job_id_from_submit_output? It would be simpler.

"""
with log_errors():
active_and_pending = self._count_active_and_pending_workers()
if n >= active_and_pending:
@jhamman (Member, Author):

These two lines are why I am currently overloading the scale method. We need to know whether we want to scale up or scale down, and that depends on the number of pending workers. The current Cluster.scale() method does not have a way to evaluate pending workers.

@guillaumeeb (Member):

This part is currently done in scale_up, maybe not correctly, but what prevents us from doing it there?

Something like this:

    def scale_up(self, n, **kwargs):
        """ Bring the total worker count up to ``n`` """
        active_and_pending = self._count_active_and_pending_workers()
        if n >= active_and_pending:
            self.start_workers(n - active_and_pending)
        else:
            n_to_close = active_and_pending - n
            if n_to_close < self._count_pending_workers():
                # We only need to kill some pending jobs
                to_kill = int(n_to_close / self.worker_processes)
                jobs = list(self.pending_jobs.keys())[to_kill:]
                self.stop_jobs(jobs)
            else:
                # We should never be asked to close running workers from scale_up
                raise RuntimeError("scale_up should never need to close running workers")

@jhamman changed the title from "misc fixes to logging and tests" to "more adaptive scaling fixes" on Aug 14, 2018
@guillaumeeb (Member):

I will give this one a go, probably next week, as discussed in #130. Thanks @jhamman.

@guillaumeeb (Member):

Gave this PR a try today. Works fine! Actually, I don't have any problem on the master branch when using scale and adaptive, and this one works well too (no noticeable change from a user perspective).

Just one case in which I see some faulty behaviour (but I see it on master too):

# Create some cluster
cluster = PBSCluster(processes=2, cores=4, memory="20GB")
# Scale workers up
cluster.scale(8)
# Correct number of jobs and worker showing up
# Do things, then scale down
cluster.scale(4)
# Only one job and two workers left in running jobs:
cluster.running_jobs
 Out:   OrderedDict([('6732209',
              {'dask-worker--6732209---0': <Worker 'tcp://10.135.36.116:40331', memory: 0, processing: 0>,
               'dask-worker--6732209---1': <Worker 'tcp://10.135.36.116:35941', memory: 0, processing: 0>})])

One way or another, it looks like something's wrong with

_adaptive_options = {
        'worker_key': lambda ws: _job_id_from_worker_name(ws.name)}
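
For context, worker_key maps each worker to its job so that adaptive scaling retires whole jobs at a time. A rough sketch of what that helper presumably does, based on the worker names shown above (the real implementation in dask-jobqueue may differ):

    def _job_id_from_worker_name(name):
        """Extract the job id from a worker name like 'dask-worker--6732209---0'."""
        return name.split('--')[1]

    assert _job_id_from_worker_name('dask-worker--6732209---0') == '6732209'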

@guillaumeeb (Member):

Could we enable some debug logs in the tests here, to get more insight into why they are failing?

@guillaumeeb (Member):

So taking another look at this:

@jhamman (Member, Author) commented Sep 18, 2018

@guillaumeeb - thanks for taking a look at this. My attention has obviously been elsewhere lately, so I appreciate the persistence on your part.

@guillaumeeb (Member) commented Sep 18, 2018

So I propose this new approach without overriding scale. And if we manage to fix the upstream issue dask/distributed#2257, this should be a good step.

Not sure how to handle the test test_basic_scale_edge_case though.

@guillaumeeb (Member):

So as discussed previously, the scale(0) operation here https://github.com/dask/dask-jobqueue/pull/97/files#diff-660047dbe8333ae717d8f94cc4162529R127 is basically a no-op, as cluster state is evaluated before scale(2) has made any modification to the current state. This is why the associated test fails.

In order to advance here, I propose to disable that test for the time being and to merge this PR, as it provides some nice enhancements to dask-jobqueue.

I then propose to implement dask/distributed#2257 in another PR directly in dask-jobqueue, meaning overloading scale or even the upstream Cluster object. This should hopefully fix the failing test here. As discussed in #130, it should provide some interesting lessons for dask/distributed#2235.
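
Roughly, the edge case the failing test exercises looks like this, as a sketch with the resource arguments copied from the earlier example:

    from dask_jobqueue import PBSCluster

    cluster = PBSCluster(processes=2, cores=4, memory="20GB")
    cluster.scale(2)  # jobs are only submitted asynchronously
    cluster.scale(0)  # evaluates cluster state before scale(2) has changed it,
                      # so it sees nothing to shut down and is effectively a no-op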

@jhamman (Member, Author) commented Oct 6, 2018

Thanks @guillaumeeb. I like the plan you've laid out. Happy to see this merged as is.

@guillaumeeb merged commit ee6e79e into dask:master on Oct 7, 2018
Successfully merging this pull request may close these issues: Edge cases of Cluster.scale()