Fix scale edge cases #171

Merged 12 commits into dask:master on Oct 29, 2018

Conversation

guillaumeeb
Member

Fixes #112.

Implementation of dask/distributed#2257.

First step towards #170.

Must wait for #97 to be merged before activating the test test_basic_scale_edge_cases.

@guillaumeeb
Member Author

A question for @mrocklin here: as you mentioned in #97 (comment), retire_workers() is a coroutine. Should the whole _scale() function be made a coroutine and retire_workers() be called with a yield?

@mrocklin
Member

mrocklin commented Oct 7, 2018

Should the whole _scale() function be made a coroutine and retire_workers() be called with a yield?

Perhaps. There are a few ways to do this. For example it looks like in dask-kubernetes we make a small coroutine within the normal scale method and add it to the event loop asynchronously.

https://github.com/dask/dask-kubernetes/blob/a88ace69cab03aa9ce4bdb005bbb1f0be2d831d4/dask_kubernetes/core.py#L376-L385

But generally speaking it is a common pattern to make a _foo coroutine that gets called by a synchronous function foo.
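
For reference, a minimal sketch of that foo/_foo pattern (illustrative only, using Tornado as dask did at the time; the class and attribute names here are placeholders, not dask-jobqueue's actual API):

    from tornado import gen
    from tornado.ioloop import IOLoop

    class ClusterSketch:
        def __init__(self, loop):
            self.loop = loop

        def scale(self, n):
            # Synchronous entry point: hand the real work to the event loop
            # and return immediately.
            self.loop.add_callback(self._scale, n)

        @gen.coroutine
        def _scale(self, n):
            # Coroutine implementation: free to yield on other coroutines,
            # e.g. yield self.scheduler.retire_workers(...).
            yield gen.sleep(0.1)
            print("scaled to", n)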

@mrocklin
Member

mrocklin commented Oct 7, 2018

Also if you have any questions about writing async code I'm happy to help. I really enjoy it now, but it does require some learning if you are not already familiar with it.

@guillaumeeb
Member Author

Interesting to see that dask-kubernetes already overrides the scale method. I'll have a closer look at what it does.

@guillaumeeb
Member Author

Okay, so the race condition identified in #112 seems fixed.

Work on building a ClusterManager has begun, but only within the scope of this PR.

Some questions left here:

  • Should _scale() be made a coroutine?
  • I use a lock in _scale(), which probably means it should not be a coroutine, but is the lock used correctly?

Reviews welcome.


This allows doing every operation within a coherent context
"""
with log_errors(), self._lock:
Member

I use a lock in _scale(), which probably means it should not be a coroutine, but is the lock used correctly?

As written the lock is not necessary, as long as people call scale rather than _scale. The scale method adds _scale onto the event loop:

    self.scheduler.loop.add_callback(self._scale, n)

So any thread can safely call scale, which will ask the event loop thread to call _scale the next time it is free. The _scale method will therefore only ever run on the single event loop thread. There is no need to protect it with a lock.

Generally speaking when using async frameworks like asyncio or tornado it is very rare to use locks. Instead we protect ourselves by putting all concurrent code on the event loop.
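
A small illustration of that point, reusing the hypothetical ClusterSketch from the sketch above (it relies on Tornado's IOLoop.add_callback being safe to call from any thread):

    import threading
    from tornado.ioloop import IOLoop

    loop = IOLoop.current()
    cluster = ClusterSketch(loop)

    # Any thread may call scale(); the event loop serializes the resulting
    # _scale callbacks, so no lock is needed.
    threading.Thread(target=cluster.scale, args=(5,)).start()
    threading.Thread(target=cluster.scale, args=(10,)).start()

    loop.call_later(0.5, loop.stop)
    loop.start()  # the event loop thread runs both _scale coroutines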

Should _scale() be made a coroutine?

Maybe. It calls the method self.scheduler.retire_workers. This method is a coroutine that waits for a response from the workers-to-be-closed. Should _scale also wait until it gets this response? If so then we'll want to yield/await that method call:

# self.scheduler.retire_workers(workers=to_close)
yield self.scheduler.retire_workers(workers=to_close)

If you do that then you will need to make _scale a coroutine as well. You may not want this though, I'm not sure. That becomes a design decision. So I ask a question back to you:

Should _scale wait until the workers have closed themselves before calling scale_down?
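
To make the trade-off concrete, a hedged sketch of what a waiting _scale could look like (workers_to_close, retire_workers and scale_down are the names used in this thread; the surrounding class wiring is assumed):

    from tornado import gen

    @gen.coroutine
    def _scale(self, n):
        if n < len(self.scheduler.workers):
            to_close = self.scheduler.workers_to_close(
                n=len(self.scheduler.workers) - n)
            # Wait until the workers have handed off their data and shut down...
            yield self.scheduler.retire_workers(workers=to_close)
            # ...before asking the resource manager to kill the corresponding jobs.
            self.scale_down(to_close)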

Member Author

Should _scale wait until the workers have closed themselves before calling scale_down?

Yes, I believe so, for a clean worker shutdown. Otherwise there is a risk of losing in-memory data. From what I understand, currently we ask the scheduler to shut workers down cleanly, but then just kill them without waiting for it.

Member

That's what the code does currently as I read it, yes.

So you would wait on the tornado future returned by self.scheduler.retire_workers by calling yield self.scheduler.retire_workers(...). Once you do that you need to make the _scale method a coroutine.

If you want the scale function to block then I recommend using the distributed.utils.sync function.

def scale(...):
    sync(loop, self._scale, ...)

This adds the _scale coroutine to the event loop, then waits on a threading.Event until it has finished. This must be called from a thread other than the event loop.
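
Filled out slightly, a sketch of that blocking variant (assuming self.scheduler.loop is the event loop and _scale is a coroutine; the PR ultimately does not take this route):

    from distributed.utils import sync

    def scale(self, n):
        # Schedule the _scale coroutine on the event loop and block this
        # (non-event-loop) thread until it finishes.
        return sync(self.scheduler.loop, self._scale, n)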

Member Author

Thanks. Does the code in Cluster.scale()

self.scheduler.loop.add_callback(self.scheduler.retire_workers, workers=to_close)
self.scheduler.loop.add_callback(self.scale_down, to_close)

ensure that scale_down will be called after retire_workers has finished?

Member

No. It will start retire_workers and run it until it hits a yield point. It will start scale_down after that (although other things may run in between).
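
A small, self-contained demonstration of that ordering, with hypothetical stand-ins for the two scheduler calls:

    from tornado import gen
    from tornado.ioloop import IOLoop

    @gen.coroutine
    def retire_workers_like():
        print("retire: start")
        yield gen.sleep(0.5)  # yield point, e.g. waiting on the workers
        print("retire: done")

    def scale_down_like():
        print("scale_down called")

    loop = IOLoop.current()
    loop.add_callback(retire_workers_like)
    loop.add_callback(scale_down_like)
    loop.call_later(1, loop.stop)
    loop.start()
    # Typical output:
    #   retire: start
    #   scale_down called   <- runs before retire_workers_like finishes
    #   retire: done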

Member Author

If you want the scale function to block then I recommend using the distributed.utils.sync function

I don't think I want that.

Otherwise, the updates are done and all questions answered. Ready for further review if the Travis build passes.

@@ -35,7 +34,8 @@ def _scale(self, n):
         to_close = self.scheduler.workers_to_close(
             n=len(self.scheduler.workers) - n)
         logger.debug("Closing workers: %s", to_close)
-        self.scheduler.retire_workers(workers=to_close)
+        # Should be an RPC call here
+        yield self.scheduler.retire_workers(workers=to_close)
Member

As a warning, now that you've added a yield in this coroutine, another coroutine can start running while this one waits for a response. It is entirely possible that two _scale coroutines will be active at the same time.

You still can't use a threading.Lock to fix this (threading locks will lock the entire event loop). You can use a Tornado lock, or a few other methods. Short term I wouldn't worry about it though.
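
If it ever does become a problem, a hedged sketch of the Tornado lock approach (illustrative only; as noted above, this is not needed short term):

    from tornado import gen, locks

    _scale_lock = locks.Lock()

    @gen.coroutine
    def _scale(self, n):
        with (yield _scale_lock.acquire()):
            # Only one _scale body runs at a time, even across yield points,
            # while other coroutines on the loop keep running.
            yield gen.sleep(0)  # stands in for yield self.scheduler.retire_workers(...)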

Member Author

Thanks for the clarification. I don't think this is an issue yet.

@guillaumeeb
Member Author

So @mrocklin @jhamman, is it OK to merge this and go on with #170?

@mrocklin
Member

One question: it seems like this might be the start of many changes. Do we want to issue a release before that happens?

@guillaumeeb
Member Author

Version 0.4.0 is not that old, but some minor changes may have happened since then. I need to look in detail, but maybe a 0.4.1?

@mrocklin
Member

No thoughts on the version number. 0.4.1 seems fine to me. Mostly I want to avoid the situation where someone wants some of the recent changes but we feel uncomfortable releasing because of some of the new changes. It is a small thought though and not very important.

@guillaumeeb guillaumeeb mentioned this pull request Oct 15, 2018
@guillaumeeb
Member Author

I discovered when merging master and performing some tests that adaptive was directly calling scale_up and scale_down.

I'm not sure this is correct; I currently feel that adaptive should rely only on scale(). The ClusterManager should then know whether it needs to gracefully retire workers or do anything else. This may be something to keep aside for later. cc @mrocklin.

Otherwise I think this is ready to go in if the Travis build succeeds. This already fixes some bugs I observed when scaling with or without adaptive (an adaptive endless loop, or issues with multiple scale calls).

@guillaumeeb guillaumeeb merged commit 9fe5240 into dask:master Oct 29, 2018