
Locally rerun successfully completed futures #4813

Merged · 9 commits · Jun 14, 2021

Conversation

@ms7463 (Contributor) commented May 13, 2021

Initial concept for locally rerunning a successfully completed future (functionality for rerunning failed futures already exists; discussed in ticket #4804). The reason you'd want to do this is, for example, if a task completes successfully but gives results you wouldn't expect. With this PR you can now do something like this:

>>> import pdb

>>> pdb.runcall(client.recreate_task_locally, future)
(Pdb) b path/to/file:<line no>
(Pdb) c
<etc.>

I believe this functionality is a superset of recreate_error_locally, but I'm leaving that in place in case you want to ensure that the future you're calling actually did fail.

In the interest of reusing as much code as possible, I reimplemented some of the existing methods as a more general version that takes a callback, so the specific recreate_task/recreate_error paths can pass in their relevant differences via that callback. Not married to this idea, though.
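Roughly, the pattern looks like this (a simplified sketch with illustrative names, not the exact code in the PR):

```python
# Simplified sketch of the callback-based generalization described above; the
# method and callback names here are illustrative only, not the PR's actual API.
async def _recreate_components(self, future, key_getter):
    """Shared machinery: resolve a key via the callback, then fetch its run spec."""
    key = await key_getter(future)
    return await self.scheduler.get_runspec(key=key)


async def recreate_task_locally(self, future):
    async def own_key(f):
        # rerun the future itself
        return f.key

    return await self._recreate_components(future, own_key)


async def recreate_error_locally(self, future):
    async def errored_key(f):
        # rerun the task that the scheduler blames for the failure
        return await self.scheduler.get_error_cause(key=f.key)

    return await self._recreate_components(future, errored_key)
```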

@fjetter (Member) left a comment:

A second opinion about this new feature would be appreciated but I think it makes sense. I'm also not entirely sure how backwards/forwards compatible we typically want these protocols to be, cc @jrbourbeau

Regarding the implementation, I would suggest a slightly different approach. With the component_getter, details_extractor, and runspec_getter arguments, I feel we are overdoing it a bit in terms of flexibility and creating a more complex plugin than necessary.

Is there a particular reason why we'd need these hooks? My impression is that they require a very thorough understanding of the extension and the scheduler and are not well suited for a high-level API. Therefore, I imagine they would actually only be used by us in this implementation. If this is correct, I would propose a slightly different interface to reduce complexity. Pseudo code below, but all the necessary pieces to put this together are already in this PR.

from typing import Dict, Optional


class ReplayTaskSchedulerExtension:
    ...

    def get_runspec(self, key) -> Dict:
        """
        Returns the runspec of a single key
        """
        ts = self.tasks[key]
        ...
        return {
            "task": ts.run_spec,
            "deps": [dts.key for dts in ts.dependencies],
        }

    def get_error_cause(self, key) -> Optional[str]:
        """
        Returns the key of the task blamed for this key's error, if any
        """
        ts = self.tasks[key]
        return ts.exception_blame.key


class ReplayTaskClient:

    async def recreate_error_locally(self, future):
        cause = await self.scheduler.get_error_cause(future.key)
        spec_dct = await self.scheduler.get_runspec(cause)
        return await self._execute_spec(spec_dct)

    async def recreate_task_locally(self, future):
        spec_dct = await self.scheduler.get_runspec(future.key)
        return await self._execute_spec(spec_dct)

    async def _execute_spec(self, spec_dct):
        """Unpack and execute the response of
        `ReplayTaskSchedulerExtension.get_runspec`
        """
        # Graph to future, pack, unpack, etc.
        pass
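As a rough illustration of what that last `_execute_spec` step could involve (a sketch under assumptions: the run spec is treated as a plain (func, args, kwargs) triple, and the key-substitution helper is made up, not distributed's real packing/unpacking code):

```python
import distributed


def _substitute(obj, values):
    """Recursively swap dependency keys found in obj for their gathered values."""
    try:
        if isinstance(obj, (str, tuple)) and obj in values:
            return values[obj]
    except TypeError:  # unhashable contents; not a key
        pass
    if isinstance(obj, (list, tuple)):
        return type(obj)(_substitute(x, values) for x in obj)
    if isinstance(obj, dict):
        return {k: _substitute(v, values) for k, v in obj.items()}
    return obj


async def _execute_spec(client, spec_dct):
    # Assumes spec_dct["task"] is already a plain (func, args, kwargs) triple;
    # in reality the stored run_spec is serialized and has to be unpacked first.
    func, args, kwargs = spec_dct["task"]
    # gather the dependency results from the cluster, keyed by task key
    deps = {k: distributed.Future(k, client=client) for k in spec_dct["deps"]}
    values = await client.gather(deps, asynchronous=True)
    # rebuild args/kwargs with real values in place of raw keys and run locally
    args = _substitute(list(args), values)
    kwargs = _substitute(kwargs, values)
    return func(*args, **kwargs)
```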

"""

def __init__(self, scheduler):
self.scheduler = scheduler
self.scheduler.handlers["cause_of_failure"] = self.cause_of_failure
self.scheduler.handlers["get_runspec"] = self.get_runspec
self.scheduler.extensions["exceptions"] = self
Member:

We might want to rename this extension here, e.g. replay-tasks?

Contributor (author):

Renamed. Does this need to be changed elsewhere? It's kind of hard to tell which references to "exceptions" in the code base are relevant to this and which aren't.

Member:

I think not. If in doubt, keep it as is. Once you have a green CI you can still rename it at the end.

@ms7463 (Contributor, author) commented May 14, 2021

The only reason for the hooks was to maintain as much of the original structure/code as possible while allowing for different implementations. I'm not sure of the proper term for it, but it's kind of like functional-style inheritance without classes. In any case, if we're OK with changing the structure of the code (and the API slightly), then I think your suggestions make sense. I'll take a look over the weekend.

await wait(future)
# One reason not to pass the spec into this function is that we may want to
# expose a method to get all the components/deps of the future. This way
# that will be easier to do.
Contributor (author):

@fjetter - I deviated a bit from your suggestion to reference/fetch the components by spec, for the reason stated above. This comment refers to a method similar to the current get_futures_error, which I removed for now to cut some noise from the PR while restructuring, but I can add it back. Although I think recreate_task_locally(run=False) would provide most of that functionality (with the deps resolved instead of raw).

Member:

I do not mind a different structure. I do care about API surface, though, and would like to only introduce features as needed.

For instance, the run argument is something I feel is not necessary. It adds clutter to the function signature and changes the behaviour of the function significantly: instead of returning a tuple with well-defined semantics, it will execute a function and return whatever that produces.
For the debugging use case it sounds like run=False is actually what you are looking for. What about just exposing that as part of the API and telling users to call the function themselves?
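In other words, something along these lines (illustrative only; the method name is a placeholder, not an agreed API):

```python
# Illustrative only: `get_components_from_future` is a placeholder name for the
# proposed "give me func/args/kwargs, I'll call it myself" API.
function, args, kwargs = client.get_components_from_future(future)

# inspect the inputs, set breakpoints, etc., then run the task locally yourself
result = function(*args, **kwargs)
```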

Contributor (author):

Got it. I'll remove the run arg so that each method only has one way of processing things. I think it would be worth making a final decision on the set of publicly exposed methods before I add any more commits. We have all the pieces now to make it work, so deciding on this set is all that should be left.

In my eyes, this is the superset of what we could potentially expose (names can be shortened; they're potentially overly descriptive currently):

get_raw_components_from_future - returns func and raw args/kwargs, including raw unresolved future inputs/deps
get_raw_components_from_errored_future - same as above but implicitly calls get_errored_future;
                                         equivalent to the current implementation of get_futures_error
get_components_from_future - similar to raw_components, but args/kwargs/deps are resolved/gathered
get_components_from_errored_future - same as above but implicitly calls get_errored_future
get_errored_future - takes a future collection object and returns the future responsible for causing the error
recreate_task_locally - runs the code for a future on your local machine
recreate_error_locally - same as above but implicitly calls get_errored_future

Personally, I think the only things we should expose are recreate_task_locally and recreate_error_locally. I feel the case where you want just the components but don't want to execute the future's code is uncommon for something that's supposed to be a debugging tool, especially since you can inspect the args/kwargs through pdb when using recreate_*_locally.
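For illustration, usage of those two methods would look roughly like this (a sketch only; `future` is a completed future and `failed` an errored one, both assumed from the surrounding discussion):

```python
# Sketch of the two proposed public entry points, not final API docs.

# rerun a successfully completed task locally, e.g. under a debugger
result = client.recreate_task_locally(future)

# find the task that caused a failure and rerun it locally so the
# exception is raised in your own process
client.recreate_error_locally(failed)
```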

Member:

Sounds good to me. Keeping the recreate_*_locally methods seems intuitive to me.

@ms7463 (Contributor, author), May 22, 2021:

@fjetter - made these changes. I also added some recreate_task_locally tests analogous to the existing recreate_error_locally tests (which I also slightly updated to account for the private API changes). All of them pass in my pytest runs (scoped to just the recreate_* tests). I also added some preliminary docstrings, which could probably use some work.
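To give a sense of shape, one of the new tests looks roughly like this (an illustrative sketch only, not the exact test added in the PR; the exact call and await form may differ):

```python
from distributed import wait
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_recreate_task_simple(c, s, a, b):
    # Illustrative sketch: submit a task, let it finish successfully,
    # then rerun it locally and check the recomputed result matches.
    f = c.submit(sum, [1, 2, 3])
    await wait(f)
    result = await c.recreate_task_locally(f)
    assert result == 6
```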

@ms7463 (Contributor, author) commented May 20, 2021

@fjetter - any thoughts on the general direction of these new changes? Do you think it's worth pushing this current setup forward with some tests, minor API tweaks, and docstrings?

@ms7463 (Contributor, author) commented May 22, 2021

The commits above under the name "darle" are from me. I committed those from a different computer where the global git name wasn't set to mine. It looks like there are ways to amend this and replace the commits, but I don't want to mess anything up, so I'll leave it as is.

@gen_cluster(client=True)
async def test_recreate_task_array(c, s, a, b):
    da = pytest.importorskip("dask.array")
    pytest.importorskip("scipy")
Member:

This skip isn't necessary, is it?

Contributor (author):

Copied this from the existing test_recreate_error_array test. It looks like it's necessary in that one (not sure why), but not in this one. Removed.

Comment on lines 4747 to 4748
assert function(*args, **kwargs) == [3628800, 1814400, 1209600, 907200, 725760,
604800, 518400, 453600, 403200, 362880]
Member:

This will probably collide with black. I recommend setting up pre-commit or running the formatting yourself before pushing.

Contributor (author):

Updated with black formatting.

if not futures:
    raise ValueError("No errored futures passed")
cause = await self.scheduler.get_error_cause(keys=futures)
return Future(cause)
Member:

Is this Future instantiation necessary? Futures are somewhat brittle objects and leak some abstraction: they notify the scheduler and perform reference counting, and that reference counting usually needs to be protected by a threading lock (see Client._refcount_lock). I'm wondering if this could all be implemented by just passing keys around, without touching the ref counting.

Contributor (author):

Not necessary. Updated to work with the key (or future) instead.
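Roughly, the key-based version looks like this (a sketch with assumed internal names, not the exact merged code):

```python
from distributed import wait
from distributed.client import futures_of


# Sketch only: `_get_errored_future` is an assumed internal name; the point is
# that only keys cross the wire and no new Future objects are created here.
async def _get_errored_future(self, future):
    await wait(future)
    errored = [f.key for f in futures_of(future) if f.status == "error"]
    if not errored:
        raise ValueError("No errored futures passed")
    # the scheduler answers with the key of the task to blame
    cause_key = await self.scheduler.get_error_cause(keys=errored)
    return cause_key
```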

@ms7463 (Contributor, author) commented Jun 2, 2021

@fjetter - the only failures from the previous CI didn't seem to be related to the changes I made. I merged in main in case that resolves those issues. Also, any thoughts on the latest changes?

@fjetter (Member) left a comment:

Changes look good. I triggered CI. If nothing pops up there, this is in. Thanks for the thorough work @ArtinSarraf!

@ms7463 (Contributor, author) commented Jun 14, 2021

@fjetter it seems like the 2 failing tests are hanging on something. I don't think it's related to these changes, or if it is, it's not obvious to me how. I can intermittently recreate the error in local test runs (this one: tornado.util.TimeoutError: Operation timed out after 60 seconds), but it happens on random tests, and not when I rerun that specific test on its own.
Any ideas?

@fjetter (Member) commented Jun 14, 2021

Sorry for letting this sit for so long. The failures seem to be unrelated. One of them is #4859, and for the other I opened #4914.

Thanks for your contribution @ArtinSarraf !

@fjetter fjetter merged commit e690e82 into dask:main Jun 14, 2021
@ms7463 (Contributor, author) commented Jun 14, 2021

Great, thanks for your time and help @fjetter
