Improve recovery time in worker failure scenarios #3184
Hi @fjetter, my apologies for the delay in responding. You raised this when I was fairly busy, and I hadn't gone through old issues until now. Everything you say makes a lot of sense to me. I think that Dask could really use an active memory manager that reasons about replication in the way you describe. It's unlikely that I personally will build such a thing, but I think that you and your group could handle it easily. Here are some links to prior discussion that might be helpful in informing the design.

Neither is exactly what you are talking about, though. Ideally this would be a scheduler extension that keeps a desired replication count for every piece of data (informed either by heuristics or by explicit request from the user) and sends requests to workers to duplicate or remove data as necessary to keep things on target. This sounds like an interesting problem to solve, and highly valuable for many applications.

Another approach would be to have workers retire themselves gracefully as they leave the cluster. My guess is that when your nodes die there is some way to have them give you a bit of warning. In these cases you could have them call the worker's graceful shutdown routine.
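To make the scheduler-extension idea concrete, here is a rough sketch, assuming a `SchedulerPlugin` with a `transition` hook and the scheduler's existing `replicate` handler. The class name, the `track` method, and the fixed replica count are all made up for illustration, and the exact plugin hooks differ between distributed versions:

```python
# Hypothetical sketch only: a scheduler plugin that tries to keep selected
# keys replicated onto several workers. Plugin hook signatures and the
# Scheduler.replicate handler vary between distributed versions.
from distributed.diagnostics.plugin import SchedulerPlugin


class ReplicationManager(SchedulerPlugin):
    """Keep a desired replica count for explicitly registered keys."""

    def __init__(self, scheduler, desired_replicas=2):
        self.scheduler = scheduler
        self.desired_replicas = desired_replicas
        self.tracked_keys = set()  # keys the user asked us to protect

    def track(self, key):
        """Mark a key as valuable; the plugin will try to keep extra copies."""
        self.tracked_keys.add(key)

    def transition(self, key, start, finish, *args, **kwargs):
        # When a tracked key lands in memory, ask the scheduler to copy it
        # onto additional workers in the background.
        if finish == "memory" and key in self.tracked_keys:
            self.scheduler.loop.add_callback(
                self.scheduler.replicate, keys=[key], n=self.desired_replicas
            )

    def remove_worker(self, scheduler=None, worker=None, **kwargs):
        # A worker left or died: re-check replica counts for tracked keys.
        # A real implementation would first filter out keys that are no
        # longer in memory anywhere and therefore need recomputation instead.
        for key in self.tracked_keys:
            self.scheduler.loop.add_callback(
                self.scheduler.replicate, keys=[key], n=self.desired_replicas
            )
```

Registering it from a scheduler preload with something like `scheduler.add_plugin(ReplicationManager(scheduler))` and feeding `track` either from heuristics or from explicit user requests would give roughly the behaviour described above.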
This is already quite helpful. The memory manager sounds quite promising, but I will need to talk to my team first and will come back to you if we intend to pursue this further.

We were also discussing graceful downscaling, but there are still some issues for us:

1. Ideally the solution should handle ungraceful downscaling, since we cannot necessarily guarantee that the worker gets enough time to handle its shutdown gracefully. We realise that this is a big ask, though.
2. Mostly related to our setup, but we are not quite sure how we would even trigger the graceful downscaling. We wouldn't want to rely on Client connections, since we don't want to couple our cluster manager to any particular distributed version (protocol/API stability, etc.).

Regarding 2., a thought just popped into my mind; not sure if this is feasible yet. Did you ever think about signal handlers for the worker/nanny? Something like SIGTERM -> stop the calculation immediately and retire gracefully if possible.
Yes. There was some early attempt at this here, I think: #2844
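Independently of that PR, a minimal sketch of the signal-handler idea, assuming a worker preload module and the existing `Worker.close_gracefully` coroutine; the handler and module here are illustrative, not something distributed ships:

```python
# Illustrative sketch: a worker preload module that installs a SIGTERM
# handler so the worker retires gracefully when the platform stops it.
# Assumes the preload runs in the process's main thread, which is where
# Python requires signal handlers to be registered.
import signal


def dask_setup(worker):
    """Loaded at worker startup via --preload (hypothetical module path)."""

    def handle_sigterm(signum, frame):
        # Hand the actual shutdown to the worker's event loop; signal
        # handlers should not block or call coroutines directly.
        worker.loop.add_callback(worker.close_gracefully)

    signal.signal(signal.SIGTERM, handle_sigterm)
```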
I am also quite happy to engage here. I think that active memory management is important for many workloads. I have some rough thoughts on how to design this, but I don't currently have enough uninterrupted time to devote to it. I would be very happy to engage with regular meetings, reviews, and so on.
For anyone coming here with a similar issue, we will now try the following: teach our container scheduler to notify the worker up front when it is about to be taken down. This does not help with node failures, but for regular maintenance and auto-scaling it should be effective. We cannot solve this easily via SIGTERM, as we also run reverse proxies / sidecars and need a coordinated shutdown; in our case a simple HTTP method is the easiest solution for now. I am not sure whether such a thing would be of general interest upstream, so we are taking the following live for now as a tactical fix:

```python
import logging

import click

logger = logging.getLogger("distributed.custom")


@click.command()
@click.option("--graceful-shutdown-endpoint")
def dask_setup(worker, graceful_shutdown_endpoint):
    """ Loaded at Dask worker startup via --preload """
    try:
        from distributed.dashboard.utils import RequestHandler
        from distributed.dashboard.worker import routes

        class TerminationHandler(RequestHandler):
            """
            Custom HTTP handler to trigger a graceful shutdown via Aurora's
            HTTP Lifecycle
            """

            def post(self):
                logger.info(
                    "HTTP lifecycle triggered. Initiating graceful shutdown."
                )
                self.server.io_loop.add_callback(self.server.close_gracefully)
                self.write("shutting down")
                self.set_header("Content-Type", "text/plain")

        routes.append((graceful_shutdown_endpoint, TerminationHandler))
    except Exception:
        logger.exception(
            "Dask integration failed. Continuing without graceful worker termination"
        )
    else:
        logger.info(
            "Dask integration succeeded. Graceful worker termination activated"
        )
```

This module is then pre-loaded into the worker.
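To give an idea of how this is wired up (the module path and endpoint name below are ours, not anything distributed ships): the worker is started with something along the lines of `dask-worker scheduler-host:8786 --preload /opt/dask/graceful_shutdown.py --graceful-shutdown-endpoint /terminate`, and shortly before tearing a container down the container scheduler issues a plain `POST` to that path on the worker's dashboard HTTP server. The handler then schedules `close_gracefully` on the worker's event loop, which asks the scheduler to move the worker's in-memory results elsewhere before the process exits.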
We are operating our distributed clusters in a cloud environment where we need to deal with frequently failing nodes. We usually dispatch jobs automatically and are bound to certain SLAs, and therefore expect our jobs to finish in a more or less well-defined time.

While distributed offers resilience in the form of graph recalculation, that recalculation introduces severe performance penalties for us.

We are looking for something which would allow us to recover faster in scenarios where individual workers die, so that we do not need to recalculate large, expensive chunks of the graph, e.g. by persisting or replicating valuable, small intermediate results.

Ideally the solution would be handled by the scheduler itself, such that many different applications can benefit from it (e.g. via a scheduler plugin/extension). We were thinking about milestones/snapshotting, where the user can label certain results as worth replicating (and they can be forgotten again once another milestone completes). We also discussed some kind of automatic replication based on heuristics (e.g. `bytes_result < x and runtime of task > Y` -> replicate result) to soften the blow in case of failures. My questions would be:

Researching existing GitHub issues, I only found #2748, which discusses this scenario briefly but was ultimately closed without a proper resolution. The only solution suggested there is a caching library, but persisting every single result is most likely not an option for us.
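For reference, the closest existing knob is the manual `Client.replicate` call, which can approximate the milestone idea from the client side. A minimal sketch, in which the scheduler address, array sizes, and replica count are made up and the real ask remains a scheduler-side mechanism:

```python
# Client-side approximation of the "milestone" idea using the existing
# replicate API. Address, sizes, and replica count are illustrative only.
import dask.array as da
from distributed import Client, wait
from distributed.client import futures_of

client = Client("scheduler-host:8786")  # hypothetical scheduler address

# An expensive intermediate result we want to protect against worker loss.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
milestone = (x @ x.T).persist()  # keep the expensive product in cluster memory
wait(milestone)

# Copy the milestone's chunks onto three workers each, so losing a single
# worker does not force recomputation of the whole matrix product.
client.replicate(futures_of(milestone), n=3)

# Downstream work now restarts from the replicated milestone after a
# failure, instead of from the very beginning of the graph.
result = milestone.mean().compute()
print(result)
```

The drawback, and the reason for this issue, is that these calls have to be sprinkled in manually and the replica count is not maintained automatically after failures, which is exactly the gap a scheduler-side extension would close.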