Possible memory leak when using DaskExecutor in Kubernetes #3966
Comments
I think dask and kubernetes are unlikely to be the culprits here; rather, I suspect your executor configuration.

```python
# Swapping out the executor for this should do it
executor = LocalDaskExecutor(num_workers=6)
```
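For reference, a minimal sketch of how that suggestion plugs into a Prefect 1.x flow run (the `flow` object itself is whatever the original report builds; only the executor line comes from the comment above):

```python
from prefect.executors import LocalDaskExecutor

# `flow` is the Flow object defined in the original report.
# Run it on a local, multiprocess Dask scheduler instead of a
# distributed Dask cluster.
flow.run(executor=LocalDaskExecutor(num_workers=6))
```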
Hmmm, that's interesting. I'd still be surprised if this is a bug in dask. A larger reproducible example I could run locally would definitely help. If you drop the
For your flow above, I'd expect using a
👋 We have a similar setup, running Prefect in Kubernetes and using a DaskExecutor for our flow with 8 workers. All workers eat more and more memory until they all OOM. After a bit of trial and error, we added these environment variables to the Dask workers' YAML, which slows down the leak considerably. Credit to https://stackoverflow.com/questions/63680134/growing-memory-usage-leak-in-dask-distributed-profiler/63680548#63680548
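The variables themselves aren't reproduced above; going by the linked Stack Overflow answer, the idea is to slow the Dask worker profiler way down so it stops accumulating state. A sketch of the equivalent settings (the variable names follow dask's standard config-to-environment mapping; the values are illustrative, not necessarily what this commenter used):

```python
import os

# DASK_<SECTION>__<KEY> maps onto dask's config tree, so these correspond to
# distributed.worker.profile.interval and distributed.worker.profile.cycle.
# Raising them far above the defaults (10ms and 1000ms) effectively disables
# the profiler that the Stack Overflow answer identifies as the leak source.
os.environ["DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL"] = "10000ms"
os.environ["DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE"] = "1000000ms"
```

In a Kubernetes deployment these would normally go in as `env` entries on the Dask worker containers rather than being set from Python.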
Other related links:
This issue is stale because it has been open for 30 days with no activity. To keep this issue open, remove the stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.
Hello, I think I am seeing this issue, and it does not seem to be solved by using the environment variables mentioned above. We are using Prefect 1.4. A minimal working example:
I tried doubling and halving the values of those environment variables. We are observing a steady increase in memory usage, up to the point that the container gets OOMKilled.
When we use the flow without the
Hi @charalamm - Prefect 1.x is no longer under active development; please upgrade to 2.x or 3.x.
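For anyone following that advice, the rough equivalent of this issue's setup on Prefect 2.x/3.x uses the `prefect-dask` task runner. A sketch, assuming the `prefect-dask` package is installed (the worker counts are carried over from the original report, and the task body is a placeholder):

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def crunch(x):
    # stand-in for the CPU-intensive work / Kubernetes job launch
    return x * x

# Temporary local Dask cluster with 6 single-threaded workers, mirroring
# the DaskExecutor(6 workers, 1 thread) configuration from the report.
@flow(task_runner=DaskTaskRunner(
    cluster_kwargs={"n_workers": 6, "threads_per_worker": 1}
))
def my_flow():
    for i in range(15):
        crunch.submit(i)

if __name__ == "__main__":
    my_flow()
```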
@cicdw Thanks for your response. Is this a known problem with Prefect 1.x, though, and is there any known solution?
Description
I’m running Prefect in Kubernetes. I have a flow which spawns 15 CPU-intensive Kubernetes jobs. To get some parallelism, the flow is configured with a DaskExecutor (6 workers, 1 thread).
What I see is that the prefect-job created by the Kubernetes agent uses quite a lot of memory, and that usage grows over time. See the attached screenshot.
Note: I’m talking about the job created by the Prefect agent. The actual job executing the code is only using 188 MiB.
It seems like there is a possible memory leak in either Prefect or Dask. Is there a better way to deploy this that uses fewer resources?
Expected Behavior
First of all, I wouldn't expect memory usage this high, and above all I wouldn't expect it to keep growing indefinitely. In this instance the flow wasn't killed by Kubernetes (yet) due to high memory usage, but that is something I have run into.
Reproduction
Reproduction is a bit tricky as the flow is generated from a config file when the script runs. I think I got all the relevant bits below:
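The original snippet isn't reproduced above; a minimal sketch of the described setup on Prefect 1.x (0.14-era imports; the job-launching task is a hypothetical placeholder, and the numbers simply follow the description of 15 jobs and 6 single-threaded workers):

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def run_k8s_job(job_config):
    # Placeholder: the real flow creates a CPU-intensive Kubernetes Job here
    # (e.g. via the Kubernetes API) and waits for it to finish.
    ...

# One entry per job; in the real flow these come from a config file.
job_configs = [{"name": f"job-{i}"} for i in range(15)]

with Flow("spawn-k8s-jobs") as flow:
    run_k8s_job.map(job_configs)

# Temporary local Dask cluster: 6 workers with 1 thread each, as described.
flow.executor = DaskExecutor(
    cluster_kwargs={"n_workers": 6, "threads_per_worker": 1}
)
```

When run via the Kubernetes agent, the prefect-job pod hosts this flow run (and the temporary Dask cluster), which is where the growing memory usage is observed.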
Visualizing one of the flows shows:
Environment
Running Prefect Server 0.14.0 on Kubernetes 1.18.x in Google Cloud.
Originally reported on slack: https://prefect-community.slack.com/archives/CL09KU1K7/p1610706804087800