
Add support for Kubernetes tolerations #1207

Merged
22 commits merged into Netflix:master on Dec 21, 2022

Conversation

@odracci (Contributor) commented Dec 5, 2022:

This PR adds support for tolerations in the Kubernetes and Argo plugins.

Testing Done:

Using the script flow.py

from metaflow import (
    FlowSpec,
    step,
    timeout,
    kubernetes,
    resources,
)

DEFAULT_RESOURCES = {"cpu": "1", "gpu": "0", "memory": "1"}


class TestTolerationsFlow(FlowSpec):
    @resources(**DEFAULT_RESOURCES)
    @step
    def start(self):
        """
        This is the 'start' step. All flows must have a step named 'start' that
        is the first step in the flow.

        """
        self.next(self.end)

    @resources(**DEFAULT_RESOURCES)
    @step
    def end(self):
        """
        This is the 'end' step. All flows must have an 'end' step, which is the
        last step in the flow.

        """
        print("TestTolerationsFlow is all done.")


if __name__ == "__main__":
    TestTolerationsFlow()

Tested the following commands:

  • python3 flow.py run --with kubernetes
  • python3 flow.py run --with kubernetes:node_selector=app=cpu
  • python3 flow.py run --with kubernetes:node_selector=app=cpu,tolerations='[{"key":"app","value":"cpu","effect":"NoSchedule"}]'
  • METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' python3 flow.py run --with kubernetes
  • METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' METAFLOW_KUBERNETES_NODE_SELECTOR='app=cpu' python3 flow.py run --with kubernetes
  • METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' METAFLOW_KUBERNETES_NODE_SELECTOR='app=cpu' python3 flow.py argo-workflows create
  • python3 flow.py argo-workflows trigger
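
For reference, the value accepted on the CLI and in METAFLOW_KUBERNETES_TOLERATIONS is a JSON list of objects whose keys mirror the Kubernetes client's V1Toleration model. A minimal sketch of that mapping (assuming the kubernetes Python package is installed; an illustration, not necessarily the plugin's exact code):

import json

from kubernetes.client import V1Toleration

raw = '[{"key":"app","value":"cpu","effect":"NoSchedule"}]'
# Each JSON object becomes keyword arguments for a V1Toleration
tolerations = [V1Toleration(**t) for t in json.loads(raw)]
print(tolerations[0].key, tolerations[0].effect)  # -> app NoSchedule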

With the script flow_decorator.py

from metaflow import (
    FlowSpec,
    step,
    timeout,
    kubernetes,
    resources,
)

DEFAULT_RESOURCES = {"cpu": "1", "gpu": "0", "memory": "1"}
NODE_GROUP = "cpu"
KUBERNETES_NODE_SELECTOR = f"app={NODE_GROUP}"
KUBERNETES_TOLERATIONS = [{"key": "app", "effect": "NoSchedule", "value": NODE_GROUP}]


class TestTolerationsFlow(FlowSpec):
    @resources(**DEFAULT_RESOURCES)
    @kubernetes(node_selector=KUBERNETES_NODE_SELECTOR, tolerations=KUBERNETES_TOLERATIONS)
    @step
    def start(self):
        """
        This is the 'start' step. All flows must have a step named 'start' that
        is the first step in the flow.

        """
        self.next(self.end)

    @resources(**DEFAULT_RESOURCES)
    @kubernetes(node_selector=KUBERNETES_NODE_SELECTOR, tolerations=KUBERNETES_TOLERATIONS)
    @step
    def end(self):
        """
        This is the 'end' step. All flows must have an 'end' step, which is the
        last step in the flow.

        """
        print("TestTolerationsFlow is all done.")


if __name__ == "__main__":
    TestTolerationsFlow()

Tested the following commands:

  • python3 flow_decorator.py run
  • python3 flow_decorator.py argo-workflows create
  • python3 flow_decorator.py argo-workflows trigger

In all executions, I verified that the pods have the expected tolerations and node selectors.

@shrinandj (Contributor) commented:

Looks great overall! Can you add some details about the testing you did with this change? It would make reviewing a little easier (see the Testing Done section in this PR as an example).

@savingoyal (Collaborator) left a comment:

Some comments re: code organization. @shrinandj is doing a full review.

Review threads on metaflow/plugins/argo/argo_workflows.py and metaflow/plugins/kubernetes/kubernetes_job.py (all resolved).
@nflx-mf-bot (Collaborator) commented:

Testing[300] @ c21ede9

@nflx-mf-bot (Collaborator) commented:

Testing[300] @ c21ede9 PASSED

@shrinandj (Contributor) commented:

The PR itself looks good to me. Can you confirm that at least the following scenarios have been tested:

  • Basic test for tolerations. Create a flow --with kubernetes:tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}]. It should run successfully.
  • Basic test for node_selectors since that code was also touched. Create a flow --with kubernetes:tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}],node_selector="mynode=mylabel". It should run successfully.
  • Ensuring other options haven't regressed. Create a flow --with kubernetes:cpu=2,tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}]. It should run successfully.
  • Ensuring that default K8s backend works as expected with just... --with kubernetes

Also run the above tests with argo-workflows, and some basic smoke tests WITHOUT kubernetes or Argo.

@odracci (Contributor, Author) commented Dec 7, 2022:

@shrinandj Thanks for the review. I managed to move the input validation into the decorator. I did some quick tests, and it looks good. Could you please review my latest commits?

I will provide a test report based on your input in the following days.

@shrinandj (Contributor) commented:

> Could you please review my latest commits?

I will look into these latest commits later tonight.

  for toleration in self.attributes["tolerations"]:
-     invalid_keys = [k for k in toleration.keys() if k not in V1Toleration.attribute_map.keys()]
+     invalid_keys = [k for k in toleration.keys() if k not in attribute_map]
Contributor:

Why did this have to change? As K8s changes, these attributes could change, right?

Contributor:

(As compared to a previous commit, where V1Toleration.attribute_map.keys() was getting used.)

Contributor (Author):

When the flow runs on Argo, the kubernetes module is unavailable. I can make the validation optional, given that it is only required when the flow runs locally, and KubernetesClient raises an exception if the module is not installed. WDYT?

if self.attributes["tolerations"]:
    try:
        from kubernetes.client import V1Toleration
        for toleration in self.attributes["tolerations"]:
            invalid_keys = [k for k in toleration.keys() if k not in V1Toleration.attribute_map.keys()]
            if len(invalid_keys) > 0:
                raise KubernetesException(
                    "Tolerations parameter contains invalid keys: %s" % invalid_keys
                )
    except (NameError, ImportError):
        pass
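
For context, V1Toleration.attribute_map in the kubernetes client maps Python attribute names to Kubernetes API field names, so those attribute names are exactly the keys this check accepts. A quick way to inspect them (contents shown for a recent client version; they can evolve with the SDK):

from kubernetes.client import V1Toleration

# e.g. {"effect": "effect", "key": "key", "operator": "operator",
#       "toleration_seconds": "tolerationSeconds", "value": "value"}
print(V1Toleration.attribute_map)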

Contributor:

The above looks like a happy compromise to me, in which we try our best to validate AND keep up with upstream changes in K8s.

Contributor (Author):

@shrinandj I pushed the new code.

shrinandj previously approved these changes Dec 12, 2022.

@shrinandj (Contributor) left a comment:

LGTM! Great work!!

@savingoyal (Collaborator) left a comment:

LGTM! Just one minor issue needs to be addressed. We should be good to merge and release right after.

@@ -166,6 +174,9 @@ def echo(msg, stream="stderr", job_id=None):
     stdout_location = ds.get_log_location(TASK_LOG_SOURCE, "stdout")
     stderr_location = ds.get_log_location(TASK_LOG_SOURCE, "stderr")

+    # `node_selector` is a tuple of strings, convert it to a dictionary
+    node_selector = KubernetesDecorator.parse_node_selector(node_selector)
Collaborator:

Is this needed anymore?

Contributor (Author):

Yes, it is. At this stage, node_selector is a tuple of strings; kubernetes.launch_job expects a dictionary.

Collaborator:

Any reason not to handle this parsing within kubernetes_job? The actual format is dictated by the kubernetes SDK, which is why all the Kubernetes-related formatting currently happens within the KubernetesJob object. As the SDK evolves, any changes would be isolated to that object.

Contributor (Author):
node_selector contains the value generated by

@click.option(
    "--node-selector",
    multiple=True,
    default=None,
    help="NodeSelector for Kubernetes pod.",
)

which is a tuple of strings:

('key=val', 'foo=bar')

kubernetes_job expects a dictionary like:

{
  "key": "val",
  "foo": "bar",
}

parse_node_selector converts the tuple of strings to a dictionary compatible with the Kubernetes SDK.
Does that make sense?
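
For illustration, a minimal sketch of what such a conversion could look like (a hypothetical stand-in, not necessarily Metaflow's exact parse_node_selector):

def parse_node_selector(node_selector):
    # Converts ("key=val", "foo=bar") into {"key": "val", "foo": "bar"}
    parsed = {}
    for item in node_selector or ():
        key, _, value = item.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed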

# cased in kubernetes_client.py
if self.attributes["tolerations"]:
    try:
        from kubernetes.client import V1Toleration
Collaborator:

I understand that the rationale for including this check in __init__ is to ensure that it is invoked for argo-workflows too. However, this check will fail if the user hasn't installed the python package kubernetes yet - which is checked in package_init, and that check should technically happen before the check for tolerations.

Contributor (Author):

It is checked in kubernetes_cli. The idea is that the check is invoked only if required, which implies the python package kubernetes must be installed. I think the order of execution doesn't matter: if kubernetes is not available, self.attributes["tolerations"] is not used, so the check is not required. Does that make sense to you?

Collaborator:

When the user pip installs metaflow, we don't install the Kubernetes python package. It's only when the user starts executing a flow that involves @kubernetes or Argo that we throw a nice warning asking them to install the python package. Now, if that first flow has tolerations defined, the user will instead get an error saying no module named kubernetes.

Contributor (Author):
That import is inside a try block with

except (NameError, ImportError):
  pass

It should not raise any errors related to the missing module - is that correct?

Collaborator:

Yes - but it's the roundabout way this check is implemented that concerns me. We can ship this and come back to clean it up.

@savingoyal (Collaborator) commented:

Also, you might want to appease black: https://github.com/Netflix/metaflow/actions/runs/3661106684/jobs/6259649285

@odracci (Contributor, Author) commented Dec 14, 2022:

@shrinandj @savingoyal I improved the error handling in a1711b6.

@odracci (Contributor, Author) commented Dec 14, 2022:

> The PR itself looks good to me. Can you confirm that at least the following scenarios have been tested:
>
>   • Basic test for tolerations. Create a flow --with kubernetes:tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}]. It should run successfully.
>   • Basic test for node_selectors since that code was also touched. Create a flow --with kubernetes:tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}],node_selector="mynode=mylabel". It should run successfully.
>   • Ensuring other options haven't regressed. Create a flow --with kubernetes:cpu=2,tolerations=[{"key": "arch", "operator": "Equal", "value": "amd"}]. It should run successfully.
>   • Ensuring that default K8s backend works as expected with just... --with kubernetes
>
> The above tests with argo-workflows. And some basic smoke tests WITHOUT kubernetes or Argo.

@shrinandj Tests done, I've updated the description of this PR.

@shrinandj (Contributor) commented:

I just realized that this PR would be a great reference for implementing some of the other features for K8s support (e.g. volume support).

@odracci (Contributor, Author) commented Dec 19, 2022:

@shrinandj @savingoyal Can you please approve the GitHub workflow?

@savingoyal merged commit 97a5ea5 into Netflix:master on Dec 21, 2022.
@odracci deleted the support-for-kubernetes-tolerations branch on December 29, 2022.
@bbrandt commented Feb 27, 2023:
This seems to be the only documentation for Metaflow's Kubernetes tolerations support, so I'll add my note here.

To allow a Metaflow flow to run on an Azure Spot node pool in AKS, add this to your Metaflow config.json:

"METAFLOW_KUBERNETES_TOLERATIONS":"[{\"key\":\"kubernetes.azure.com/scalesetpriority\",\"value\":\"spot\",\"effect\":\"PreferNoSchedule\"},{\"key\":\"kubernetes.azure.com/scalesetpriority\",\"value\":\"spot\",\"effect\":\"NoSchedule\"}]"

This allows flows to prefer running on a spot instance but fall back to a more expensive node pool when a spot instance is not available.
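
If you'd rather not hand-escape the quotes, here is a small sketch (standard library only) that produces the escaped string from a plain Python list:

import json

tolerations = [
    {"key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "effect": "PreferNoSchedule"},
    {"key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "effect": "NoSchedule"},
]
# The inner dumps builds the JSON list; the outer dumps escapes it
# into a JSON string value ready to paste into config.json.
print(json.dumps(json.dumps(tolerations, separators=(",", ":"))))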
