Add support for Kubernetes tolerations #1207
Conversation
Looks great overall! Can you add some details about the testing you did with this change? It would make reviewing a little easier.
Some comments re: code organization. @shrinandj is doing a full review.
Testing[300] @ c21ede9
Testing[300] @ c21ede9 PASSED
The PR itself looks good to me. Can you confirm that at least the following scenarios have been tested:
The above tests with argo-workflows.
@shrinandj Thanks for the review. I managed to move the input validation into the decorator. I did some quick tests, and it looks good. Could you please review my latest commits? I will provide a test report based on your input in the following days.
I will look into these latest commits later tonight.
     for toleration in self.attributes["tolerations"]:
-        invalid_keys = [k for k in toleration.keys() if k not in V1Toleration.attribute_map.keys()]
+        invalid_keys = [k for k in toleration.keys() if k not in attribute_map]
Why did this have to change? As K8s changes, these attributes could change, right?
(As compared to a previous commit, where `V1Toleration.attribute_map.keys()` was getting used.)
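For context, the module-level `attribute_map` referenced in the diff above could be built as a fallback along these lines (a sketch with assumed names, not necessarily the PR's actual code):

    # Cache V1Toleration's valid attribute names at import time, falling back
    # to a hardcoded set when the kubernetes package is not installed.
    try:
        from kubernetes.client import V1Toleration
        attribute_map = set(V1Toleration.attribute_map.keys())
    except ImportError:
        attribute_map = {"key", "operator", "value", "effect", "toleration_seconds"}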
When it runs on Argo, the `kubernetes` module is unavailable. I can make the validation optional, given that it is required only when the flow runs locally, and `KubernetesClient` raises an exception if the module is not installed. WDYT?
    if self.attributes["tolerations"]:
        try:
            # Validate toleration keys against the Kubernetes SDK model.
            from kubernetes.client import V1Toleration

            for toleration in self.attributes["tolerations"]:
                invalid_keys = [
                    k
                    for k in toleration.keys()
                    if k not in V1Toleration.attribute_map.keys()
                ]
                if len(invalid_keys) > 0:
                    raise KubernetesException(
                        "Tolerations parameter contains invalid keys: %s"
                        % invalid_keys
                    )
        except (NameError, ImportError):
            # Skip validation when the kubernetes package is not installed
            # (e.g. when compiling the flow for Argo Workflows).
            pass
The above looks like a happy compromise to me in which we try to do our best to validate AND keep up with upstream changes in K8s.
@shrinandj I pushed the new code
LGTM! Great work!!
LGTM! Just one minor issue needs to be addressed. We should be good to merge and release right after.
@@ -166,6 +174,9 @@ def echo(msg, stream="stderr", job_id=None):
     stdout_location = ds.get_log_location(TASK_LOG_SOURCE, "stdout")
     stderr_location = ds.get_log_location(TASK_LOG_SOURCE, "stderr")

+    # `node_selector` is a tuple of strings, convert it to a dictionary
+    node_selector = KubernetesDecorator.parse_node_selector(node_selector)
is this needed anymore?
Yes, it is. At this stage, `node_selector` is a tuple of strings; `kubernetes.launch_job` expects a dictionary.
Any reason not to handle this parsing within `kubernetes_job`? The actual format is dictated by the Kubernetes SDK, and that's why currently all the Kubernetes-related formatting is happening within the KubernetesJob object. As the SDK evolves, any changes would be isolated to that object.
`node_selector` contains the value generated by

    @click.option(
        "--node-selector",
        multiple=True,
        default=None,
        help="NodeSelector for Kubernetes pod.",
    )

which is a tuple of strings, e.g.

    ("key=val", "foo=bar")

`kubernetes_job` expects a dictionary like

    {
        "key": "val",
        "foo": "bar",
    }

`parse_node_selector` converts the tuple of strings to a dictionary compatible with the Kubernetes SDK. Does it make sense?
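For reference, such a conversion could look like the following (a minimal sketch; the actual `parse_node_selector` implementation in the PR may differ):

    def parse_node_selector(node_selector):
        # Convert a tuple of "key=value" strings, as produced by the click
        # option above, into the dictionary the Kubernetes SDK expects,
        # e.g. ("app=cpu",) -> {"app": "cpu"}.
        parsed = {}
        for selector in node_selector or ():
            key, _, value = selector.partition("=")
            parsed[key.strip()] = value.strip()
        return parsed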
    # cased in kubernetes_client.py
    if self.attributes["tolerations"]:
        try:
            from kubernetes.client import V1Toleration
I understand that the rationale for including this check in `__init__` is to ensure that this check is invoked for argo-workflows too. However, this check will fail if the user hasn't installed the Python package `kubernetes` yet, which is checked in `package_init`; that check should technically happen before the check for tolerations.
It is checked in `kubernetes_cli`. The idea is that the check is invoked only if required, which means the Python package `kubernetes` must be installed. I think the order of execution doesn't matter: if `kubernetes` is not available and `self.attributes["tolerations"]` is not being used, then the check is not required. Does it make sense to you?
When the user pip installs Metaflow, we don't install the `kubernetes` Python package. It's only when the user starts executing a flow that involves `@kubernetes` or Argo that we throw a nice warning asking them to install the Python package. Now, if that first flow has `tolerations` defined, then the user will instead get an error saying no module named `kubernetes`.
That import is inside a `try` block with `except (NameError, ImportError): pass`. It should not raise any errors related to the missing module; is that correct?
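A self-contained illustration of the semantics in question, using a stand-in exception class: only the listed exception types are swallowed, while anything else raised inside the `try` still propagates.

    class KubernetesException(Exception):
        # Stand-in for Metaflow's exception class, for illustration only.
        pass

    def validate_missing_module():
        try:
            import does_not_exist_pkg  # the ImportError is silenced below
            raise KubernetesException("never reached")
        except (NameError, ImportError):
            pass  # only a missing module is swallowed

    def validate_module_present():
        try:
            import os  # succeeds, so validation proceeds
            raise KubernetesException("invalid keys")
        except (NameError, ImportError):
            pass  # no match: KubernetesException propagates

    validate_missing_module()  # completes silently

    try:
        validate_module_present()
    except KubernetesException as exc:
        print("caller sees:", exc)  # -> caller sees: invalid keys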
Yes, but it's the roundabout way this check is implemented which is my concern. We can ship this and come back to clean it up.
also, you might want to appease
@shrinandj @savingoyal I improved the error handling in a1711b6
@shrinandj Tests done, I've updated the description of this PR.
I just realized that this PR would be a great reference for implementing some of the other features for K8s support (e.g. volume support).
@shrinandj @savingoyal can you please approve the GitHub workflow?
This seems to be the only documentation for Metaflow's Kubernetes tolerations support, so I'll add my note here. To allow a Metaflow flow to run on an Azure Spot node pool in AKS, add this to your Metaflow config.json:

    "METAFLOW_KUBERNETES_TOLERATIONS": "[{\"key\":\"kubernetes.azure.com/scalesetpriority\",\"value\":\"spot\",\"effect\":\"PreferNoSchedule\"},{\"key\":\"kubernetes.azure.com/scalesetpriority\",\"value\":\"spot\",\"effect\":\"NoSchedule\"}]"

This will allow flows to prefer running on a spot instance, but fall back to a more expensive node pool when your spot instance is not available.
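For readability, the escaped value above decodes to the following JSON (same content, pretty-printed):

    [
        {
            "key": "kubernetes.azure.com/scalesetpriority",
            "value": "spot",
            "effect": "PreferNoSchedule"
        },
        {
            "key": "kubernetes.azure.com/scalesetpriority",
            "value": "spot",
            "effect": "NoSchedule"
        }
    ]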
This PR adds support for `tolerations` in the Kubernetes and Argo plugins.

Testing Done:

Using the script `flow.py`, tested the following commands:
python3 flow.py run --with kubernetes
python3 flow.py run --with kubernetes:node_selector=app=cpu
python3 flow.py run --with kubernetes:node_selector=app=cpu,tolerations='[{"key":"app","value":"cpu","effect":"NoSchedule"}]'
METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' python3 flow.py run --with kubernetes
METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' METAFLOW_KUBERNETES_NODE_SELECTOR='app=cpu' python3 flow.py run --with kubernetes
METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"app","value":"cpu","effect":"NoSchedule"}]' METAFLOW_KUBERNETES_NODE_SELECTOR='app=cpu' python3 flow.py argo-workflows create
python3 flow.py argo-workflows trigger
With the script `flow_decorator.py`:
python3 flow_decorator.py run
python3 flow_decorator.py argo-workflows create
python3 flow_decorator.py argo-workflows trigger
In all the executions, verified that the pods have the expected `Tolerations` and `nodeSelectors`.
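For reference, `flow_decorator.py` was along these lines (a hypothetical minimal sketch, assuming the decorator-level `tolerations` attribute added in this PR; the actual test script may differ):

    from metaflow import FlowSpec, kubernetes, step

    class ToleratedFlow(FlowSpec):
        # Pin the step to the tainted node pool and tolerate the matching taint.
        @kubernetes(
            node_selector="app=cpu",
            tolerations=[{"key": "app", "value": "cpu", "effect": "NoSchedule"}],
        )
        @step
        def start(self):
            print("Running on a node with the app=cpu taint")
            self.next(self.end)

        @step
        def end(self):
            print("Done")

    if __name__ == "__main__":
        ToleratedFlow()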