[BUG] python task retries=n causes "Requests overridden" error log message for interruptible task #5455

Open
2 tasks done
tblom opened this issue Jun 7, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@tblom

tblom commented Jun 7, 2024

Describe the bug

With flytekit/flyte-binary 1.12.0, if a Python task is given the retries=n argument, an error is logged during execution:

Requests overridden on node dn1 (<FlyteLiteral(NodeMetadata) name: "<task name appears here>" retries { retries: 2 }>) without specifying limits. Requests are clamped to original limits.

More details, as described in Flyte Slack Community:

I am running interruptible workflows by setting the interruptible flag at the workflow level. This causes all tasks in the workflow to run as interruptible, even if they do not set the interruptible flag themselves, which is my intention.

This is working, but I see an error associated with each task called by this workflow, referring to the retries param for the task:

Requests overridden on node dn1 (<FlyteLiteral(NodeMetadata) name: "rf_v2_train_and_test_task" retries { retries: 2 }>) without specifying limits. Requests are clamped to original limits.

I have supplied the argument retries=2 to my tasks because I want to ensure that if the spot instance is reclaimed, the task is retried once more on a spot instance and, if that one is reclaimed as well, it finally runs on an on-demand instance. That is my understanding of these docs:

If you set retries=n, for instance, and the task gets preempted repeatedly, Flyte will retry on a preemptible/spot instance n-1 times and for the last attempt will retry your task on a non-spot (regular) instance. Please note that tasks will only be retried if at least one retry is allowed using the retries parameter in the task decorator.

I don't understand what the error message means or what I should be doing differently. A member of Union in the Flyte Slack Community suggested that this looks like a bug and that I should file an issue.

Expected behavior

I expect to not receive the logged error message.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@tblom tblom added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jun 7, 2024

runllm bot commented Jun 7, 2024

The error message you are encountering seems to be related to how Flyte handles retries for interruptible tasks. According to the Flyte documentation, setting the retries parameter ensures that tasks are retried on spot instances and eventually on non-spot instances if they continue to be preempted. However, the error message you are seeing suggests there might be an issue with how resource requests and limits are being managed when retries are specified.

Would you like more detail on how to configure retries for interruptible tasks or how to handle resource requests and limits in Flyte?

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Jun 13, 2024
@davidmirror-ops
Contributor

@tblom do you have a reproducible example?

@tblom
Author

tblom commented Aug 7, 2024

@davidmirror-ops

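from flytekit import workflow, task, Resources

# Task with user-specified retries
@task(retries=2)
def t1() -> int:
    return 1

# The workflow-level interruptible flag applies to every task in the workflow
@workflow(interruptible=True)
def my_wf() -> int:
    # Requests are overridden here without limits; the warning appears only with this override
    return t1().with_overrides(requests=Resources(cpu="100m", mem="100Mi"))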

Here is the output of running this locally with pyflyte:

$ pyflyte --verbose run flyte_repro_5455.py my_wf
Running Execution on local.
15:32:38.130059 WARNING  Requests overridden on node n0 (<FlyteLiteral(NodeMetadata) name: "t1" retries { retries: 2 }>) without specifying       node.py:151
                         limits. Requests are clamped to original limits.                                                                                    
1

Note that the issue only occurs when with_overrides is applied to the task call. I hadn't realized this in my original post. It's still not clear to me what this warning means.

@davidmirror-ops
Contributor

@tblom thanks for sharing. If you set retries=2, does everything work as expected on spot instances? It seems the docs referenced here have an open issue. I think there are two behaviors here:

  1. User-specified retries could end up not being respected over the system retry budget on spot instances (hypothesis). The warning message might somehow be indicating this.
  2. When using with_overrides, the warning message correctly indicates that Flyte makes requests=limits, and this is why it's better to only specify requests.
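To make the second point concrete, here is a minimal sketch of the condition the warning text itself describes: requests passed as an override while limits are left unset. It is not flytekit's code. `ResourceSpec` and `override_warning` are hypothetical names invented for illustration, and the sketch assumes the check looks only at the override arguments, which would explain why `retries { retries: 2 }` shows up merely as part of the printed node metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical stand-in for flytekit's Resources; not the real class."""
    cpu: Optional[str] = None
    mem: Optional[str] = None


def override_warning(
    node_id: str,
    requests: Optional[ResourceSpec],
    limits: Optional[ResourceSpec],
) -> Optional[str]:
    """Return the warning text when requests are overridden without limits, else None.

    Assumption: the condition depends only on the override arguments,
    not on retries or on the interruptible flag.
    """
    if requests is not None and limits is None:
        return (
            f"Requests overridden on node {node_id} without specifying limits. "
            "Requests are clamped to original limits."
        )
    return None


small = ResourceSpec(cpu="100m", mem="100Mi")

# Requests only, as in the repro above: the warning text is returned.
print(override_warning("n0", requests=small, limits=None))

# Requests and limits together: no warning under this model.
print(override_warning("n0", requests=small, limits=small))
```

Under this model the repro would log the same warning with or without retries=2. Whether passing limits alongside requests in with_overrides silences the real warning is an assumption that would need to be checked against flytekit.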
