[BUG] python task retries=n causes "Requests overridden" error log message for interruptible task #5455

tblom · 2024-06-07T14:13:15Z

Describe the bug

With flytekit/flyte-binary 1.12.0, if a Python task is given the retries=n argument, an error is logged during execution:

Requests overridden on node dn1 (<FlyteLiteral(NodeMetadata) name: "<task name appears here>" retries { retries: 2 }>) without specifying limits. Requests are clamped to original limits.

More details, as described in Flyte Slack Community:

I am running interruptible workflows by setting the interruptible flag at the workflow level. This causes all tasks in the workflow to run as interruptible, even if they have no interruptible flag of their own, which is my intention.

This is working, but I see an error associated with each task called by this workflow, referring to the retries param for the task:

Requests overridden on node dn1 (<FlyteLiteral(NodeMetadata) name: "rf_v2_train_and_test_task" retries { retries: 2 }>) without specifying limits. Requests are clamped to original limits.

I have supplied the argument retries=2 to my tasks because I want to ensure that if the spot instance is reclaimed, the task is retried once more on a spot instance and, if it is reclaimed again, it finally runs on an on-demand instance. That is my understanding of these docs:

If you set retries=n, for instance, and the task gets preempted repeatedly, Flyte will retry on a preemptible/spot instance n-1 times and for the last attempt will retry your task on a non-spot (regular) instance. Please note that tasks will only be retried if at least one retry is allowed using the retries parameter in the task decorator.
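One way to read the quoted passage is as an attempt schedule. The sketch below is only an illustration of that reading in plain Python, not Flyte code: the function name attempt_schedule and the labels "spot" and "on-demand" are hypothetical, and it assumes the first attempt also runs on a spot instance.

```python
from typing import List


def attempt_schedule(retries: int) -> List[str]:
    """Hypothetical model of the quoted docs, not Flyte's implementation.

    Returns the instance type assumed for each attempt, where the first
    entry is the initial run and the remaining entries are retries.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    if retries == 0:
        # No retry allowed: a single attempt, never moved to on-demand.
        return ["spot"]
    # Initial attempt plus (retries - 1) retries on spot, then one last
    # retry on a regular (non-spot) instance.
    return ["spot"] * retries + ["on-demand"]


schedule = attempt_schedule(2)
print(schedule)
```

Under these assumptions, retries=2 gives two spot attempts followed by one on-demand attempt, which matches the behavior described in this report.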

I don't understand what the error message means or what I should be doing differently. A member of Union in the Flyte Slack Community suggested that this looks like a bug and that I should file an issue.

Expected behavior

I expect to not receive the logged error message.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes


runllm · 2024-06-07T14:13:35Z

The error message you are encountering seems to be related to how Flyte handles retries for interruptible tasks. According to the Flyte documentation, setting the retries parameter ensures that tasks are retried on spot instances and eventually on non-spot instances if they continue to be preempted. However, the error message you are seeing suggests there might be an issue with how resource requests and limits are being managed when retries are specified.

Would you like more detail on how to configure retries for interruptible tasks or how to handle resource requests and limits in Flyte?

Sources:

Flyte documentation on interruptible tasks


davidmirror-ops · 2024-08-05T21:49:23Z

@tblom do you have a reproducible example?

tblom · 2024-08-07T15:40:57Z

@davidmirror-ops

from flytekit import workflow, task, Resources

@task(retries=2)
def t1() -> int:
    return 1

@workflow(interruptible=True)
def my_wf() -> int:
    return t1().with_overrides(requests=Resources(cpu="100m", mem="100Mi"))

Here is the output of running this locally with pyflyte:

$ pyflyte --verbose run flyte_repro_5455.py my_wf
Running Execution on local.
15:32:38.130059 WARNING  Requests overridden on node n0 (<FlyteLiteral(NodeMetadata) name: "t1" retries { retries: 2 }>) without specifying limits. Requests are clamped to original limits.  node.py:151
1

Note that the issue described only occurs when with_overrides is used in running the task. I hadn't realized this in my original post. It's still not clear to me what this warning means.
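Going by the wording of the warning and the reproduction above, one plausible reading is that the message is triggered by overriding requests without limits, and that "retries { retries: 2 }" shows up only because the node metadata is printed inside the message. The sketch below models that reading in plain Python. It is a hypothetical stand-in, not flytekit's code: override_warning and its parameters are invented for illustration.

```python
from typing import Optional


def override_warning(
    node_id: str,
    metadata: str,
    requests: Optional[dict],
    limits: Optional[dict],
) -> Optional[str]:
    """Hypothetical stand-in for the check behind the warning.

    Assumption: the warning depends only on requests being overridden
    without limits. The metadata string is merely echoed in the message.
    """
    if requests is not None and limits is None:
        return (
            f"Requests overridden on node {node_id} ({metadata}) "
            "without specifying limits. "
            "Requests are clamped to original limits."
        )
    return None


req = {"cpu": "100m", "mem": "100Mi"}

# Same override with and without retries in the metadata: both warn.
with_retries = override_warning("n0", 'name: "t1" retries { retries: 2 }', req, None)
without_retries = override_warning("n0", 'name: "t1"', req, None)

# Supplying limits alongside requests: no warning in this model.
with_limits = override_warning("n0", 'name: "t1" retries { retries: 2 }', req, dict(req))

print(with_retries)
print(without_retries)
print(with_limits)
```

If this reading holds, passing limits together with requests to with_overrides would be the thing to try in order to silence the warning. That is an assumption to verify against the flytekit version in use.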

davidmirror-ops · 2024-08-21T21:47:10Z

@tblom thanks for sharing. So if you set retries=2, does it all work as expected on spot instances? It seems the docs referenced here have an open issue. I think there are two behaviors here:

1. User-specified retries might not be respected over the system retry budget on spot instances (hypothesis). The warning message could somehow be indicating this.
2. When using with_overrides, the warning message correctly indicates that Flyte makes requests=limits, and this is why it's better to only specify requests.

tblom added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels Jun 7, 2024

eapolinario removed the untriaged label Jun 13, 2024

eapolinario assigned davidmirror-ops Jun 13, 2024
