
Add ability to provide a soft timeout-minutes #953

Open
awoimbee opened this issue May 29, 2023 · 2 comments
Labels: area/cicd, kind/enhancement (Improvements or new features)

Comments

@awoimbee

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

I use pulumi-kubernetes, and pulumi can get stuck for a while waiting on a deployment that can never become ready (e.g. a missing container image).
A normal workflow run takes ~3 minutes at most; a broken one takes >20 minutes.
I want to put a limit on the time spent in pulumi. I use timeout-minutes, but it leads to a broken pulumi state (because of the hard termination).

-> Please add a way to set a soft limit, where pulumi cancels all operations and terminates gracefully.

Affected area/feature

The action itself; it could just send a SIGINT.
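
For reference, here is a rough sketch of what a workaround for "just send a SIGINT" could look like today, assuming the CLI is called from a plain run step (instead of the action) so that coreutils `timeout` can deliver the signal; the 15m/20m values and the stack name are only illustrative:

```yaml
jobs:
  up:
    runs-on: ubuntu-latest
    timeout-minutes: 20                  # hard limit: GitHub force-cancels the job
    steps:
      - uses: actions/checkout@v3
      # ...install and authenticate the Pulumi CLI here...
      - name: Pulumi up with a soft timeout
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
        run: |
          # Send SIGINT after 15 minutes so pulumi can cancel pending operations
          # and exit cleanly; escalate to SIGKILL 2 minutes later if it hangs.
          timeout --signal=INT --kill-after=2m 15m pulumi up --yes --stack dev
```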

@awoimbee awoimbee added kind/enhancement Improvements or new features needs-triage Needs attention from the triage team labels May 29, 2023
@RobbieMcKinstry RobbieMcKinstry added area/cicd and removed needs-triage Needs attention from the triage team labels May 31, 2023
@RobbieMcKinstry
Contributor

Hi @awoimbee, thanks for opening this issue. I think this problem in particular is difficult because there are many places where timeouts can occur: e.g. at the level of the kubelet, at the level of the HTTP request from the provider, from the Pulumi engine to the provider, and for the Pulumi program itself. I'm reading your feature request as a timeout for the Pulumi program itself.

I have a few thoughts; please let me know if they're helpful for you.

  1. I didn't immediately arrive at this, but I remembered that providers like Helm and Kubernetes provide a timeout in their configuration. Here's a timeout for Kubernetes (a related per-resource await timeout is sketched just after this list). This setting is for slow HTTP requests, so it might not handle ImagePull failures particularly well, however...
  2. ...you might want to decrease the time it takes to fail by adjusting the runtimeRequestTimeout config value for kubelets, which controls how long they wait on image pulls. That would allow Kubernetes to detect the failure sooner and report it to the provider.
  3. We could add a soft timeout, but I would expect that feature would first be added to the Pulumi CLI, not to the Pulumi Action. I would imagine we'd try to add a CLI flag for --timeout, then add that flag to the Automation API, before finally exposing that flag in this Action.
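
(Not the exact setting linked above, but a related per-resource knob worth sketching here: pulumi-kubernetes also honors a pulumi.com/timeoutSeconds annotation that bounds how long it awaits a resource becoming ready. The names and the 300-second value below are only illustrative.)

```yaml
# Sketch only: the annotation overrides the provider's default await timeout for
# this resource, so a Deployment stuck on e.g. ImagePullBackOff fails the update
# after ~5 minutes instead of the default wait.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                              # illustrative name
  annotations:
    pulumi.com/timeoutSeconds: "300"
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:latest # illustrative image
```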

IMO adding a timeout to the engine is rather dangerous for hard timeouts, and not necessarily useful for soft timeouts. Suppose a program creates a Kubernetes cluster on AWS; this operation can take 10-15 minutes. For hard cancellations, if the engine cancels the operation while the provider is midway through performing it, then the statefile will have a false negative: the infrastructure will be created, but the statefile will report that it failed and try to create it again. For soft cancellations like you suggested, the engine has to wait for the provider to finish ongoing operations anyway, so it will wait the full 15 minutes regardless of the timeout. Therefore, I'm not sure how helpful a --timeout flag would be for users in general. For Kubernetes it definitely makes sense, but I'm not sure how helpful it would be for other classes of use.

Putting my operator hat on, I think if I were in your shoes I'd try to detect image pull failures sooner by decreasing the timeout at the kubelet level (adjusting runtimeRequestTimeout).
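
If you want to experiment with that, a minimal KubeletConfiguration sketch follows; the 4m value is only an example, and it is worth verifying against your cluster whether this field covers image pulls, since the Kubernetes docs describe it as the timeout for runtime requests other than long-running ones.

```yaml
# Sketch only: runtimeRequestTimeout lives in the kubelet's config file
# (kubelet.config.k8s.io/v1beta1); the value here is illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "4m"
```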

@awoimbee
Author

awoimbee commented Jun 1, 2023

Hi, thanks for the detailed response!

I'm reading your feature request as a timeout for the Pulumi program itself.

Yes indeed, keep it simple!

  1. [...] I remembered providers like Helm and Kubernetes provide a timeout in their configuration
  2. [...] adjusting the runtimeRequestTimeout config value for kubelets

These are for slow requests and won't work in my case.
They would not change how long pulumi waits on a "pending" deployment (one that will never become ready because of ErrImagePull, a volume that cannot be mounted, or...).

IMO adding a timeout for the engine is rather dangerous for hard timeouts

Yes, and it's not necessary since we already have the timeout-minutes feature of GitHub Actions. It's also what is causing my problem (broken state)... but I still use it because I don't want to risk jobs being stuck for hours (because of a bug or something else).

For soft cancellations like you suggested, the engine will have to wait for the provider to finish ongoing operations anyway, [...] Therefore, I'm not sure how helpful [...] For Kubernetes, it definitely makes sense, but I'm not sure how helpful it would be for other classes of use.

Yes, my problem is definitely k8s-related, and a change in the pulumi k8s provider might also be able to solve it.
But the broader issue is that whenever pulumi takes more time than usual, it risks hitting the hard timeout. I would like the peace of mind of, e.g., a soft timeout at ~15 minutes and a hard one at ~20 minutes.
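
To make the shape of the request concrete, a purely hypothetical sketch (the soft-timeout-minutes input does not exist in pulumi/actions today; checkout and setup steps are omitted):

```yaml
jobs:
  up:
    runs-on: ubuntu-latest
    timeout-minutes: 20               # existing hard limit: GitHub kills the job
    steps:
      - uses: pulumi/actions@v4
        with:
          command: up
          stack-name: dev
          soft-timeout-minutes: 15    # hypothetical: the action would send pulumi a SIGINT
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
```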
