
Halt on provisioner errors #17359

Merged
merged 8 commits into from
Feb 15, 2018
Conversation

jbardin (Member) commented Feb 15, 2018

Previously, provisioners would always retry on error, even when the error was not recoverable (e.g., invalid credentials or a missing file). They then had to be interrupted, or would wait out the timeout, which defaults to 5 minutes. The remote-exec retryFunc was also not returning the stored error when it was finally interrupted.

First, this moves all the retryFunc implementations into the communicator package, using the most complete version (remote-exec's) as the template. The new implementation also checks whether the error implements the communicator.Fatal interface, as a signal to stop retrying. While in practice retryable errors are more likely the exception, signaling on fatal errors is less likely to cause regressions in the providers, since not tagging an error as fatal simply preserves the existing behavior.

The ssh communicator can now tag handshake errors as fatal, since retrying will never fix the situation.

We can also remove the retry logic around the remote command portion of remote-exec, since errors related to locating, uploading, or executing scripts are unlikely ever to resolve themselves.

Every provisioner that uses communicator implements its own retryFunc.
Take the remote-exec implementation (since it's the most complete) and
put it in the communicator package for each provisioner to use.

Add a public interface `communicator.Fatal`, which can wrap an error to
indicate a fatal error that should not be retried.
It's now in the communicator package
Fix a bug where the last error was not retrieved from errVal.Load
due to an incorrect type assertion.
This will let the retry loop abort when there are errors which aren't
going to ever be corrected.
There's no reason to retry around the execution of remote scripts. We've already established a connection, so the only thing that could happen here is continually retrying to upload or execute a script that can't succeed.

This also simplifies the streaming output from the command, which doesn't need such explicit synchronization. Closing the output pipes is sufficient to stop the copyOutput functions, and they don't close over any values that are accessed again after the command executes.
@jbardin jbardin force-pushed the jbardin/provisioner-error branch from a5495b5 to 0345d96 Compare February 15, 2018 21:18
@jbardin jbardin merged commit 1c4f403 into master Feb 15, 2018
@jbardin jbardin deleted the jbardin/provisioner-error branch March 19, 2018 22:34
dghubble commented Mar 27, 2018

since retrying will not ever fix the situation

There was a situation where this was used (perhaps abused). In bare-metal provisioning, there is typically a live-image boot, followed by a reboot into the disk install. There was no way for remote-exec to "know" to wait until the reboot had occurred (usually you provision once the bits are installed to disk). A common trick was to have Terraform remote-exec connect as a user that wouldn't exist until the disk install had occurred. Instead of a fatal error, a single terraform apply run would kindly wait until auth succeeded (or timed out).

andrewsav-bt
How do we now reboot a VM and wait for it to come out of the reboot?

dghubble added a commit to poseidon/typhoon that referenced this pull request Apr 8, 2018
* Terraform v0.11.4 introduced changes to remote-exec
that mean Typhoon bare-metal clusters require multiple
runs of terraform apply to ssh and bootstrap.
* Bare-metal installs PXE boot a live instance to install
to disk and then reboot from disk as controllers/workers.
Terraform remote-exec has no way to "know" to wait until
the reboot has occurred to kickoff Kubernetes bootstrap.
Previously Typhoon created a "debug" user during this
install phase to allow an admin to SSH, but remote-exec
would hang, trying to connect as user "core". Terraform
v0.11.4 changes this behavior so remote-exec fails and
a user must re-run terraform apply until succeeding.
* A new way to "trick" remote-exec into waiting for the
reboot into the disk install is to run SSH on a non-standard
port during the disk install. This retains the ability
for an admin to SSH during install (most distros don't have
this) and fixes the issue so only a single run of terraform
apply is needed.
* hashicorp/terraform#17359 (comment)
dghubble-robot pushed a commit to poseidon/terraform-onprem-kubernetes that referenced this pull request Apr 8, 2018
ghost commented Apr 3, 2020

I'm going to lock this issue because it has been closed for 30 days. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 3, 2020