
Halt on provisioner errors #17359

Merged
merged 8 commits into from
Feb 15, 2018
Conversation

jbardin (Member) commented Feb 15, 2018

Previously, provisioners would always retry on error, even when the error was not recoverable (e.g., invalid credentials or a missing file). They then had to be interrupted, or would wait out the timeout, which defaults to 5 minutes. The remote-exec retryFunc was also not returning the stored error when it was finally interrupted.

First, this moves all the retryFunc implementations into the communicator package, using the most complete version (remote-exec's) as the template. The new implementation also checks whether the error implements the communicator.Fatal interface, as a signal to stop retrying. While in practice retryable errors are more likely the exception, signaling on fatal errors is less likely to cause regressions in the providers, since not tagging an error as fatal simply preserves the existing behavior.

The ssh communicator can now tag handshake errors as fatal, since retrying will never fix the situation.

We can also remove the retry logic around the remote command portion of remote-exec, since errors related to locating, uploading, or executing scripts are unlikely ever to resolve themselves.

Every provisioner that uses communicator implements its own retryFunc.
Take the remote-exec implementation (since it's the most complete) and
put it in the communicator package for each provisioner to use.

Add a public interface `communicator.Fatal`, which can wrap an error to
indicate a fatal error that should not be retried.
It's now in the communicator package
Fix a bug where the last error was not retrieved from errVal.Load
due to an incorrect type assertion.
This will let the retry loop abort when there are errors which aren't
going to ever be corrected.
There's no reason to retry around the execution of remote scripts. We've already established a connection, so the only thing that could happen here is continually retrying to upload or execute a script that can't succeed.

This also simplifies the streaming output from the command, which doesn't need such explicit synchronization. Closing the output pipes is sufficient to stop the copyOutput functions, and they don't close over any values that are accessed again after the command executes.
@jbardin jbardin force-pushed the jbardin/provisioner-error branch from a5495b5 to 0345d96 Compare February 15, 2018 21:18
@jbardin jbardin merged commit 1c4f403 into master Feb 15, 2018
@jbardin jbardin deleted the jbardin/provisioner-error branch March 19, 2018 22:34
dghubble commented Mar 27, 2018

since retrying will not ever fix the situation

There was a situation where this was used (perhaps abused). In bare-metal provisioning, there is typically a live-image boot, followed by a reboot into the disk install. There was no way for remote-exec to "know" to wait until the reboot had occurred (usually you provision once the bits are installed to disk). A common trick was to have Terraform remote-exec connect as a user that wouldn't exist until the disk install had occurred. Instead of a fatal error, a single terraform apply run would kindly wait until auth succeeded (or timed out).

andrewsav-bt
How do we now reboot a VM and wait for it to come out of the reboot?

dghubble added a commit to poseidon/typhoon that referenced this pull request Apr 8, 2018
* Terraform v0.11.4 introduced changes to remote-exec
that mean Typhoon bare-metal clusters require multiple
runs of terraform apply to ssh and bootstrap.
* Bare-metal installs PXE boot a live instance to install
to disk and then reboot from disk as controllers/workers.
Terraform remote-exec has no way to "know" to wait until
the reboot has occurred to kickoff Kubernetes bootstrap.
Previously Typhoon created a "debug" user during this
install phase to allow an admin to SSH, but remote-exec
would hang, trying to connect as user "core". Terraform
v0.11.4 changes this behavior so remote-exec fails and
a user must re-run terraform apply until succeeding.
* A new way to "trick" remote-exec into waiting for the
reboot into the disk install is to run SSH on a non-standard
port during the disk install. This retains the ability
for an admin to SSH during install (most distros don't have
this) and fixes the issue so only a single run of terraform
apply is needed.
* hashicorp/terraform#17359 (comment)
dghubble-robot pushed a commit to poseidon/terraform-onprem-kubernetes that referenced this pull request Apr 8, 2018
ghost commented Apr 3, 2020

I'm going to lock this issue because it has been closed for 30 days. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 3, 2020