Halt on provisioner errors #17359
Every provisioner that uses communicator implements its own retryFunc. Take the remote-exec implementation (since it's the most complete) and put it in the communicator package for each provisioner to use. Add a public interface `communicator.Fatal`, which can wrap an error to indicate a fatal error that should not be retried.
It's now in the communicator package
Fix a bug where the last error was not retrieved from errVal.Load due to an incorrect type assertion.
This will let the retry loop abort when there are errors which aren't going to ever be corrected.
There is no reason to retry around the execution of remote scripts. We've already established a connection, so the only thing that could happen here is to continually retry uploading or executing a script that can't succeed. This also simplifies the streaming output from the command, which doesn't need such explicit synchronization. Closing the output pipes is sufficient to stop the copyOutput functions, and they don't close around any values that are accessed again after the command executes.
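The point about closing the pipes can be illustrated with a small sketch. It is not the remote-exec implementation: it assumes an `io.Pipe` and a `bufio.Scanner`, and `runAndCollect` is a hypothetical helper that stands in for the remote command. The `sync.WaitGroup` is there only so the example can return the collected lines. The reader stops on its own once the writer side is closed.

```go
package main

import (
	"bufio"
	"io"
	"sync"
)

// copyOutput reads lines from r and hands each one to emit. It returns when
// r reports EOF, which happens once the write side of the pipe is closed.
func copyOutput(r io.Reader, emit func(string), wg *sync.WaitGroup) {
	defer wg.Done()
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		emit(sc.Text())
	}
}

// runAndCollect (hypothetical) simulates a command that writes lines to its
// stdout pipe, then closes the pipe to stop the copying goroutine.
func runAndCollect(lines []string) []string {
	pr, pw := io.Pipe()

	var out []string
	var wg sync.WaitGroup
	wg.Add(1)
	go copyOutput(pr, func(s string) { out = append(out, s) }, &wg)

	for _, l := range lines {
		if _, err := io.WriteString(pw, l+"\n"); err != nil {
			break
		}
	}

	pw.Close() // EOF on the read side ends the Scan loop; no extra signal needed
	wg.Wait()  // only so the collected output can be returned safely
	return out
}
```

A call such as `runAndCollect([]string{"step 1", "step 2"})` returns the same two lines. The copying goroutine needs no separate done channel because the closed pipe is the stop signal.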
There was a situation where this was used (perhaps abused). On bare-metal provisions, there is typically a live image boot and then you reboot to use the disk install. There was no way for remote-exec to "know" to wait until the reboot had occurred (usually you provision once the bits are installed to disk). A common trick was to have terraform remote-exec as a user that wouldn't exist until the disk install had occurred. Instead of a fatal error, a single |
How do we now reboot a VM and wait for it to come out of the reboot?
* Terraform v0.11.4 introduced changes to remote-exec that mean Typhoon bare-metal clusters require multiple runs of terraform apply to ssh and bootstrap.
* Bare-metal installs PXE boot a live instance to install to disk and then reboot from disk as controllers/workers. Terraform remote-exec has no way to "know" to wait until the reboot has occurred to kickoff Kubernetes bootstrap. Previously Typhoon created a "debug" user during this install phase to allow an admin to SSH, but remote-exec would hang, trying to connect as user "core". Terraform v0.11.4 changes this behavior so remote-exec fails and a user must re-run terraform apply until succeeding.
* A new way to "trick" remote-exec into waiting for the reboot into the disk install is to run SSH on a non-standard port during the disk install. This retains the ability for an admin to SSH during install (most distros don't have this) and fixes the issue so only a single run of terraform apply is needed.
* hashicorp/terraform#17359 (comment)
Previously, provisioners would always retry on error, even when the error was not recoverable, e.g. invalid credentials, file not found, etc. The provisioner then had to be interrupted, or left to run until the timeout, which defaults to 5 minutes. The remote-exec retryFunc was also not returning the stored error when it was finally interrupted.
First, this moves all the retryFunc implementations to the communicator package, using the most complete (remote-exec) version as the template. This new implementation also checks if the error implements the communicator.Fatal interface, as a signal to stop retrying. While in practice "retry-able" errors are more likely the exception, signaling off fatal errors is less likely to cause regressions in the providers, since not tagging an error as fatal will only continue the existing behavior.
The ssh communicator can now tag the handshake errors as fatal, since retrying will not ever fix the situation.
We can also remove the retry logic around the remote command portion of remote-exec, since errors related to locating, uploading, and executing scripts are not likely to ever resolve themselves.