
Can no longer reboot and continue. #17844

Open

AndrewSav opened this issue Apr 11, 2018 · 47 comments

Comments

@AndrewSav

AndrewSav commented Apr 11, 2018

Hi there,
Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.

I posted on the Google Group and did not get any response. The gitter chat is full of questions and no answers.

In the absence of another avenue to get a question answered, I'm posting it here.

How do I reboot-and-continue with Terraform? In version 0.11.3 it was possible to issue a reboot command in the shell provisioner, and when the machine came back up from the reboot, the next provisioner in the file would reconnect and continue.

Since 0.11.4 this no longer works. When the machine starts to reboot, Terraform errors out and provisioning stops.

How is this supposed to work when set up correctly?

@apparentlymart
Contributor

Hi @AndrewSav!

In Terraform 0.11.4 there was a change to try to make Terraform detect and report certain error conditions, rather than retrying indefinitely. Unfortunately this change was found to be a little too sensitive, so e.g. if sshd starts up before the authorized_keys file has been populated by cloud-init then Terraform would fail with an authentication error, rather than retrying. I think this may be the root cause of your problem here.

In 0.11.6 (#17744) this behavior was refined to treat authentication errors as retryable to support situations where sshd is running before credentials are fully populated. Could you try this with version 0.11.6 or later and see if that fixes the problem for you?

@jwadolowski

jwadolowski commented Apr 17, 2018

I just encountered a similar problem. A reboot needs to be triggered during the initial setup of an EC2 instance. To do that I'm using remote-exec inside a null_resource:

resource "null_resource" "yum-update" {
  triggers {
    instance_id = "${aws_instance.webapp.id}"
  }

  connection = {
    type         = "ssh"
    user         = "${var.ssh_user}"
    host         = "${aws_instance.webapp.private_ip}"
    private_key  = "${file(var.ssh_key_path)}"
    bastion_host = "${var.ssh_use_bastion == true ? var.ssh_bastion_host : ""}"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo reboot",
    ]
  }

  depends_on = [
    "aws_volume_attachment.webapp-ebs-att",
  ]
}

Terraform fails with the following message:

Error: Error applying plan:

1 error(s) occurred:

* module.xyz.null_resource.yum-update: error executing "/tmp/terraform_1226926016.sh": wait: remote command exited without exit status or exit signal

It's definitely not related to the authorized_keys race condition, as yum update -y executed without issues. Exactly the same code worked just fine with previous Terraform versions.

Terraform version:

$ terraform -v
Terraform v0.11.7
+ provider.aws v1.14.1
+ provider.null v1.0.0
+ provider.template v1.0.0

@apparentlymart
Contributor

It looks like in cfa299d we upgraded our vendored version of the Go SSH library to a newer version that added that error message, but that went out in v0.8.5 (over a year ago) and so cannot be the culprit for a recently-introduced issue.

The error seems to indicate that the SSH server closed the connection without reporting the result of the command, as described in RFC 4254 section 6.10, which I suppose could make sense if the sshd process were killed before reboot returned. I assume that prior to Terraform v0.11.4 this error was still occurring but being silently ignored.

The tricky thing here is that arguably the new behavior is more correct since the SSH execution is failing (it's not completing fully) and so therefore Terraform should not proceed and assume the instance is fully provisioned in this case... there are other reasons why the connection might be shut down that would not be safe to continue.

Perhaps we can make a compromise here and add an option to the provisioner to treat this particular situation as a success, for situations where either the SSH server is being restarted or the system itself is being shut down. I'm not sure what the best way to describe that situation is to make an intuitive option, though: allow_missing_exit_status is the most directly descriptive, but it doesn't really get at the intent, so if we went with that option I suppose configuration authors would need to annotate it with a comment explaining why:

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo reboot",
    ]

    # sshd may exit before "sudo reboot" completes, preventing it from
    # returning the script's exit status.
    allow_missing_exit_status = true
  }

@lamont

lamont commented Apr 17, 2018

Adding an allow_missing_exit_status = true feature would work for me. I'm perfectly prepared to admit that rebooting during provisioning is weird and to call it out with a flag and a comment. As it is now, I'm falling back to tf 0.11.3 to keep working because some of my fleet depends on the reboot before the next provisioner can continue. Thanks for looking at it.

@apparentlymart
Contributor

The Terraform team at HashiCorp won't be able to work on this in the near future due to our focus being elsewhere, but we'd be happy to review a pull request if someone has the time and motivation to implement it.

Otherwise, we should be able to take a look at it once we've completed some other work in progress on the configuration language, which is likely to be at least a few months away.

I'm sorry for this unintended change in behavior. As an alternative to staying on 0.11.3, it might be possible to arrange for a necessary reboot operation to happen asynchronously so that the provisioner is able to complete successfully before it begins. For example, perhaps using the shutdown command with a non-now time would do the trick. If there are other subsequent provisioning steps it may be necessary to take some additional steps to ensure that the next provisioner won't connect before the reboot begins, such as revoking the authorized SSH key with some mechanism to re-install it after the reboot has completed.
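
For illustration only, a rough sketch of that idea (the one-minute delay and the follow-up sleep are arbitrary values I've picked here, not a tested recipe):

  provisioner "remote-exec" {
    inline = [
      # schedule the reboot a minute out so this command can still
      # return an exit status before sshd goes away
      "sudo shutdown -r +1",
    ]
  }

  # crude buffer so a later provisioner doesn't reconnect before the
  # reboot has actually started
  provisioner "local-exec" {
    command = "sleep 90"
  }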

@AndrewSav
Author

@apparentlymart apologies, I'm on holiday until 26th of April and don't have access to the required infrastructure to test this until then. I'll make sure to test and report back when I've returned from holiday.

@haxorof

haxorof commented Apr 18, 2018

I also encountered this problem when I wanted to trigger a reboot in a null_resource. It helped to just add &, so now it looks like this:

provisioner "remote-exec" {
  inline = [
    "sudo reboot &",
  ]
}

I haven't completely verified that it works all the time, but so far it has.

@AndrewSav
Author

@haxorof it probably depends on the flavor of Linux. For whatever reason it did not work from Terraform with RancherOS for me (it did not cause a reboot), although from the command line it of course works. So I still think it's affected by the Terraform interaction.

@apparentlymart
Contributor

I think the & solution for backgrounding might be a little tricky because the sudo process still remains attached to the shell while it's running, and so sshd shutting down may also send a signal to sudo, and thus in turn to reboot, and so kill it before it gets a chance to complete.

My thought about using the shutdown command above is that it's implemented in a way where the actual shutdown is managed by a background process, and so the shutdown command completes immediately, allowing the shell to exit before the shutdown begins. In the case of a systemd system, for example, I believe (IIRC) that a timed shutdown is handled by sending a message to logind, which then itself coordinates the shutdown. Since logind is a system daemon, it is not connected to your SSH session.

@lamont

lamont commented Apr 24, 2018

Just a followup: we implemented the suggestion from @haxorof (reboot &) and it has worked perfectly on Ubuntu 16.04 so far. I was going to use shutdown -r +1 plus a local-exec sleep 60 but was bummed that I'd be adding a minute to every instance creation. If I could pass a sub-minute timeout to shutdown I'd have done that, but until then I'll stick with the backgrounded reboot until we run into issues with it.

@haxorof

haxorof commented Apr 24, 2018

@AndrewSav: Yes, you are right. I tested on Ubuntu 17.10 and have now tried it on FreeBSD. It seems that the reboot & workaround does not work with the FreeBSD version I tried.

@jbardin
Member

jbardin commented Apr 26, 2018

Rather than using a time argument to shutdown, you could delay the reboot in a subshell.

(sleep 2 && reboot)&
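
Inside a provisioner that might look like this (just a sketch; the two-second delay is arbitrary, and sudo is an assumption depending on your connection user):

provisioner "remote-exec" {
  inline = [
    # run the reboot in a detached, delayed subshell so the inline
    # command itself exits cleanly and reports an exit status
    "(sleep 2 && sudo reboot) &",
  ]
}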

@andrewsav-bt

andrewsav-bt commented May 3, 2018

@apparentlymart do you think a "remote-reboot" provisioner is appropriate?
@jbardin - wow, thank you so much! That actually worked for me! I'm guessing that in the presence of a workable workaround this is less of an issue now.

Guys would you like me to close the issue?

@jbardin
Member

jbardin commented May 4, 2018

I think since this seems to be a common enough issue for users, we should consider making it part of the provisioner itself. I don't think we need another provisioner altogether, since this is just a special case of remote-exec. Having a special field like shutdown_command would be fairly easy to add, and that command could just ignore a connection failure after execution.
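
Hypothetically (shutdown_command does not exist today; this is only a sketch of the idea), that might look like:

provisioner "remote-exec" {
  inline = [
    "sudo yum update -y",
  ]

  # hypothetical argument: run last, and treat a dropped connection or
  # missing exit status as success, since the command is expected to
  # take the SSH server down
  shutdown_command = "sudo reboot"
}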

@pasikarkkainen

Hitting this issue as well. As a temporary workaround, this seems to work for me (as mentioned earlier by others):

(sleep 5 && reboot)&

@karl-barbour

karl-barbour commented Aug 23, 2018

The above background reboots don't appear to be working for me on Ubuntu 18.04.

Any news on this as a provisioner feature, similar to Packer's windows-restart provisioner? https://www.packer.io/docs/provisioners/windows-restart.html

EDIT:

I'm using the following workaround (a local-exec provisioner):

  provisioner "local-exec" {
    command = "ssh -o 'StrictHostKeyChecking no' -i ${var.pem_file_path} root@${digitalocean_droplet.web.ipv4_address} '(sleep 2; reboot)&'"
  }

@GMZwinge

A similar issue exists on Windows with WinRM. A workaround that works for us is a remote-exec provisioner like this:

  provisioner "remote-exec" {
    inline = [
      "shutdown /r /t 5",
      "net stop WinRM",
    ]
    ...
  }

The first command schedules the reboot a few seconds later. That avoids the shutdown sometimes killing the net stop WinRM command. The second command makes sure that the next provisioner doesn't connect while the machine is shutting down and then fail; that can happen sometimes even without a shutdown delay (shutdown /r /t 0). A separate remote-exec provisioner ensures that the output of the previous remote-exec provisioner is flushed.
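
For example, the split into separate provisioners could look roughly like this (a sketch only; connection details are omitted and the earlier command is a placeholder):

  # main provisioning commands in their own provisioner so their
  # output is fully flushed before the reboot provisioner runs
  provisioner "remote-exec" {
    inline = [
      "echo placeholder-for-real-provisioning-steps",
    ]
  }

  provisioner "remote-exec" {
    inline = [
      "shutdown /r /t 5",
      "net stop WinRM",
    ]
  }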

@chakatz

chakatz commented Dec 12, 2018

This did not work for me with remote-exec:
"(sleep 2 && sudo reboot)&",
It didn't cause an error but it also didn't actually do a reboot.

So instead I tried this and it is working fine and would of course work for any OS.

  provisioner "local-exec" {
    command = "aws ec2 reboot-instances --instance-ids ${self.id}"
  }

@mohamaa

mohamaa commented Dec 13, 2018

@chakatz Nice workaround, though Terraform should handle any reboot in the middle of a Terraform run.
I am using v0.11.10 now and still see the same issue.

Terraform team, please provide a solution to this as soon as possible.

@frafra

frafra commented Dec 15, 2018

Alternative workaround: shutdown -r +0
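
In a provisioner that would look like this (a minimal sketch, assuming a root connection user; add sudo otherwise):

provisioner "remote-exec" {
  inline = [
    # hands the reboot off to the init system and returns immediately,
    # so the command still reports an exit status to Terraform
    "shutdown -r +0",
  ]
}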

@AndrewSav
Author

@frafra did you try it yourself? Because that's exactly what's not working.

@frafra

frafra commented Dec 15, 2018

@AndrewSav yes, sure, but this is a different syntax, and it works just fine for me, while reboot, systemctl reboot, and (sleep 3 && reboot) & do not. shutdown -r +0 still exits before restarting, so Terraform does not halt.

Here is my script: https://github.com/frafra/fedora-atomic-hetzner/blob/master/fedora-atomic-hetzner.tf

@djoos
Contributor

djoos commented Jan 9, 2019

Hi all,

after some frustration, it seems I'm able to run with Terraform 0.11.11, but it definitely feels hacky; I have 1 null_resource with 3 provisioners (FYI: Windows instance provisioning):

provisioner "chef"  {
  # handles pre-reboot config mngmt; completes cleanly; schedules a delayed reboot
}

# see https://github.com/hashicorp/terraform/issues/17844#issuecomment-422960337 (above)
# `[remote-exec]: error during provision, continue requested` (see "on_failure" below)
provisioner "remote-exec" {
  inline = [
    "shutdown /r /f /t 5 /c "forced reboot",
    "net stop WinRM"
  ]
  # Terraform > v0.11.3 will fail if the provisioner doesn't report the exit status, but here we'll explicitly allow failure
  on_failure = "continue"
}

provisioner "chef"  {
  # handles post-reboot config mngmt
}

More advanced testing still in progress, but initial tests seem fine...

I guess in an ideal scenario, I'd like the Chef run to exit with code 35 or 37, and the Terraform Chef provisioner to allow that to happen, reconnect, and then pick up and complete the provisioning.
Perhaps not dissimilar to kitchen, which uses retry_on_exit_code (an array of exit codes indicating that kitchen should retry the converge command) and max_retries (the number of times to retry the converge before passing along the failed status).
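
Purely as a sketch of the idea (neither argument exists on the Terraform chef provisioner today; the values are illustrative):

provisioner "chef" {
  # ... existing chef provisioner arguments ...

  # hypothetical: reconnect and re-run the converge when Chef signals
  # a reboot-and-continue via these exit codes
  retry_on_exit_code = [35, 37]
  max_retries        = 2
}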

Happy to get stuck in with a few more pointers on the Terraform internals - thanks in advance for your feedback!

clayshek added a commit to clayshek/terraform-raspberrypi-bootstrap that referenced this issue Feb 8, 2019
Changing 'sudo reboot' to 'sudo shutdown -r +0' to address exit status issue encountered after Terraform 0.11.3, see hashicorp/terraform#17844
@palfaiate

  provisioner "remote-exec" {
    when = "create"

    inline = [
      "sudo shutdown -r +60",
      "echo 0",
    ]
  }

@ghost

ghost commented Apr 30, 2019

If anyone is fighting with this on Linux (connection actively refused error) I've written a little PowerShell/Bash combo that should cover Terraform running on both Windows and Linux: https://gist.github.com/janoszen/9df88ba0b906af1c18c0812a7128af7a

@andrewsav-bt

@frafra hm... there is no mention of shutdown in that script you linked.

ereslibre pushed a commit to SUSE/skuba that referenced this issue Jun 13, 2019
Provisioner waits for exit code from shutdown command and fails because
reboot is performed too fast.

Fixes bsc#1135937
Upstream issue: hashicorp/terraform#17844
@frafra

frafra commented Jun 13, 2019

@frafra hm... there is no mention of shutdown in that script you linked.

I moved the commands into a shell script that gets executed by TF; it is in the same repository :-)

@invidian
Contributor

Shameless plug here, but maybe it actually helps someone get a reasonable workaround for this issue. I created a TF provider which is able to execute a command but ignore the result for this purpose, and I don't have any problems with reboots now. The configuration is limited, but it can be easily extended. Also, Windows is not supported.

https://github.com/invidian/terraform-provider-sshcommand

@brunotm

brunotm commented Jul 23, 2019

I have done a quick implementation of allow_missing_exit_status at #22180, as described by @apparentlymart, to handle this case, tested on both Linux and Windows systems.

I'm not totally sold on this versus something more general like "ignore_errors" that would allow more use cases and weird stuff.

@AndrewSav
Author

Today I got a panic in Terraform on VM reboot (Terraform version: 0.12.3)

2019-07-26T07:06:02.722+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [ERROR] scp stderr: "Sink: C0644 32 terraform_1671735816.sh\n"
2019-07-26T07:06:02.722+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.725+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting remote command: chmod 0777 /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.731+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] remote command exited with '0': chmod 0777 /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.732+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.734+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting remote command: /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.759+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] remote command exited with '0': /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.760+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.760+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Starting remote scp process:  scp -vt /tmp
2019-07-26T07:06:02.763+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Started SCP session, beginning transfers...
2019-07-26T07:06:02.763+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Copying input data into temporary file so we can read the length
2019-07-26T07:06:02.764+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Beginning file upload...
2019-07-26T07:06:02.768+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] SCP session complete, closing stdin pipe.
2019-07-26T07:06:02.768+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Waiting for SSH session to complete.
2019-07-26T07:06:02.769+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [ERROR] scp stderr: "Sink: C0644 0 terraform_1671735816.sh\n"
2019/07/26 07:06:02 [TRACE] EvalApplyProvisioners: provisioning module.node.vsphere_virtual_machine.machine with "remote-exec"
2019/07/26 07:06:02 [TRACE] GetResourceInstance: vsphere_virtual_machine.machine is a single instance
2019-07-26T07:06:02.771+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] connecting to TCP connection for SSH
2019-07-26T07:06:02.772+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] handshaking with SSH
2019-07-26T07:06:02.849+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting ssh KeepAlives
2019-07-26T07:06:02.849+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:03.137+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:03 [WARN] ssh session open error: 'ssh: unexpected packet in response to channel open: <nil>', attempting reconnect
2019-07-26T07:06:03.137+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:03 [DEBUG] connecting to TCP connection for SSH
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: panic: runtime error: invalid memory address or nil pointer dereference
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: [signal 0xc0000005 code=0x0 addr=0x0 pc=0x17a8b7c]
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: 
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: goroutine 258 [running]:
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: github.com/hashicorp/terraform/communicator/ssh.(*Communicator).Connect.func1(0xc000180b40, 0x223fe40, 0xc000519300)
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: 	/opt/teamcity-agent/work/9e329aa031982669/src/github.com/hashicorp/terraform/communicator/ssh/communicator.go:235 +0x12c
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: created by github.com/hashicorp/terraform/communicator/ssh.(*Communicator).Connect
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: 	/opt/teamcity-agent/work/9e329aa031982669/src/github.com/hashicorp/terraform/communicator/ssh/communicator.go:227 +0x519
2019/07/26 07:06:04 [WARN] Errors while provisioning vsphere_virtual_machine.machine with "remote-exec", so aborting
2019/07/26 07:06:04 [TRACE] EvalApplyProvisioners: module.node.vsphere_virtual_machine.machine provisioning failed, but we will continue anyway at the caller's request
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalMaybeTainted
2019/07/26 07:06:04 [TRACE] EvalMaybeTainted: module.node.vsphere_virtual_machine.machine encountered an error during creation, so it is now marked as tainted
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalWriteState
2019/07/26 07:06:04 [TRACE] EvalWriteState: writing current state object for module.node.vsphere_virtual_machine.machine
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalIf
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalIf
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalWriteDiff
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalApplyPost
2019/07/26 07:06:04 [ERROR] module.node: eval: *terraform.EvalApplyPost, err: 1 error occurred:
	* rpc error: code = Unavailable desc = transport is closing

2019/07/26 07:06:04 [ERROR] module.node: eval: *terraform.EvalSequence, err: rpc error: code = Unavailable desc = transport is closing
2019/07/26 07:06:04 [TRACE] [walkApply] Exiting eval tree: module.node.vsphere_virtual_machine.machine
2019/07/26 07:06:04 [TRACE] vertex "module.node.vsphere_virtual_machine.machine": visit complete
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provisioner.file (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provider.vsphere (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provisioner.remote-exec (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "root" errored, so skipping
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: reading latest snapshot from terraform.tfstate
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: snapshot file has nil snapshot, but that's okay
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: read nil snapshot
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: no original state snapshot to back up
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 1
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: unlocked by closing terraform.tfstate
2019-07-26T07:06:04.870+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=20112 error="exit status 2"
2019-07-26T07:06:04.870+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=25320
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=24520
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.889+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=19932
2019-07-26T07:06:04.889+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.891+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=18572
2019-07-26T07:06:04.891+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.892+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=16888
2019-07-26T07:06:04.892+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.893+1200 [DEBUG] plugin: plugin process exited: path=E:\Sources\docker_ops\terraform\instances\t-ap-test-01\.terraform\plugins\windows_amd64\terraform-provider-vsphere_v1.12.0_x4.exe pid=27036
2019-07-26T07:06:04.893+1200 [DEBUG] plugin: plugin exited

@brunotm

brunotm commented Jul 25, 2019

Today I got a panic in Terraform on VM reboot (Terraform version: 0.12.3)

@AndrewSav it does not look related, but could you try the change in #22180?

@frafra

frafra commented Oct 9, 2019

New proposed solution:

  provisioner "remote-exec" {
    inline = [ "reboot" ]
    on_failure = "continue"
    connection { host = self.ipv4_address }
  }

alekseytols90 pushed a commit to alekseytols90/skuba that referenced this issue Mar 31, 2020
Provisioner waits for exit code from shutdown command and fails because
reboot is performed too fast.

Fixes bsc#1135937
Upstream issue: hashicorp/terraform#17844
@AndrewSav
Author

@frafra for what it's worth, I'm still getting connection errors intermittently even with on_failure = "continue", with the next provisioner not being able to execute.

@moqmar

moqmar commented Jun 1, 2020

I found systemctl reboot to work fine, while reboot throws an error.

@AndrewSav
Author

The problem is that it's a race. You change something, the timing shifts slightly, it works once, and you think you fixed it, but it keeps failing intermittently.

@roshanp85

allow_missing_exit_status

Is this available in Terraform 0.12.24? I am running into this error: An argument named "allow_missing_exit_status" is not expected here. I am using provider null 2.1.2.

@AndrewSav
Author

@roshanp85 no.

@mysticaltech

Hey folks, just to confirm that rebooting with shutdown -r +0 at the end of a remote-exec block does work! Look no further, that is your solution! Thanks again @frafra 🙏

@AndrewSav
Author

@mysticaltech as I explained above, this is a race, sometimes it works, sometimes it does not. We need a stable solution that always works.

@mysticaltech

mysticaltech commented Jul 10, 2021

Thanks for the clarification @AndrewSav, so at least it seems to be winning that race more often than not. But maybe giving it some buffer time, to allow the node to "calm down", would maximize the chances of winning, for instance adding a sleep 10 before it. Either way, as you said, we need a stable solution.
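
Something like this, I imagine (just a sketch; the 10-second value is a guess):

  provisioner "remote-exec" {
    inline = [
      # give the node a moment to settle before asking for the reboot
      "sleep 10",
      "sudo shutdown -r +0",
    ]
  }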

And IMHO, it should not be within the remote-exec provisioner, but be a different one, specifically built to handle that scenario and the remote behavior that results from it (it would be a lot simpler to achieve that way, as the intent and expected outcome would be clear).

Ansible does it well, Terraform should already have a solution for that, it's long overdue! Of course, instances need to reboot, especially after upgrades! I wish I knew Golang, it should be pretty quick, just copying the remote-exec provisioner code and modifying it a little, I would imagine.

@MJSanfelippo

Are there any updates on this? It feels like something that Terraform should have.

@mysticaltech

Yes, not only that, it doesn't seem like sorcery to implement!

@AndrewSav
Author

@mysticaltech cool, I'm glad it looks easy for you, I hope to see your PR implementing it soon!

@mysticaltech

@AndrewSav I'm like a true TF newbie, so probably not suited for this. But I am sure the folks at @hashicorp-cloud can make this happen in the blink of an eye!

@tomchomiak

Bump

@zcemycl

zcemycl commented May 24, 2022

Refer to this solved issue https://github.com/hashicorp/terraform/issues/18517#issue-343471291: you have to change your SSH settings to allow this. For example, say there are 3 bash scripts with 2 reboots in between.

Bash script1,

sudo apt update
sudo apt -y upgrade
do_something()
sudo shutdown -r now

Bash script2,

do_something()
sudo shutdown -r now

Bash script3,

do_something()

Running these as remote-exec provisioners against an AWS instance the normal way will return wait: remote command exited without exit status or exit signal.

To solve this, change these settings in /etc/ssh/sshd_config:

...
ClientAliveInterval 120
ClientAliveCountMax 720
...

This keeps pinging your EC2 instance (every 120s, up to 720 times) until SSH is connected again.
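
If those settings aren't baked into the image, one way to apply them is a remote-exec step before the reboot (a rough sketch, assuming passwordless sudo; the sshd service name varies by distro):

provisioner "remote-exec" {
  inline = [
    # append the keepalive settings and restart sshd so new sessions pick them up
    "echo 'ClientAliveInterval 120' | sudo tee -a /etc/ssh/sshd_config",
    "echo 'ClientAliveCountMax 720' | sudo tee -a /etc/ssh/sshd_config",
    "sudo systemctl restart sshd || sudo systemctl restart ssh",
  ]
}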

Hope this works for you too.

@mysticaltech

On my end, I circumvented the problem using this method:

Issue a reboot command and wait for MicroOS to reboot and be ready

  provisioner "local-exec" {
    command = <<-EOT
      ssh ${local.ssh_args} root@${self.ipv4_address} '(sleep 2; reboot)&'; sleep 3
      until ssh ${local.ssh_args} -o ConnectTimeout=2 root@${self.ipv4_address} true 2> /dev/null
      do
        echo "Waiting for OS to reboot and become available..."
        sleep 3
      done
    EOT
  }
