SSH password connection repeatedly fails to connect in vSphere #4112

Closed
ryanl-ee opened this issue Nov 30, 2015 · 42 comments

Comments

@ryanl-ee

This may or may not be related to my other issue (#4111), which involves the NIC not being connected after provisioning. After I manually connect the NIC, vSphere applies the customization template, reboots, and Terraform begins to attempt connection. The log shows that its IP is correctly reported to Terraform, and I am able to manually SSH into the newly provisioned VM using its IP and the username/password combo specified in my .tf config (vagrant/vagrant).

Let me know if there is anything I can do to help test.

@chrislovecnm
Contributor

@jen20 who is more familiar with chef?
@ryanl-ee long shot, but can you try master? Also, is the IP address correct in the logs?

Chef is attempting to connect and is getting an authentication error from the ssh login.

@ryanl-ee
Author

ryanl-ee commented Dec 2, 2015

The IP address is correct. I tried building master again and am getting the same error for make test in the Vagrant box :)

...
ok      github.com/hashicorp/terraform/builtin/providers/vsphere        0.037s
ok      github.com/hashicorp/terraform/builtin/provisioners/chef        0.322s
ok      github.com/hashicorp/terraform/builtin/provisioners/file        0.023s
ok      github.com/hashicorp/terraform/builtin/provisioners/local-exec  0.179s
--- FAIL: TestResourceProvider_CollectScripts_script (0.02s)
        resource_provisioner_test.go:120: bad: cd /tmp
                wget http://foobar
                exit 0
--- FAIL: TestResourceProvider_CollectScripts_scripts (0.03s)
        resource_provisioner_test.go:151: bad: cd /tmp
                wget http://foobar
                exit 0
FAIL
FAIL    github.com/hashicorp/terraform/builtin/provisioners/remote-exec 0.080s
ok      github.com/hashicorp/terraform/command  6.059s
ok      github.com/hashicorp/terraform/communicator     0.023s
ok      github.com/hashicorp/terraform/communicator/remote      0.009s
ok      github.com/hashicorp/terraform/communicator/ssh 0.057s
...

make release gives me make: *** [release] Error 1.

@chrislovecnm
Contributor

@ryanl-ee that is the main build not happy ... welcome :) Do you know how to run the vsphere acceptance tests?

@ryanl-ee
Author

ryanl-ee commented Dec 4, 2015

Not at first :) I figured out some of it by reading through the provider code & trial and error. I can contribute to the documentation when I have this all buttoned down. For now, for anyone else wanting to try this at home...

export VSPHERE_NETWORK_GATEWAY=192.168.1.1
export VSPHERE_NETWORK_IP_ADDRESS=<unused static IP>
export VSPHERE_NETWORK_LABEL="VM Network"
export VSPHERE_NETWORK_LABEL_DHCP="VM Network"
export VSPHERE_TEMPLATE="templates/ubuntu1404-nocm-2.0.10"
export VSPHERE_DATACENTER="Datacenter"
export VSPHERE_CLUSTER="Cluster"
export VSPHERE_DATASTORE="Datastore"
export VSPHERE_USER=administrator@vsphere.local
export VSPHERE_PASSWORD="thepassword"
export VSPHERE_SERVER="<vCenter Server IP>"
export VSPHERE_ALLOW_UNVERIFIED_SSL=true
make testacc TEST=./builtin/providers/vsphere

Per #4111 I have to go in and connect the NICs of the created test VMs. They all fail with Attribute 'network_interface.#' expected "1", got "0" but I'm not sure how to fix that. I'm on 283a838 which is the latest commit as of right now.

Acceptance test logs: https://gist.github.com/ryanl-ee/cb58e2026478bd5cd1b0

@ryanl-ee
Author

ryanl-ee commented Dec 5, 2015

It occurred to me that going in and reconnecting the NIC could be affecting the results... I ran it again without any intervention and got the same results. I think I know what the issue is, though. I tried cloning my specified VM manually. When my clone boots, it tries to get DHCP for 60+ seconds. When that fails, it fully boots, starts VMware tools, reconfigures itself, and then successfully gets the customization applied to it. When VMs are cloned via the acceptance test, however, the test doesn't wait long enough for all of that to finish. Is there a way to specify the timeout in the acceptance test? I can't seem to find an environment variable that allows me to set that. Alternatively I could modify the source VM not to wait for DHCP at boot, but that would be more of an issue with Boxcutter :)

@chrislovecnm
Contributor

@tkak any ideas?

@ryanl-ee
Author

ryanl-ee commented Dec 7, 2015

I modified my source VM to only wait 10 seconds. I tried directly provisioning with Terraform again and got this from the INFO log, which looks interesting & maybe reveals something that TRACE doesn't...

vsphere_virtual_machine.gitlab (chef): Connecting to remote host via SSH...
vsphere_virtual_machine.gitlab (chef):   Host:
vsphere_virtual_machine.gitlab (chef):   User: vagrant
vsphere_virtual_machine.gitlab (chef):   Password: true
vsphere_virtual_machine.gitlab (chef):   Private key: false
vsphere_virtual_machine.gitlab (chef):   SSH Agent: false

Should Host: be empty?

I also tried again with the acceptance test. It passed the first 2 tests, but failed the last test with the same error above.

@chrislovecnm
Contributor

@jen20 @phinze who is familiar with terraform's chef module?

@chrislovecnm
Contributor

@ryanl-ee we need someone with a bit more experience with chef to help us

@ryanl-ee
Author

ryanl-ee commented Dec 7, 2015

OK! On a hunch I tried again with the "remote-exec" provisioner instead of the "chef" provisioner and got the same error, i.e.:

  provisioner "remote-exec" {
    inline = [
    "sudo apt-get update",
    "sudo apt-get upgrade -y"
    ]
    connection {
      user = "vagrant"
      password = "vagrant"
      timeout = "10m"
    }
  }

...

vsphere_virtual_machine.gitlab (remote-exec): Connecting to remote host via SSH...
vsphere_virtual_machine.gitlab (remote-exec):   Host:
vsphere_virtual_machine.gitlab (remote-exec):   User: vagrant
vsphere_virtual_machine.gitlab (remote-exec):   Password: true
vsphere_virtual_machine.gitlab (remote-exec):   Private key: false
vsphere_virtual_machine.gitlab (remote-exec):   SSH Agent: false

@phinze
Contributor

phinze commented Dec 7, 2015

@chrislovecnm I don't think this is chef-specific, since the Host is empty it would apply to remote-exec provisioner as well.

The host is specified during SetConnInfo which is called here.

Looks like it's trying to pull out the IP address from the first network interface, but something is causing it to fail.

@phinze
Contributor

phinze commented Dec 7, 2015

@ryanl-ee if you turn on debug logging, TF_LOG=debug, I'd be curious to see what the output of this line is when you get this behavior.

@chrislovecnm
Contributor

@ryanl-ee agreed. This is something going on in the remote communicator. I think in here:

https://github.com/hashicorp/terraform/blob/master/communicator/ssh/communicator.go

I am making an educated guess that the host should be printing out. Can you post debug to a gist?

Thanks

@chrislovecnm
Contributor

@phinze the code in the remote communicator is not printing out the IP or hostname. Where does this come from?

@phinze
Contributor

phinze commented Dec 7, 2015

@chrislovecnm that has to be set by the resource by calling SetConnInfo - looks like for some reason that's not being called in this case

@ryanl-ee
Author

ryanl-ee commented Dec 7, 2015

@chrislovecnm Here you go, fresh gist: https://gist.github.com/ryanl-ee/f5211ab876589274d024

@phinze I wasn't sure how to identify the expected output of that line, sorry! Out of curiosity what should I look for?

@phinze
Contributor

phinze commented Dec 7, 2015

@ryanl-ee heh looks like we have to count since there's no prefix

Should be the third debug line after "Created virtual machine"

2015/12/07 12:38:25 [DEBUG] terraform-provider-vsphere: 2015/12/07 12:38:25 [DEBUG] []types.GuestNicInfo(nil)

^^ Yep so the machine for whatever reason is coming back with no NIC details. So there are no networkInterfaces pulled out, and SetConnInfo is never called.

The next question is, why would there be no network interfaces?
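
To make the failure mode concrete, here is a minimal model of the behaviour described above: the connection host is taken from the first reported NIC, so a nil NIC list leaves it empty. The `guestNIC` type and `connHost` function are hypothetical stand-ins for illustration, not the provider's actual code or govmomi's types.

```go
package main

import "fmt"

// guestNIC is a hypothetical stand-in for the per-NIC details a guest
// reports (the real data in this thread is []types.GuestNicInfo).
type guestNIC struct {
	IPAddresses []string
}

// connHost returns the first address of the first NIC. The second
// result is false when there is nothing to use, which models the case
// where the host is never set for the provisioner connection.
func connHost(nics []guestNIC) (string, bool) {
	if len(nics) == 0 || len(nics[0].IPAddresses) == 0 {
		return "", false
	}
	return nics[0].IPAddresses[0], true
}

func main() {
	// A nil slice, as in the debug line quoted above.
	host, ok := connHost(nil)
	fmt.Printf("host=%q ok=%v\n", host, ok) // host="" ok=false

	// Once the guest reports a NIC, a host is available.
	host, ok = connHost([]guestNIC{{IPAddresses: []string{"192.168.1.50"}}})
	fmt.Printf("host=%q ok=%v\n", host, ok) // host="192.168.1.50" ok=true
}
```

The empty result in the first case lines up with the blank `Host:` line in the provisioner output earlier in the thread.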

@ryanl-ee
Author

ryanl-ee commented Dec 7, 2015

OK, so my understanding of the order of operations:

  • Terraform starts the clone process on a VM
  • vSphere clones the VM and boots it with no NIC on purpose (https://communities.vmware.com/message/2405839#2405839)
  • Customization completes, the VM is rebooted and the NIC is connected
  • VM gets connected to the network with its unique MAC & hostname

I thought maybe TF was grabbing the VM status before customization was completed, but it does have an IP by that time (https://gist.github.com/ryanl-ee/f5211ab876589274d024#file-tf-log-L4130) so that should be the final configuration. Here is the gitlab.vmx I pulled down, which does reflect the ethernet0 configuration I see in vSphere:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
vmci0.present = "TRUE"
floppy0.present = "FALSE"
memSize = "1024"
sched.cpu.units = "mhz"
sched.cpu.latencySensitivity = "normal"
tools.upgrade.policy = "manual"
scsi0.virtualDev = "lsilogic"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "gitlab.vmdk"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
displayName = "gitlab"
guestOS = "ubuntu-64"
cpuid.80000001.eax = "--------------------------------"
cpuid.80000001.ebx = "--------------------------------"
cpuid.80000001.ecx = "--------------------------------"
cpuid.80000001.edx = "-----------H--------------------"
cpuid.80000001.eax.amd = "--------------------------------"
cpuid.80000001.ebx.amd = "--------------------------------"
cpuid.80000001.ecx.amd = "--------------------------------"
cpuid.80000001.edx.amd = "-----------H--------------------"
toolScripts.afterPowerOn = "TRUE"
toolScripts.afterResume = "TRUE"
toolScripts.beforeSuspend = "TRUE"
toolScripts.beforePowerOff = "TRUE"
tools.syncTime = "FALSE"
tools.guest.desktop.autolock = "FALSE"
messageBus.tunnelEnabled = "FALSE"
uuid.bios = "42 14 9d 19 3a 26 fb 5e-62 43 d7 9d eb c3 c7 37"
vc.uuid = "50 14 65 0d b9 a4 8e f7-6e a9 1b 1f 43 9a cc 32"
nvram = "gitlab.nvram"
pciBridge0.present = "TRUE"
svga.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
hpet0.present = "true"
monitor.phys_bits_used = "42"
pciBridge0.pciSlotNumber = "17"
pciBridge4.pciSlotNumber = "21"
pciBridge5.pciSlotNumber = "22"
pciBridge6.pciSlotNumber = "23"
pciBridge7.pciSlotNumber = "24"
replay.supported = "false"
scsi0.pciSlotNumber = "16"
softPowerOff = "FALSE"
virtualHW.productCompatibility = "hosted"
vmci0.pciSlotNumber = "32"
vmotion.checkpointFBSize = "4194304"
vmotion.checkpointSVGAPrimarySize = "4194304"
migrate.hostlog = "gitlab-650ebb2c.hlog"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.minSize = "0"
sched.mem.shares = "normal"
ethernet0.virtualDev = "vmxnet3"
ethernet0.networkName = "VM Network"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:94:dd:76"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
sched.swap.derivedName = "/vmfs/volumes/565c7a3c-ae8754bc-ee38-842b2b61667e/gitlab/gitlab-c6d8ec4e.vswp"
uuid.location = "56 4d f0 1a b2 4a a1 21-39 db dc 6d 72 e5 fb 43"
replay.filename = ""
scsi0:0.redo = ""
ethernet0.pciSlotNumber = "160"
vmci0.id = "-339491017"
cleanShutdown = "FALSE"

@phinze
Contributor

phinze commented Dec 7, 2015

Okay so why would we see an IP address in the debug line that @ryanl-ee linked, but not get one back on mvm.Guest.Net? Any ideas there @chrislovecnm?

@chrislovecnm
Contributor

@phinze that might be an API thing. I am actually on vacation. Might take me a bit to work out. @tkak you have any ideas?

@tkak
Contributor

tkak commented Dec 12, 2015

I'm not sure why mvm.Guest.Net is missing, but it does seem possible to get the IP address with WaitForIP(). So I tried changing the host parameter in the connection info in my PR. The issue of the missing mvm.Guest.Net remains, though...

@chrislovecnm
Contributor

@pietern any advice on diagnosing why Guest.Net is not returning correct data?

@chrislovecnm
Contributor

I just filed vmware/govmomi#405 to loop in the api owners. We kinda need Guest.Net to work cleanly ...

@chrislovecnm
Contributor

Another related issue: #4302 - not a duplicate, but in the same code block.

@tkak
Contributor

tkak commented Dec 18, 2015

It seems that there is a time lag between when ipAddress and net (GuestNicInfo[]) are populated; it was maybe a few seconds. I guess that lag is why Guest.Net is missing. So we should use the IP address from the WaitForIP function instead of Guest.Net.

[Screenshot, 2015-12-18 10:59:15 am]

[Screenshot, 2015-12-18 12:09:23 pm]
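
One way to picture the suggestion is a host lookup that tolerates the lag. The sketch below is a variation on it rather than the change in any PR: it prefers a NIC-reported address when one exists and otherwise falls back to the address obtained by waiting for the guest IP. `resolveHost` and its parameters are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveHost picks the address a provisioner should connect to.
// nicIPs models addresses read from the guest's NIC list (possibly
// empty because of the lag), and waitedIP models the single address
// returned by a WaitForIP-style call.
func resolveHost(nicIPs []string, waitedIP string) (string, error) {
	if len(nicIPs) > 0 && nicIPs[0] != "" {
		return nicIPs[0], nil
	}
	if waitedIP != "" {
		return waitedIP, nil
	}
	return "", errors.New("no address available for the provisioner connection")
}

func main() {
	// NIC details not populated yet, but the guest IP is already known.
	host, err := resolveHost(nil, "192.168.1.50")
	fmt.Println(host, err) // 192.168.1.50 <nil>
}
```

Returning an error when neither source has an address makes the problem visible at once, instead of surfacing later as an SSH attempt against an empty host.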

@chrislovecnm
Contributor

We can do a wait for ip on Guest.net actually ...

@pietern

pietern commented Dec 27, 2015

You can use the property collector to wait for the net property to change to non-nil. The WaitForIP call does the same thing, but for the ipAddress property. If this is a timing issue, and you need the net property here, you can use the WaitForIP function body as a template and make a custom WaitForNet call that blocks until the net property is set.

@chrislovecnm
Contributor

Thanks @pietern - @tkak make sense?

@ryanl-ee
Author

Is there any way I can help test this? I would love for this to be in master. Thanks again for the attention so far!

@ryanl-ee
Author

Hi folks, just wondering if there was anything for this in the works. I'm trying to do all sorts of deployments and am unable to use any provisioner aside from 'local-exec', recreating (in very basic ways) what the provisioner does in a simple shell script.

@phinze
Contributor

phinze commented Mar 23, 2016

Hi @ryanl-ee - the core team is at a bit of a loss at the moment since we are without a vSphere environment for testing. We're working on that, but progress is slow!

It's possible that #5558 might have improved things somewhat in Terraform v0.6.14 - can you give that version another shot to see if remote-exec behaves better for you there?

@chrislovecnm
Contributor

@ryani-ee can you assist with testing? The code changes are pretty simple actually.

@chrislovecnm
Contributor

@phinze I don't think #5558 would help, and we have to wait for vsphere to get the ip attached properly. This is probably a pure timing issue.

@chrislovecnm
Contributor

@phinze could be wrong ;)

@chrislovecnm
Contributor

@ryanl-ee sorry I typo'ed your username. Would you be able to test for us?

@ryanl-ee
Author

@chrislovecnm Sure, I'd be happy to. What do I need to do?

@chrislovecnm
Contributor

@ryanl-ee I need to submit a PR and then we need to work together. Let me see if I can get the code done this weekend

@chrislovecnm
Contributor

@phinze any word on a test bed? If you want to reach out to me, I may be able to assist.

@chrislovecnm
Contributor

@ryanl-ee I see that a PR is already in, but I moved the WaitForIP call up in the method. You can try out my branch here: https://github.com/chrislovecnm/terraform/tree/vsphere-ip-wait-issue (I have not tested it; it only compiles)

Have you tried @tkak's PR? #4283

p.s. if you need help getting your dev environment up, please ping me

@chrislovecnm
Contributor

@xantheran this is a good way to test the branch I have as well :)

@stack72
Contributor

stack72 commented Apr 21, 2016

Fixed in #4283 (by way of #6293)

@ghost

ghost commented Apr 26, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 26, 2020

7 participants