Skip to content
This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Device attach fails on volume creation due to duplicate bios.uuid #1373

Closed
govint opened this issue Jun 8, 2017 · 9 comments · Fixed by #1526
Closed

Device attach fails on volume creation due to duplicate bios.uuid #1373

govint opened this issue Jun 8, 2017 · 9 comments · Fixed by #1526

Comments

@govint
Copy link
Contributor

govint commented Jun 8, 2017

Copy-n-paste from email :

    Hi,
    
    We are having some trouble getting the docker-volume-vsphere plugin to work. Do you see anything that we may be missing?
    
    The details and logs are below.
    
    Thanks!
    XXX

    Environment Details: 1 VM on RHEL 7.3, ESXi 6.0U1, Docker 17.05.0-ce, vDVS 0.14
    
    Steps to Reproduce:
    docker plugin install --grant-all-permissions --alias vsphere vmware/docker-volume-vsphere:0.14
    docker volume create --driver=vsphere --name=test
    
    Actual Result:
    Error response from daemon: create test: VolumeDriver.Create: Failed to create filesystem on /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0: exit status 1. Output = mke2fs 1.43.3 (04-Sep-2016)
    The file /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0 does not exist and no size was specified.
    
    Client Logs:
    2017-06-07 19:32:35.514944609 +0000 UTC [INFO] Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf"
    2017-06-07 19:32:35.515383164 +0000 UTC [INFO] Plugin options - port=1019
    2017-06-07 19:32:35.515411358 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
    2017-06-07 19:32:35.539957187 +0000 UTC [INFO] Found 6 running or paused containers
    2017-06-07 19:32:35.546933529 +0000 UTC [INFO] Discovered 0 volumes in use.
    2017-06-07 19:32:35.546966645 +0000 UTC [INFO] Refcounting successfully completed
    2017-06-07 19:32:35.546986052 +0000 UTC [INFO] Docker VMDK plugin started version="vSphere Volume Driver v0.4" port=1019 mock_esx=false
    2017-06-07 19:32:35.54727681 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock"
    2017-06-07 19:07:25.347595558 +0000 UTC [DEBUG] vmdkOps.Get name=test
    2017-06-07 19:07:25.34765686 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:25.563177483 +0000 UTC [DEBUG] vmdkOp.Create name=test
    2017-06-07 19:07:25.563228464 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:25.816647424 +0000 UTC [INFO] Attaching volume and creating filesystem name=test fstype=ext4
    2017-06-07 19:07:25.818644055 +0000 UTC [DEBUG] vmdkOps.Attach name=test
    2017-06-07 19:07:25.818684762 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:44.342576236 +0000 UTC [WARNING] Exceeded timeout while waiting for device attach to completetimeout=10s device="/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0"
    2017-06-07 19:07:44.344666589 +0000 UTC [ERROR] Create filesystem failed, removing the volume name=test error="Failed to create filesystem on /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0: exit status 1. Output = mke2fs 1.43.3 (04-Sep-2016)\nThe file /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0 does not exist and no size was specified.\n"
    2017-06-07 19:07:44.344716477 +0000 UTC [DEBUG] vmdkOps.Detach name=test
    2017-06-07 19:07:44.344730957 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:47.108149631 +0000 UTC [DEBUG] vmdkOps.Remove name=test
    2017-06-07 19:07:47.10819686 +0000 UTC [DEBUG] Run get request: version=
    
    ESX Host Logs:
    06/07/17 23:07:39 1420822 [Thread-11] [INFO   ] cmd get with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:39 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'get' completed with ret={u'Error': 'Volume test not found (file: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk)'}
    06/07/17 23:07:39 1420822 [Thread-12] [WARNING] Volume size not specified
    06/07/17 23:07:39 1420822 [Thread-12] [WARNING] Volume size not specified
    06/07/17 23:07:39 1420822 [Thread-12] [INFO   ] cmd create with opts {u'fstype': u'ext4'} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:39 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** createVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk opts = {u'fstype': u'ext4'}
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [WARNING] Volume size not specified
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'create' completed with ret=None
    06/07/17 23:07:40 1420822 [Thread-13] [INFO   ] cmd attach with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** attachVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk to env-a-4vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Attaching /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk as independent_persistent
    06/07/17 23:07:45 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Disk /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk successfully attached. controller pci_slot_number=160, disk_slot=1
    06/07/17 23:07:45 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'attach' completed with ret={'ControllerPciSlotNumber': '160', 'Unit': '1'}
    06/07/17 23:07:46 1420822 [Thread-14] [INFO   ] cmd detach with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:46 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** detachVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk from env-a-4vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Disk detached /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'detach' completed with ret=None
    06/07/17 23:07:48 1420822 [Thread-15] [INFO   ] cmd remove with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** removeVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** cleanVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'remove' completed with ret=None
    06/07/17 23:08:20 1421847 [MainThread] [INFO   ] Connecting to the local Service Instance as 'dcui'
    
@pdhamdhere
Copy link
Contributor

Timing in log seems suspicious;

Plugin issued attach (19:07:25), timed out after 10Secs and issued detach after 12 seconds (19:07:44).
Server received attach (23:07:40) and detach after 6 seconds (23:07:46)

@shuklanirdesh82 shuklanirdesh82 added this to the 0.16 milestone Jun 9, 2017
@govint
Copy link
Contributor Author

govint commented Jun 9, 2017

Tried this out multiple times and the attach-wait times are anywhere between 6s - 10s. From debugging, the plugin starts its wait after the docker host kernel has completed attaching the device. There are no obvious delays seen in the kernel logs:

Docker host kernel logs:
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.069868] vmw_pvscsi: msg type: 0x0 - MSG RING: 64/63 (5)
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.069873] vmw_pvscsi: msg: device added at scsi0:1:0
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.070082] scsi 2:0:1:0: Direct-Access VMware Virtual disk 1.0 PQ: 0 ANSI: 2
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.074260] sd 2:0:1:0: Attached scsi generic sg4 type 0
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075813] sd 2:0:1:0: [sdd] 204800 512-byte logical blocks: (104 MB/100 MiB)
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075840] sd 2:0:1:0: [sdd] Write Protect is off
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075844] sd 2:0:1:0: [sdd] Mode Sense: 61 00 00 00
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075869] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075871] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.076083] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.076086] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710503] sdd: unknown partition table
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710716] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710720] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710743] sd 2:0:1:0: [sdd] Attached SCSI disk

Plugin logs
2017-06-08 21:00:52.856006649 -0700 PDT [DEBUG] vmdkOp.Create name=y_5
2017-06-08 21:00:52.856329133 -0700 PDT [DEBUG] Run get request: version=
2017-06-08 21:01:24.248953919 -0700 PDT [INFO] Attaching volume and creating filesystem name="y_5" fstype=ext4
2017-06-08 21:01:24.249156484 -0700 PDT [DEBUG] vmdkOps.Attach name=y_5
2017-06-08 21:01:24.249226682 -0700 PDT [DEBUG] Run get request: version=
2017-06-08 21:01:56.483686652 -0700 PDT [ERROR] Starting wait for attachname="y_5" error=
2017-06-08 21:02:02.581062994 -0700 PDT [DEBUG] event: "/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0": 0x100 == IN_CREATE
2017-06-08 21:02:02.581194605 -0700 PDT [INFO] Scan complete device="/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0" event="/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0": 0x100 == IN_CREATE

As seen above its taking around 6s to wait for the device to be detected by the plugin.

The issue is reproduced at times but from all the logs available doesn't seem to be an issue on the ESX side.

@govint
Copy link
Contributor Author

govint commented Jun 12, 2017

Summarizing, this doesn't seem to be a host (ESX) or VM issue. None of the logs point to any latencies in the host and the VM, except perhaps the GO watcher itself thats causing this behavior.

@govint govint self-assigned this Jun 13, 2017
@govint
Copy link
Contributor Author

govint commented Jun 13, 2017

@pdhamdhere, from all local tests was able to repro this issue some times and thereafter the issue didn't recur. I don't believe its a plugin or ESX service issue as there are no delays there that point to an obvious issue. I'll have an NFS datastore added and lets see if this is reproduced in CI. Else will close this issue.

@shuklanirdesh82 can you get a NFS DS added to the CI just so we can figure if this can happen there.

@govint
Copy link
Contributor Author

govint commented Jun 20, 2017

Local tests don't seem to repro this issue and at least not for the same reasons reported. Issue is being addressed over email, closing this issue.

@govint govint closed this as completed Jun 20, 2017
@pdhamdhere
Copy link
Contributor

Let's keep the issue open until we root cause the issue in this unique setup.

@pdhamdhere pdhamdhere reopened this Jun 20, 2017
@pdhamdhere pdhamdhere assigned msterin and unassigned govint Jun 20, 2017
@msterin msterin added the P1 label Jun 20, 2017
@msterin
Copy link
Contributor

msterin commented Jun 20, 2017

@govint writes

Summarizing, this doesn't seem to be a host (ESX) or VM issue.
None of the logs point to any latencies in the host and the VM, except perhaps the GO watcher itself thats causing this behavior.

The issue still exists and we are trying to get additional info from the setup where it was reported from, even though it's unique so far. Until we know the root cause it represents a significant risk as it's a blocker for whoever steps into it.

@msterin
Copy link
Contributor

msterin commented Jun 21, 2017

More info. Customer writes

Here is the information from the ESX side of things:

ESX: VMware ESXi 6.0.0 build-3380124
Storage: NFS

Another potential point of interest….we are deploying the Docker VM’s using a vApp template via the vCloud Director API and the VM is actually managed by our vCloud Director solution. This might be why the process is timing out when trying to attach the disk to the VM since vCloud owns the entity.

We may want to deploy a docker vm using our template but not via vCloud Director to determine if we experience the same behavior

and later he writes the following (hostnames and vm names redacted)

Line 20 of vmdk_ops.log:
*** attachVMDK: vmdk_name_removed to vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f

It looks like it’s attaching the volume to vm-a-1, but the docker volume create command was on vm-1

So the theory is that we are stepping into VM UUID uniqueness issues.
With vCloudDirector finding out unique UUID is not fun

We can do the following:

  • use vc.uuid instead of uuid.bios. This may present a problem if no VC is around

I'll take care of this

@tusharnt
Copy link
Contributor

tusharnt commented Jun 28, 2017

@msterin Are we still targeting this for 0.16 release? We've just 2 more days to go!

@msterin msterin changed the title Device attach fails on volume creation Device attach fails on volume creation due to duplicte bios.uuid Jun 29, 2017
@msterin msterin changed the title Device attach fails on volume creation due to duplicte bios.uuid Device attach fails on volume creation due to dupliicate bios.uuid Jun 29, 2017
@msterin msterin changed the title Device attach fails on volume creation due to dupliicate bios.uuid Device attach fails on volume creation due to duplicate bios.uuid Jun 29, 2017
msterin pushed a commit that referenced this issue Jun 30, 2017
When communication from VMCI vSocket is established n vmdk_ops.py, we
find out VM ID from
the socket, and then locate VM ManagedObject by this ID.

There are 2 IDs - VC UUID and BIOS UUID. When .vmx file is
copied (e.g. VCD or other products), BIOS UUID can be duplicate thus
failures like #1373 .

However, VC UUID is unique on creation of VM, and unique as long as ESXi
is a
part of a VC. So it is a much better candidate for using as a unique ID.

This change tries to use VC UUID first, and fails back to BIOS UUID
next. The change also checks that the VM name (for a found VM by ID)
matches the one we see from vSocket, and does a few other minor prints.
shuklanirdesh82 pushed a commit that referenced this issue Jun 30, 2017
* Added proper return on protocol version mismatch

* Using VC UUID  for attach/detach where possible

When communcaiton from VMCI vSocket is established, we find out VM ID from
the socket. There are 2 IDs - VC UUID and BIOS UUID. When .vmx file is
copied (e.g. VCD or other product), BIOS UUID can be dulicate. However,
VC UUID is unique on creation of VM, and unique as long as ESXi is a
part of a VC. So it is a much better candidate for using as a unique ID.

Thic change tries to use VC UUID first, and fails back to BIOS UUID
next. The change also checks that the VM name (for a found VM by ID)
matches the one we see from vSocket, and does a few other minor prints.

* Try to use VC UUID first for locating VMs, then fall back to BIOS UUID

When communication from VMCI vSocket is established n vmdk_ops.py, we
find out VM ID from
the socket, and then locate VM ManagedObject by this ID.

There are 2 IDs - VC UUID and BIOS UUID. When .vmx file is
copied (e.g. VCD or other products), BIOS UUID can be duplicate thus
failures like #1373 .

However, VC UUID is unique on creation of VM, and unique as long as ESXi
is a
part of a VC. So it is a much better candidate for using as a unique ID.

This change tries to use VC UUID first, and fails back to BIOS UUID
next. The change also checks that the VM name (for a found VM by ID)
matches the one we see from vSocket, and does a few other minor prints.

* Unbundled detachVMDK and attachVMD to address review comments

* fixed the VM instance ID
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants