This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Plugin panic during format-remove #1062

Closed
pdhamdhere opened this issue Mar 22, 2017 · 11 comments · Fixed by #1066

@pdhamdhere
Contributor

https://ci.vmware.run/vmware/docker-volume-vsphere/1759

Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: fatal error: unexpected signal during runtime execution
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: [signal 0xb code=0x1 addr=0xffffffffff600400 pc=0xffffffffff600400]
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: runtime stack:
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: runtime.throw(0xa423c0, 0x2a)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /usr/local/go/src/runtime/panic.go:530 +0x90
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: runtime.sigpanic()
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: goroutine 40 [syscall, locked to thread]:
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: runtime.cgocall(0x773070, 0xc8206dd450, 0xc800000000)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc8206dd438 sp=0xc8206dd408
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: runtime.cmalloc(0x36, 0x35)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /usr/local/go/src/runtime/cgocall.go:149 +0x43 fp=0xc8206dd468 sp=0xc8206dd438
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk/vmdkops._Cfunc_CString(0xc82044e2c0, 0x35, 0x35)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         ??:0 +0x28 fp=0xc8206dd4d8 sp=0xc8206dd468
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk/vmdkops.EsxVmdkCmd.Run(0xc8202366a8, 0x9a7d60, 0x6, 0xc8205ab260, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /drone/src/github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk/vmdkops/esx_vmdkcmd.go:85 +0x320 fp=0xc8206dd6c8 sp=0xc8206dd4d8
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk/vmdkops.VmdkOps.Remove(0x7fae19182a50, 0xc8202366a8, 0xc8205ab260, 0xf, 0x0, 0x0, 0x0)
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]:         /drone/src/github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk/vmdkops/vmdkops.go:58 +0x185 fp=0xc8206dd780 sp=0xc8206dd6c8
Mar 22 06:33:00 photon-klJ0Zth7o docker-volume-vsphere[576]: github.com/vmware/docker-volume-vsphere/vmdk_plugin/drivers/vmdk.(*VolumeDriver).Remove(0xc820214ac0, 0xc8205ab260, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

CC @govint

@pdhamdhere
Contributor Author

We should upgrade Go to 1.7 and see if it is reproducible.

@govint
Contributor

govint commented Mar 22, 2017

    vmdkCmd.Mtx.Lock()
    defer vmdkCmd.Mtx.Unlock()

    jsonStr, err := json.Marshal(&requestToVmci{
            Ops:     cmd,
            Details: VolumeInfo{Name: name, Options: opts}})
    if err != nil {
            return nil, fmt.Errorf("Failed to marshal json: %v", err)
    }

    cmdS := C.CString(string(jsonStr))   <----------- line 85: caused the panic
    defer C.free(unsafe.Pointer(cmdS))
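
For context, line 85 is the standard cgo string conversion: C.CString copies the Go string into memory obtained from C's malloc (the runtime.cmalloc frame in the trace), and the caller must free it. A minimal standalone sketch of that pattern, with an illustrative payload rather than the plugin's actual request format:

    package main

    /*
    #include <stdlib.h>
    #include <string.h>
    */
    import "C"

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        // C.CString allocates with C's malloc and copies the Go string into
        // it; the caller must free the result or the allocation leaks.
        cmdS := C.CString(`{"cmd":"remove","details":{"Name":"vol1"}}`)
        defer C.free(unsafe.Pointer(cmdS))

        fmt.Println("C string length:", C.strlen(cmdS))
    }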

Checking plugin logs for the remove command that caused this.

@govint
Contributor

govint commented Mar 22, 2017

The volume (concurrent test on the same host) was created and the remove hit the panic, restarting the plugin. The volume remove was retried and is now stuck (similar to the problem solved in the earlier issue, where the volume was left attached after a panic during volume create).

Plugin logs:

Volume is created
2017-03-22 06:33:00.598077923 +0000 UTC [INFO] Attaching volume and creating filesystem name=volTestP-same11 fstype=ext4
2017-03-22 06:33:00.961160049 +0000 UTC [INFO] Scan complete device="/dev/disk/by-path/pci-0000:0b:00.0-scsi-0:0:0:0" event="/dev/disk/by-path/pci-0000:0b:00.0-scsi-0:0:0:0": 0x100 == IN_CREATE

Volume remove is initiated and plugin restarts

2017-03-22 06:33:01.278789815 +0000 UTC [INFO] No config file found. Using defaults.
2017-03-22 06:33:01.278899268 +0000 UTC [WARNING] Failed to load config file /etc/docker-volume-vsphere.conf: open /etc/docker-volume-vsphere.conf: no such file or directory
2017-03-22 06:33:01.278909137 +0000 UTC [INFO] Starting plugin log_level=info config="/etc/docker-volume-vsphere.conf" driver=vsphere
2017-03-22 06:33:01.278918299 +0000 UTC [INFO] Plugin options - port=1019
2017-03-22 06:33:01.278927973 +0000 UTC [INFO] Docker VMDK plugin started version="vSphere Volume Driver v0.4" port=1019 mock_esx=false
2017-03-22 06:33:01.278985154 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock"

Volume remove was retried and failed because the earlier panic left the volume attached to the VM.

2017-03-22 06:33:16.123089018 +0000 UTC [INFO] Removing volume name=volTestP-same11
2017-03-22 06:33:16.184371682 +0000 UTC [ERROR] Failed to remove volume error="Failed to remove volume /vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/volTestP-same11.vmdk, in use by VM uuid = 564d85dd-efaf-374f-01f5-70ec76df0cce." name=volTestP-same11

@govint
Contributor

govint commented Mar 22, 2017

Just guessing, but we may be hitting a leak. Per the cgo documentation, if C.malloc() ever fails to allocate memory it will crash the program (see https://golang.org/cmd/cgo/).
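
A tiny sketch of what that means in practice (illustrative only):

    package main

    /*
    #include <stdlib.h>
    */
    import "C"

    func main() {
        // C.malloc (which C.CString uses internally) never returns nil: if the
        // underlying C malloc cannot allocate, the cgo helper crashes the whole
        // program. A slow leak of un-freed buffers could therefore end in a
        // fatal runtime error rather than a recoverable Go error.
        buf := C.malloc(64)
        defer C.free(buf)
    }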

@pshahzeb pshahzeb assigned pshahzeb and govint and unassigned pshahzeb Mar 22, 2017
@pdhamdhere
Contributor Author

pdhamdhere commented Mar 23, 2017

Looks like this issue got exposed by #1058. Before #1058 there was an implicit delay of 1+ sec, so the concurrency test wasn't fully concurrent!

I propose we disable part of the concurrency test,
https://github.com/vmware/docker-volume-vsphere/blob/master/vmdk_plugin/sanity_test.go#L305
L305 - L326, to unblock other PRs (a minimal skip sketch follows below).
a. The issue is 100% reproducible, so investigation can continue on a local setup.
b. It's unlikely for a user to run into this. (This is a stress test.)
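
One minimal way to disable just that block is a t.Skip at the top of the test; a sketch only, with an illustrative test name rather than the exact one in sanity_test.go:

    package sanity_test // hypothetical package name

    import "testing"

    func TestConcurrencySameHost(t *testing.T) {
        // Temporarily disabled pending root cause of the cgo panic in #1062;
        // the rest of the sanity tests keep running.
        t.Skip("skipping same-host concurrency stress test, see #1062")

        // ... original test body (sanity_test.go L305 - L326) ...
    }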

What do you think @shuklanirdesh82 & @govint ?

pdhamdhere added a commit that referenced this issue Mar 23, 2017
Test will be re-enabled once we root cause & fix #1062
@shuklanirdesh82
Contributor

@pdhamdhere's comment #1062 (comment)

I propose we disable part of concurrency test
It's unlikely for User to run into this. (This is a stress test)

Yeah, agreed! I would rather request commenting out the whole test, as it causes more delay (for PRs to merge) than expected (see #1057).

@govint
Contributor

govint commented Mar 23, 2017

But how is this related to the wait for device attach? Shouldn't the issue then also happen with a larger load of volume create requests to the managed plugin?

pdhamdhere added a commit that referenced this issue Mar 23, 2017
Test will be re-enabled once we root cause & fix #1062
@pdhamdhere
Contributor Author

#1066 disabled the tests, but we still need to root-cause the panic triggered by the concurrency test.

@pdhamdhere pdhamdhere reopened this Mar 23, 2017
@govint
Contributor

govint commented Mar 27, 2017

Have been running a larger config of up to 500 volumes and more threads for the concurrency test, with Photon 4.4.41 and Ubuntu guests on ESXi 6.0u4. Don't have a repro yet. From the CI runs, at least, it looks like the Go/C heap is corrupted: converting a Go string to C does a malloc() to create the C string, and that is where the panic is seen. Used the debug pkg to get a stack of all the threads and am checking that.
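
For reference, the all-goroutine stack capture mentioned above can be done with the standard library; a small sketch (where it gets hooked into the plugin is up to the debug build):

    package main

    import (
        "os"
        "runtime/pprof"
    )

    // dumpAllStacks writes every goroutine's stack to stderr; debug level 2
    // prints them in the same format as an unrecovered panic, which makes them
    // easy to compare against the trace in this issue.
    func dumpAllStacks() {
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }

    func main() {
        dumpAllStacks()
    }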

Currently trying out more debug logs and checks for memory overruns, with single and multiple threads.

Another possible issue is the thread safety of the C calls the plugin makes, since the problem happens with the concurrent tests. So that's one more angle to investigate.

@govint
Contributor

govint commented Mar 30, 2017

Tried the following and am unable to repro the issue anymore. The hunch was that the buffer being used to collect the response may be overflowing, causing the runtime corruption.

  1. Skipped malloc'ing the response buffer and used a static sized array in the response structure. The panic doesn't repro, but a static sized buffer isn't a solution given the response can be larger (list can return >8K of data with even a moderate number of volumes).
  2. Changed the calloc() in vsock_reply to a malloc() - the issue doesn't repro (https://ci.vmware.run/vmware/docker-volume-vsphere/1941).
  3. While (2) seemed to suggest that the calloc() done in the C code may somehow be causing Go/C memory corruption, I tried a build with no changes except to have all the concurrent tests enabled - cross VM, same VM and clone - just to see whether the original panic reproduces with the current code. The build was successful - https://ci.vmware.run/vmware/docker-volume-vsphere/1947.

It seems the issue was internal to the Go/C runtime, since with no code changes the issue isn't repro'ing anymore with all tests enabled.

@pdhamdhere, should we re-enable the concurrent tests in the code, since the issue isn't getting reproduced?

@tusharnt tusharnt modified the milestones: 0.14, 0.13 Mar 31, 2017
@govint
Contributor

govint commented Apr 4, 2017

Fixed via #1118

@govint govint closed this as completed Apr 4, 2017