Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nightly Test 5-6-1- VSAN Simple - container does not exit (InvalidPowerState) #5803

Closed
sgairo opened this issue Jul 24, 2017 · 23 comments
Closed
Assignees
Labels
kind/defect Behavior that is inconsistent with what's intended priority/p0
Milestone

Comments

@sgairo
Copy link
Contributor

sgairo commented Jul 24, 2017

Seen in 6.5:
5-6-1-VSAN-Simple.zip

screen shot 2017-07-24 at 12 06 43 pm

Initial Thoughts: Is this an issue confirming insecure registries?

@mdubya66 mdubya66 added the kind/defect Behavior that is inconsistent with what's intended label Jul 26, 2017
@sgairo
Copy link
Contributor Author

sgairo commented Jul 31, 2017

Seen in 13-1 vMotion VCH Appliance 6.5 7/30/17 nightly report:
13-1-vMotion-VCH-Appliance.zip

@chengwang86
Copy link
Contributor

chengwang86 commented Aug 11, 2017

From 5-1-VSAN PL log:

Jul 23 2017 02:50:11.203Z DEBUG Setting container b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 state: Stopping
Jul 23 2017 02:50:11.203Z INFO  sending kill -TERM b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36
Jul 23 2017 02:50:11.203Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.(*containerBase).startGuestProgram:180] b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36:kill
Jul 23 2017 02:50:12.340Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.(*containerBase).startGuestProgram:180] [1.137335894s] b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36:kill
Jul 23 2017 02:50:12.340Z WARN  stopping b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 via hard power off due to: sending kill -TERM b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36: ServerFaultCode: A general system error occurred: vix error codes = (3016, 0).

Jul 23 2017 02:50:12.340Z DEBUG op=298.20 (delta:3.358µs): [NewOperation] op=298.20 (delta:1.66µs) [github.com/vmware/vic/pkg/vsphere/tasks.WaitForResult:65]
Jul 23 2017 02:50:12.519Z ERROR op=298.20 (delta:178.779763ms): unexpected fault on task retry : &types.InvalidPowerState{InvalidState:types.InvalidState{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}, RequestedState:"poweredOn", ExistingState:"poweredOff"}
Jul 23 2017 02:50:12.532Z WARN  power off b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 task skipped (state was already poweredOff)
Jul 23 2017 02:50:12.532Z DEBUG Set container b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 state: Stopped
Jul 23 2017 02:50:12.532Z DEBUG Container(b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36) closing 0 log followers
Jul 23 2017 02:50:12.532Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.(*Container).stop:374] [1.329403929s] 
... ...
Jul 23 2017 02:50:12.593Z INFO  Reconfigure: attempting update to b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 with change version "2017-07-23T02:49:42.345523Z" (poweredOn)
Jul 23 2017 02:50:12.593Z WARN  Reconfigure: collision of concurrent operations - expected power state poweredOff, found poweredOn
Jul 23 2017 02:50:12.593Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.Commit:36] [1.389929887s] b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36
Jul 23 2017 02:50:12.593Z ERROR CommitHandler error on handle(aa90f7501fcd3f6c39b907088dfdcaea) for b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36: exec.ConcurrentAccessError{err:(*errors.errorString)(0xc4210acff0)}
Jul 23 2017 02:50:12.593Z DEBUG [ END ] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).CommitHandler:215] [1.39054057s] handle(aa90f7501fcd3f6c39b907088dfdcaea)
Jul 23 2017 02:50:12.610Z DEBUG Found 2 subscribers to id: 0 - events.ContainerEvent: Container b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 Stopped

This seems a bit different from #5981 (comment) wherein the error msg is

Aug  9 2017 02:38:54.304Z ERROR op=298.20 (delta:131.846258ms): unexpected fault on task retry : &types.GenericVmConfigFault{VmConfigFault:types.VmConfigFault{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage{types.LocalizableMessage{DynamicData:types.DynamicData{}, Key:"msg.suspend.powerOff.notsuspended", Arg:[]types.KeyAnyValue(nil), Message:"The virtual machine is not suspended."}}}}}, Reason:"The virtual machine is not suspended."}

However, in this ticket the error msg is (inivalidpowerstate):

Jul 23 2017 02:50:12.519Z ERROR op=298.20 (delta:178.779763ms): unexpected fault on task retry : &types.InvalidPowerState{InvalidState:types.InvalidState{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}, RequestedState:"poweredOn", ExistingState:"poweredOff"}

@chengwang86
Copy link
Contributor

chengwang86 commented Aug 11, 2017

From 13-1-vMotion-VCH PL log:

Jul 30 2017 07:53:37.968Z INFO  waiting 10s for c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f to power off
Jul 30 2017 07:53:38.757Z DEBUG vSphere Event Guest operation Start Program performed on Virtual machine vigilant_colden-c0c74a864521. for eventID(1014) ignored by the event collector
Jul 30 2017 07:53:47.800Z WARN  timeout (10s) waiting for c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f to power off via SIGTERM
Jul 30 2017 07:53:47.800Z INFO  sending kill -KILL c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f
Jul 30 2017 07:53:47.800Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.(*containerBase).startGuestProgram:180] c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f:kill
Jul 30 2017 07:53:47.836Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.(*containerBase).startGuestProgram:180] [35.366927ms] c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f:kill
Jul 30 2017 07:53:47.836Z WARN  stopping c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f via hard power off due to: sending kill -KILL c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f: ServerFaultCode: The attempted operation cannot be performed in the current state (Powered off).
Jul 30 2017 07:53:47.836Z DEBUG op=302.48 (delta:3.134µs): [NewOperation] op=302.48 (delta:1.554µs) [github.com/vmware/vic/pkg/vsphere/tasks.WaitForResult:65]
Jul 30 2017 07:53:47.893Z ERROR op=302.48 (delta:56.677859ms): unexpected fault on task retry : &types.InvalidPowerState{InvalidState:types.InvalidState{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}, RequestedState:"poweredOn", ExistingState:"poweredOff"}
Jul 30 2017 07:53:47.905Z WARN  power off c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f task skipped (state was already poweredOff)
Jul 30 2017 07:53:47.905Z DEBUG Set container c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f state: Stopped
... ...
Jul 30 2017 07:53:47.925Z INFO  Reconfigure: attempting update to c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f with change version "2017-07-30T07:53:08.255818Z" (poweredOn)
Jul 30 2017 07:53:47.925Z WARN  Reconfigure: collision of concurrent operations - expected power state poweredOff, found poweredOn
Jul 30 2017 07:53:47.925Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.Commit:36] [10.127326615s] c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f
Jul 30 2017 07:53:47.925Z ERROR CommitHandler error on handle(cbe83ee3f59c480f8ea29b74e0ca63c2) for c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f: exec.ConcurrentAccessError{err:(*errors.errorString)(0xc4205fca60)}
Jul 30 2017 07:53:47.925Z DEBUG [ END ] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).CommitHandler:216] [10.127755392s] handle(cbe83ee3f59c480f8ea29b74e0ca63c2)
Jul 30 2017 07:53:47.942Z DEBUG Found 2 subscribers to id: 0 - events.ContainerEvent: Container c0c74a864521e42c3d474379446016530e1a6cf988b23613906b571cd8dbe31f Stopped

Here we observe the same error msg (invalid state) as the above.

@chengwang86 chengwang86 changed the title Nightly Test 5-6-1- VSAN Simple - container does not exit Nightly Test 5-6-1- VSAN Simple - container does not exit (invalidstate) Aug 14, 2017
@chengwang86
Copy link
Contributor

chengwang86 commented Aug 14, 2017

It seems that in both failures reported above, the containerVMs were in the invalidstate when performing docker stop.

Here is a comparison of expected behavior vs. failure behaivor when performing

docker create busybox bin/top
docker start container
docker stop container

During a successful docker stop (no invalid state):

  • from docker personality log: INFO Sending die event for container 43bcd416a3b0b5380f41535130faf8298fd2a4906db0916e85f20ccba6b86360 - code: 143
  • docker ps -a shows Exited (143)

During the failed docker stop (invalid state):

  • from docker personality log: INFO Sending die event for container b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36 - code: 0
  • docker ps -a shows Stopped

I guess this might be because we didn't handle the containerVM's status and exitcode appropriately if it is in the invalid state.

@chengwang86 chengwang86 changed the title Nightly Test 5-6-1- VSAN Simple - container does not exit (invalidstate) Nightly Test 5-6-1- VSAN Simple - container does not exit (InvalidPowerState) Aug 14, 2017
@chengwang86
Copy link
Contributor

Discussed with @emlin .

There are two issues here:

  • the exit code should be 143 instead of 0 if the containerVM is stopped by sending SIGTERM. The exit code is set by tether. So we need to figure out why tether sets 0 in this scenario, or why the docker persona assumes that the exit code is 0 if tether does its job correctly.
  • why does docker ps -a shows Stopped

@chengwang86
Copy link
Contributor

Here is the vpxd log snippet (when trying to stop the container)

2017-07-23T02:50:12.242Z info vpxd[7F39C54DF700] [Originator@6876 sub=vpxLro opID=1e183d09] [VpxLRO] -- BEGIN lro-2542 -- guestOperationsProcessManager -- vim.vm.guest.ProcessManager.startProgram -- 52aee279-7127-386f-855e-bd3ed68e7f21(5204d911-83a7-1141-8c93-29cee4de369f)
2017-07-23T02:50:13.306Z info vpxd[7F39C54DF700] [Originator@6876 sub=vpxLro opID=1e183d09] [VpxLRO] -- FINISH lro-2542
2017-07-23T02:50:13.306Z info vpxd[7F39C54DF700] [Originator@6876 sub=Default opID=1e183d09] [VpxLRO] -- ERROR lro-2542 -- guestOperationsProcessManager -- vim.vm.guest.ProcessManager.startProgram: vmodl.fault.SystemError:
--> Result:
--> (vmodl.fault.SystemError) {
-->    faultCause = (vmodl.MethodFault) null, 
-->    faultMessage = <unset>, 
-->    reason = "vix error codes = (3016, 0).
--> "
-->    msg = "Received SOAP response fault from [<cs p:00007f39ac4550e0, TCP:10.160.212.141:443>]: startProgram
--> Received SOAP response fault from [<cs p:066f6de0, TCP:localhost:8307>]: startProgram
--> A general system error occurred: vix error codes = (3016, 0).
--> "
--> }
--> Args:
--> 
--> Arg vm:
--> 'vim.VirtualMachine:vm-2'
--> Arg auth:
--> (vim.vm.guest.NamePasswordAuthentication) {
-->    interactiveSession = false, 
-->    username = "b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36", 
-->    password = (not shown)
--> }
--> Arg spec:
--> (vim.vm.guest.ProcessManager.ProgramSpec) {
-->    programPath = "kill", 
-->    arguments = "TERM", 
-->    workingDirectory = <unset>, 
-->    envVariables = <unset>
--> }
2017-07-23T02:50:13.326Z info vpxd[7F39C6C8E700] [Originator@6876 sub=vpxLro opID=494f294e] [VpxLRO] -- BEGIN task-157 -- vm-60 -- vim.VirtualMachine.powerOff -- 52aee279-7127-386f-855e-bd3ed68e7f21(5204d911-83a7-1141-8c93-29cee4de369f)
2017-07-23T02:50:13.480Z warning vpxd[7F39C6C8E700] [Originator@6876 sub=vmmoVm opID=494f294e] [vm.powerOff] Received InvalidPowerState exception with existingState, 0
2017-07-23T02:50:13.481Z info vpxd[7F39C6C8E700] [Originator@6876 sub=vpxLro opID=494f294e] [VpxLRO] -- FINISH task-157
2017-07-23T02:50:13.481Z info vpxd[7F39C6C8E700] [Originator@6876 sub=Default opID=494f294e] [VpxLRO] -- ERROR task-157 -- vm-60 -- vim.VirtualMachine.powerOff: vim.fault.InvalidPowerState:
--> Result:
--> (vim.fault.InvalidPowerState) {
-->    faultCause = (vmodl.MethodFault) null, 
-->    faultMessage = (vmodl.LocalizableMessage) [
-->       (vmodl.LocalizableMessage) {
-->          key = "vpxd.vm.poweroff.unexpectedfailure", 
-->          arg = (vmodl.KeyAnyValue) [
-->             (vmodl.KeyAnyValue) {
-->                key = "1", 
-->                value = "frosty_mahavira-b211861dc9b4"
-->             }
-->          ], 
-->          message = "An error was received from the ESX host while powering off VM frosty_mahavira-b211861dc9b4."
-->       }
-->    ], 
-->    requestedState = "poweredOn", 
-->    existingState = "poweredOff"
-->    msg = "The attempted operation cannot be performed in the current state (Powered off)."
--> }
--> Args:
--> 

@chengwang86
Copy link
Contributor

chengwang86 commented Aug 15, 2017

I can partially reproduce this issue by taking the following steps (same output behavior but not all portlayer error msgs):

- docker run -d busybox bin/top
- on vsphere web client, power of this containerVM
- docker ps -a

Then I see that the containerVM's status is marked as Stopped.

The above steps reproduce the InvalidPowerState error wherein the expected power state is poweredon whereas the actual power state is poweredoff.

Here is the code snippet where we power off the containerVM

func (c *containerBase) poweroff(ctx context.Context) error {
	... ...
	_, err := c.vm.WaitForResult(ctx, func(ctx context.Context) (tasks.Task, error) {
		return c.vm.PowerOff(ctx)
	})

	if err != nil {
		// It is possible the VM has finally shutdown in between, ignore the error in that case
		if terr, ok := err.(task.Error); ok {
			switch terr := terr.Fault().(type) {
			case *types.InvalidPowerState:
				if terr.ExistingState == types.VirtualMachinePowerStatePoweredOff {
					log.Warnf("power off %s task skipped (state was already %s)", c.ExecConfig.ID, terr.ExistingState)
					return nil
				}
				log.Warnf("invalid power state during power off: %s", terr.ExistingState)

			case *types.GenericVmConfigFault:

				// Check if the poweroff task was canceled due to a concurrent guest shutdown
				if len(terr.FaultMessage) > 0 {
					k := terr.FaultMessage[0].Key
					if k == vmNotSuspendedKey || k == vmPoweringOffKey {
						log.Infof("power off %s task skipped due to guest shutdown", c.ExecConfig.ID)
						return nil
					}
				}
				log.Warnf("generic vm config fault during power off: %#v", terr)

			default:
				log.Warnf("hard power off failed due to: %#v", terr)
     ... ...

#5803 relates to the first case. #5981 relates to the second case.

However, here I suspect that c.vm.runtime.PowerState is not set to poweredoff even in the first case, so the subsequent reconfigure operation in commit.go failed with Reconfigure: collision of concurrent operations - expected power state poweredOff, found poweredOn.

@chengwang86
Copy link
Contributor

chengwang86 commented Aug 15, 2017

Update:

  • (1) If a containerVM is powered off OOB (e.g., from vsphere web client), its stoptime=0 and exitcode=0. So the backend code will set the status to Stopped instead of Exited(). So maybe we need to add code here https://github.com/vmware/vic/blob/master/lib/portlayer/exec/container.go#L678 to update the containerVM's stoptime and exitcode (if possible) in guestinfo. But I suspect that tether is the only one that is responsible for updating these values. @cgtexmex What do you think?

  • (2) In this issue, during docker stop, our log shows that the containerVM is already poweredoff when we call StartProgram() to stop the containerVM via kill -s SIGTERM . So we see this Set container container state: Stopped in our log. Later on we get another vsphere event for event PoweredOff received. However, since the containerVM is already marked as stopped, this vsphere event is not considered as OOB operation. Therefore, event if (1) is fixed, (2) may still need extra work.

@sflxn
Copy link
Contributor

sflxn commented Aug 15, 2017

This could also be related to #5629 since the stop could cause the VM to shutdown and we detect the powerstate before we read the exit code.

@cgtexmex
Copy link
Contributor

@chengwang86 yes tether has the responsibility for setting the exitcode & stoptime and we have thus far resisted setting those values via the portLayer, etc...

@chengwang86
Copy link
Contributor

@cgtexmex Can we do a commit in this case since the stoptime is also set here https://github.com/vmware/vic/blob/master/lib/portlayer/exec/handle.go#L210

@cgtexmex
Copy link
Contributor

@chengwang86 we could but we'd have to do the update first (get any values that were written by tether), ensure that the stop / exitcode are missing and then set and commit. We are clearly not doing that now and I can't recall the reasoning...

@mdubya66 mdubya66 modified the milestones: Sprint 15, Sprint 14 Aug 15, 2017
@chengwang86
Copy link
Contributor

According to my discussion with @cgtexmex , the following is the workflow on the PL code path for container stop:

    1. PL commit triggered by backend commitcontainerhandle() in containerstop(). Here stop() function is called and the container's power state is already powered off.
    1. refresh(). After this refresh(), the handle's runtime.powerstate is still powered on.
    1. The subsequent reconfigure operation in commit() failed due to Reconfigure: collision of concurrent operations - expected power state poweredOff, found poweredOn.
    1. so the stoptime and exitcode (if the values are correctly obtained) are not persisted to the vmx file since the PL still assumes that the container is powered on.
    1. when we do docker ps -a, the backend finds that stoptime==0 and exitcode==0; so the output status is set to Stopped.

We need to add more log information to the PL stop code path to figure out why the handle's power state is still powered on after refresh() in ii

@sflxn
Copy link
Contributor

sflxn commented Aug 15, 2017

Guys, there is a strong chance the issue I mentioned, #5629 is related to this. The tether wrote out the status but we're not reading that exit code. In 5629, the property change status checks for powerstate change first so if the vm powers off before the property change is check, it will notice the power state is off and ignore the exit code.

The fix for 5629 was merged so we'll see tonight if some of these "stopped" and not "exit" issues starts to go away.

@cgtexmex
Copy link
Contributor

@sflxn maybe but I don't believe we are hitting the code you changed with #5938 . This specific instance is a known started container (docker start) that is stopped via docker stop not correctly reporting the exit code or time.

@sflxn
Copy link
Contributor

sflxn commented Aug 16, 2017

@cgtexmex I have an issue in progress, #5981 that has the same error in the portlayer. As I look at the code, I think you're right. I didn't read @chengwang86 last summary of your discussion before I dug into the code and came exactly to the same spot. I also figured out how this code works as described in the summary the hard way.

@sflxn
Copy link
Contributor

sflxn commented Aug 17, 2017

From hostd:

--> (vmodl.fault.SystemError) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    reason = "vix error codes = (3016, 0).
--> "
-->    msg = ""
--> }
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: capability, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.guestOperationsReady, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.guestStateChangeSupported, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 1. Sent notification immediately.
2017-07-23T02:50:13.179Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] Send config update invoked
2017-07-23T02:50:13.246Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] Checking for all objects accessibility (VM's current state: VM_STATE_OFF, stable? true)
2017-07-23T02:50:13.246Z info hostd[11D80B70] [Originator@6876 sub=VmObjectStorageMonitor] Object UUID 830e7459-eea8-6efa-62d7-02001c332c85 APD state: Healthy
2017-07-23T02:50:13.247Z info hostd[11D80B70] [Originator@6876 sub=VmObjectStorageMonitor] Object UUID 980e7459-8b2a-6dc8-b439-02001c332c85 APD state: Healthy
2017-07-23T02:50:13.247Z info hostd[11D80B70] [Originator@6876 sub=VmObjectStorageMonitor] Object UUID 9f0e7459-46a5-2897-f696-02001c332c85 APD state: Healthy
2017-07-23T02:50:13.247Z info hostd[11D80B70] [Originator@6876 sub=VmObjectStorageMonitor] Object UUID b50e7459-8015-c354-88e1-02001cd8bed9 APD state: Healthy
2017-07-23T02:50:13.247Z info hostd[11D80B70] [Originator@6876 sub=VmObjectStorageMonitor] Object UUID b90e7459-b1d4-023e-a1ea-02001cd8bed9 APD state: Healthy
2017-07-23T02:50:13.270Z info hostd[11D80B70] [Originator@6876 sub=Libs] OBJDESC-LIB :  Unable to the read the descriptor file '' (No such file or directory)
2017-07-23T02:50:13.270Z warning hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] Failed to get vm swap objectId, '': No such file or directory.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: layoutEx.file[9], 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: layoutEx.file[8], 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.file[7].name, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.file[7].type, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.file[7].size, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.file[7].uniqueSize, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.file[7].backingObjectId, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layoutEx.timestamp, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: layout.swapFile, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: storage, 1. Sent notification immediately.
2017-07-23T02:50:13.270Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.storage, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: config.extraConfig["guestinfo.vice..networks|bridge.network.assigned.gateway.Mask"], 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: config.extraConfig["guestinfo.vice..networks|bridge.assigned.IP"], 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: config.extraConfig["guestinfo.vice..networks|bridge.network.assigned.gateway.IP"], 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: config.extraConfig["guestinfo.vice..networks|bridge.assigned.Mask"], 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9002].connectable.connected, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9002].connectable.status, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9000].connectable.status, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9001].connectable.connected, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9001].connectable.status, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[4000].connectable.connected, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[4000].connectable.status, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[3000].connectable.connected, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[3000].connectable.status, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.cpuAllocation.expandableReservation, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.memoryAllocation.expandableReservation, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.memoryAllocation.overheadLimit, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hotPlugMemoryLimit, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hotPlugMemoryIncrementSize, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.extraConfig["softPowerOff"].value, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.vmxConfigChecksum, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.swapStorageObjectId, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] Updated active set of monitored objects for VM '1': /vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx (5 objects)
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[3000].deviceInfo, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[4000].deviceInfo, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9000].deviceInfo, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9001].deviceInfo, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.hardware.device[9002].deviceInfo, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: resourceConfig.cpuAllocation.expandableReservation, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: resourceConfig.memoryAllocation.expandableReservation, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: resourceConfig.memoryAllocation.overheadLimit, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: runtime.device, 1. Sent notification immediately.
2017-07-23T02:50:13.271Z verbose hostd[10F40B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: licensableResource, ha-host. Sent notification immediately.
2017-07-23T02:50:13.271Z info hostd[10DFEB70] [Originator@6876 sub=Vimsvc.TaskManager opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Task Created : haTask-1-vim.VirtualMachine.powerOff-169261726
2017-07-23T02:50:13.272Z verbose hostd[10DFEB70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ADD: recentTask["haTask-1-vim.VirtualMachine.powerOff-169261726"], 1. Sent notification immediately.
2017-07-23T02:50:13.272Z verbose hostd[10DFEB70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ADD: recentTask["haTask-1-vim.VirtualMachine.powerOff-169261726"], ha-taskmgr. Applied change to temp map.
2017-07-23T02:50:13.272Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: info, haTask-1-vim.VirtualMachine.powerOff-169261726. Applied change to temp map.
2017-07-23T02:50:13.279Z verbose hostd[119C1B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb2b user=vpxuser] RecordOp ASSIGN: info, haTask-ha-host-vim.HostSystem.queryOverheadEx-169261727. Applied change to temp map.
2017-07-23T02:50:13.279Z verbose hostd[119C1B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb2b user=vpxuser] RecordOp ASSIGN: info, haTask-ha-host-vim.HostSystem.queryOverheadEx-169261727. Applied change to temp map.
2017-07-23T02:50:13.304Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: runtime.featureRequirement, 1. Sent notification immediately.
2017-07-23T02:50:13.304Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 1. Sent notification immediately.
2017-07-23T02:50:13.328Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to success
2017-07-23T02:50:13.329Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.extraConfig, 1. Sent notification immediately.
2017-07-23T02:50:13.329Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.
2017-07-23T02:50:13.329Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp REMOVE: latestPage[158], session[5296c1f7-7c57-6b64-2dd0-3663abb64500]52480f30-1f71-e831-aaf7-8691441f1553. Applied change to temp map.
2017-07-23T02:50:13.329Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ADD: latestPage[168], session[5296c1f7-7c57-6b64-2dd0-3663abb64500]52480f30-1f71-e831-aaf7-8691441f1553. Applied change to temp map.
2017-07-23T02:50:13.329Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: latestEvent, ha-eventmgr. Applied change to temp map.
2017-07-23T02:50:13.329Z info hostd[11340B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Event 168 : frosty_mahavira-b211861dc9b4 on  sc-rdops-vm07-dhcp-212-141.eng.vmware.com in ha-datacenter is stopping
2017-07-23T02:50:13.329Z error hostd[11340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Invalid transition requested (VM_STATE_OFF -> VM_STATE_POWERING_OFF): Invalid state
2017-07-23T02:50:13.329Z info hostd[11340B70] [Originator@6876 sub=Default opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] AdapterServer caught exception: vim.fault.InvalidPowerState
2017-07-23T02:50:13.329Z info hostd[11340B70] [Originator@6876 sub=Vimsvc.TaskManager opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Task Completed : haTask-1-vim.VirtualMachine.powerOff-169261726 Status error
2017-07-23T02:50:13.329Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: info, haTask-1-vim.VirtualMachine.powerOff-169261726. Applied change to temp map.
2017-07-23T02:50:13.330Z verbose hostd[11381B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.runtime, ha-root-pool. Sent notification immediately.
2017-07-23T02:50:13.337Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb2e user=vpxuser] RecordOp ASSIGN: info, haTask-ha-host-vim.HostSystem.queryOverheadEx-169261728. Applied change to temp map.
2017-07-23T02:50:13.338Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb2e user=vpxuser] RecordOp ASSIGN: info, haTask-ha-host-vim.HostSystem.queryOverheadEx-169261728. Applied change to temp map.
2017-07-23T02:50:13.343Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb30 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.event.EventHistoryCollector.readNext-169261729. Applied change to temp map.
2017-07-23T02:50:13.343Z verbose hostd[11340B70] [Originator@6876 sub=PropertyProvider opID=d5a9bb30 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.event.EventHistoryCollector.readNext-169261729. Applied change to temp map.
2017-07-23T02:50:13.344Z verbose hostd[10DFEB70] [Originator@6876 sub=PropertyProvider opID=d5a9bb31 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.event.EventHistoryCollector.readNext-169261730. Applied change to temp map.
2017-07-23T02:50:13.344Z verbose hostd[10DFEB70] [Originator@6876 sub=PropertyProvider opID=d5a9bb31 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.event.EventHistoryCollector.readNext-169261730. Applied change to temp map.
2017-07-23T02:50:13.347Z verbose hostd[10FC2B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: info, haTask--vim.vm.NamespaceManager.queryNamespace-169261731. Applied change to temp map.
2017-07-23T02:50:13.348Z info hostd[10FC2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_NAMESPACE_CHANGE)
2017-07-23T02:50:13.348Z verbose hostd[10FC2B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.
2017-07-23T02:50:13.349Z error hostd[10FC2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] Namespace 'ovf' not found.
2017-07-23T02:50:13.349Z info hostd[10FC2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_NAMESPACE_CHANGE -> VM_STATE_OFF)
2017-07-23T02:50:13.349Z verbose hostd[10FC2B70] [Originator@6876 sub=PropertyProvider opID=494f294e-aa-bb34 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: disabledMethod, 1. Sent notification immediately.

Seems the VM's power state is 'VM_STATE_OFF' but when we do a refresh, we get the state as powered on. Not really sure why this can be happening. Maybe @dougm may have some ideas.

We report in the portlayer logs:

WARN  Reconfigure: collision of concurrent operations - expected power state poweredOff, found poweredOn

The hostd logs doesn't seem to indicate any problems with reconfiguring. The VM seems to be powered off.

@sflxn
Copy link
Contributor

sflxn commented Aug 17, 2017

The code that is preventing us from hitting WaitForResult() to get the exit code was added a month ago in #5445 to fix another edge case. I think we can do a few things:

  • Find out why refresh is returning poweredon
  • Try to do a WaitForResult to attempt to get the exit code even if the powerstate does not match
  • Change the regression test to look for "stopped" OR "exited"
    The two have different meaning, but when a user execute docker stop, does he expect to see exited?

The best fix is the first one. There might be another edge case that might make the second one problematic.

@sflxn sflxn assigned sflxn and unassigned chengwang86 Aug 17, 2017
@mhagen-vmware
Copy link
Contributor

Hit this on longevity last night as well:
container-logs.zip

@sflxn
Copy link
Contributor

sflxn commented Aug 18, 2017

I've broken down the hostd of this issue. The VM has powered off BEFORE VC thinks the kill program doesn't exist. After talking to @dougm a few days ago, I think the're shutting down before VC detects guest tools. It appears to call it anyway and is successful. VC returns 3016, we attempt to perform poweroff while the VM is already powered off. VIC then proceeds to do a refresh and gets the wrong state value. This is the central problem. Why are we getting an incorrect powerstate value when the VM is clearly powered off?

Here are choice snippets from the logs:

# Why are stop and status being written to before kill is kicked off?
---------------------------------------------------------------------
2017-07-23T02:50:13.144Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.extraConfig["guestinfo.vice..sessions|b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36.status"].value, 1. Sent notification immediately.
2017-07-23T02:50:13.144Z verbose hostd[11D80B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config.extraConfig["guestinfo.vice..sessions|b211861dc9b40f2feea29a0dc0fc03563062d71e10da843d8d07447390deab36.detail.stoptime"].value, 1. Sent notification immediately.

# Start kill program
--------------------
2017-07-23T02:50:13.144Z info hostd[11D80B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 167 : Guest operation Start Program performed on Virtual machine frosty_mahavira-b211861dc9b4.

# VM is off
-----------
2017-07-23T02:50:13.161Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] State Transition (VM_STATE_GUEST_OPERATION -> VM_STATE_OFF)

# Believes kill failed even when VM is powered off above
--------------------------------------------------------
2017-07-23T02:50:13.178Z info hostd[11D80B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx] [N8Guestsvc23StartProgramRequestImplE:0x11a1c600] opCode=4 auth=<hidden> programPath=kill arguments=TERM failed

# Poweroff request
------------------
2017-07-23T02:50:13.271Z info hostd[10DFEB70] [Originator@6876 sub=Vimsvc.TaskManager opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Task Created : haTask-1-vim.VirtualMachine.powerOff-169261726

# Reconfiguring during power off
--------------------------------
2017-07-23T02:50:30.968Z info hostd[10F81B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=HB-SpecSync-host-21@45-49873338-f1-bb3e user=vpxuser:vpxuser] State Transition (VM_STATE_OFF -> VM_STATE_RECONFIGURING)

There is nothing concurrent happening at all! I think this might be a bug in VC (that we might be able to work around). I think because we tried to kick off the kill before VC has acknowledged guest tools, it gets confused and returns incorrect state data. I'll look at our code again to make sure it's not us. @dougm mentioned he is going to add some code to check if VC has seen guest tools. He's doing it for us to call in our CI tests. I think we should also call the same code before we try to execute a toolbox function (e.g. start kill program).

One last comment. We report concurrent access error in our portlayer logs. That is a made up error inside of the portlayer. That is not an error from VC.

@sflxn
Copy link
Contributor

sflxn commented Aug 18, 2017

One last thought. I see in the portlayer log that when we do an update, the change version after the stop is the same change version from after we started the container.

@sflxn
Copy link
Contributor

sflxn commented Aug 21, 2017

2017-07-23T02:50:13.329Z error hostd[11340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vsan:5285502b216bf7ad-ecd043de4ead6bc5/b50e7459-8015-c354-88e1-02001cd8bed9/frosty_mahavira-b211861dc9b4.vmx opID=494f294e-aa-bb2a user=vpxuser:VSPHERE.LOCAL\Administrator] Invalid transition requested (VM_STATE_OFF -> VM_STATE_POWERING_OFF): Invalid state

The above is the log from hostd after we attempt a poweroff operation. We only perform the poweroff because we believe the kill operation failed, but it actually succeeded. This supports the theory that since VC hasn't established the existence of the toolbox yet, successful operations (e.g. startguestprogram kill) fails to get registered and VC instead reports what it thinks is the toolbox not running.

Then when the poweroff attempts to power off the machine from an already powered off state, it sees an invalid transition and attempts to correct it's internal state machine. I believe it is around this time that we do a refresh and get an invalid state.

sflxn pushed a commit to sflxn/vic that referenced this issue Aug 21, 2017
If a container VM starts up and we stop it before VC detects the toolbox, startGuestProgram
returns an error 3016 even if the toolbox executed and the VM was powered off.  When this
happens, we try to power off the VM, throwing vSphere into an invalid state transition, which
it needs to recover from.  During this time, we have seen refreshing properties return invalid
powerstate value and prevents us from reading the exit code in portlayer's commit code.

To prevent this, we now wait to see if the container VM powered off after startGuestProgram,
regardless if it returns an error.

Resolves vmware#5803
@sflxn
Copy link
Contributor

sflxn commented Aug 22, 2017

I put out a PR #6077 that attempts to solve this problem back at the kill operation. Instead of only waiting for power state if startGuestProgram returns nil. I now wait for power state regardless. Hopefully, this will prevent the poweroff operation from executing when the VM is already off.

sflxn pushed a commit to sflxn/vic that referenced this issue Aug 22, 2017
If a container VM starts up and we stop it before VC detects the toolbox, startGuestProgram
returns an error 3016 even if the toolbox executed and the VM was powered off.  When this
happens, we try to power off the VM, throwing vSphere into an invalid state transition, which
it needs to recover from.  During this time, we have seen refreshing properties return invalid
powerstate value and prevents us from reading the exit code in portlayer's commit code.

To prevent this, we now wait to see if the container VM powered off after startGuestProgram,
regardless if it returns an error.

Resolves vmware#5803
sflxn pushed a commit to sflxn/vic that referenced this issue Aug 22, 2017
If a container VM starts up and we stop it before VC detects the toolbox, startGuestProgram
returns an error 3016 even if the toolbox executed and the VM was powered off.  When this
happens, we try to power off the VM, throwing vSphere into an invalid state transition, which
it needs to recover from.  During this time, we have seen refreshing properties return invalid
powerstate value and prevents us from reading the exit code in portlayer's commit code.

To prevent this, we now wait to see if the container VM powered off after startGuestProgram,
regardless if it returns an error.

Resolves vmware#5803
sflxn pushed a commit to sflxn/vic that referenced this issue Aug 22, 2017
If a container VM starts up and we stop it before VC detects the toolbox, startGuestProgram
returns an error 3016 even if the toolbox executed and the VM was powered off.  When this
happens, we try to power off the VM, throwing vSphere into an invalid state transition, which
it needs to recover from.  During this time, we have seen refreshing properties return invalid
powerstate value and prevents us from reading the exit code in portlayer's commit code.

To prevent this, we now wait to see if the container VM powered off after startGuestProgram,
regardless if it returns an error.

Resolves vmware#5803
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/defect Behavior that is inconsistent with what's intended priority/p0
Projects
None yet
Development

No branches or pull requests

7 participants