Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State race when updating container config #5445

Closed
hickeng opened this issue Jun 12, 2017 · 7 comments · Fixed by #5532
Closed

State race when updating container config #5445

hickeng opened this issue Jun 12, 2017 · 7 comments · Fixed by #5532
Assignees
Labels
component/portlayer/execution kind/defect Behavior that is inconsistent with what's intended priority/p0 source/customer Reported by a customer, directly or via an intermediary
Milestone

Comments

@hickeng
Copy link
Member

hickeng commented Jun 12, 2017

From: #4872

We have observed one incidence in the logs of a container having its state damaged.
From the logs we can see multiple concurrent operations:

  1. an in-guest shutdown resulting from explicit invocation of exit in the shell
  2. a stop operation
  3. a restart operation

My hypothesis is that the powerOn from the restart (3) is occuring just after the powerOff operation from (2) has succeeded, but before the container configuration is updated with the revised StoppedTime.
This results in us using a full config as that's based on the target power state, not the non-persistent portion suited to a powered on cVM. This hypothesis was generated by code inspection with input from the logs.

@cgtexmex and I have been unable to recreate this failure remotely reliably with released codebase, so I have injected code to manipulate the timings. This has resulted in the expected SetVolatileEntry lines in the container vmware.log.

In specific, this adds a sleep of 5s before https://github.com/vmware/vic/blob/master/lib/portlayer/exec/commit.go#L119 and races:

  1. docker stop -t 1 id
  2. docker start id

hickeng update: this can be mostly addressed in VIC code, but not completely without bug1898149 - I will let this issue get closed by the work around, and keep the platform issue open.

@hickeng hickeng added component/portlayer/execution kind/defect Behavior that is inconsistent with what's intended kind/bug/no-repro source/customer Reported by a customer, directly or via an intermediary labels Jun 12, 2017
@hickeng hickeng self-assigned this Jun 12, 2017
@hickeng hickeng added this to the Sprint 11 milestone Jun 12, 2017
@hickeng
Copy link
Member Author

hickeng commented Jun 12, 2017

Update:
Tight running of the following two commands in separate shells is managing to cause the missing config, even with speculative fix. However the config is being restored, presumably because it's already been captured in the handle of the contending operation.
Will continue to look for path that causes this. May need to fix #5388 first as that's impacting diagnostics.

docker create -dit --name=test busybox

while true; do
  docker restart test
done &

while true' do
  docker start test
  docker stop test
done

@hickeng
Copy link
Member Author

hickeng commented Jun 19, 2017

Log bundle with portlayer and hostd logs that capture the time of failure. Took a quick scan of hostd but didn't see obvious issue, barring the fact that the ChangeVersions in concurrent access reports from hostd don't show up in the portlayer log. Will ensure those are logged.
state-race.zip

Considering a two pronged approach to this:

  1. continue debugging what's occurring as captured in this issue.
  2. address [portlayer] reconfigure call scales linearly #4922 whereby we limit the ExtraConfig entries to only those that have changed. This is more of a workaround for this issue, but should mostly address the observed problem.

@hickeng
Copy link
Member Author

hickeng commented Jun 19, 2017

Managed to recreate locally with trivia logging for vimsvc and vmsvc. hostd:
hostd.0.gz
vmware.log for containerVM

Initial item of interest is the fact that we see the read-only keys being deleted here:

2017-06-19T16:10:30.737Z verbose hostd[FFE34A80] [Originator@6876 sub=PropertyProvider] RecordOp REMOVE: config.extraConfig["guestinfo.vice./sessions|5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d/User"], 1571. Sent notification immediately.

which is the result of us sending a full ExtraConfig during this reconfigure when the VM is on (change version: 2017-06-19T16:10:26.408734Z):

2017-06-19T16:10:27.103Z info hostd[738E8B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/56a2eeef-3cafae6e-2bd3-000c29b00df4/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d.vmx opID=33777692 user=drone] State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)

The specific point of interest is that the Handle we have in the port layer with that change version thinks the power state is off:

Jun 19 2017 16:10:26.877Z INFO  Reconfigure: attempting update to 5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d with change version 2017-06-19T16:10:26.408734Z (poweredOff)

Hypotheses:

  1. vSphere is not incrementing the changeversion for power operation (or not atomically with the op)
  2. RetrieveProperties is returning torn data (i.e. changeversion and power state are not consistent)

@hickeng
Copy link
Member Author

hickeng commented Jun 19, 2017

Checking ChangeVersion with default VM created by govc - we can see that power on updates ChangeVersion but power off does not.

vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc vm.create -net="VM Network" -on=false tester
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc object.collect '/ha-datacenter/vm/tester' 'config.changeVersion'
config.changeVersion  string  2017-06-19T18:03:20.716543Z
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc vm.power -on tester
Powering on VirtualMachine:1574... OK
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc object.collect '/ha-datacenter/vm/tester' 'config.changeVersion'
config.changeVersion  string  2017-06-19T18:03:33.684303Z
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc vm.power -off tester
Powering off VirtualMachine:1574... OK
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc object.collect '/ha-datacenter/vm/tester' 'config.changeVersion'
config.changeVersion  string  2017-06-19T18:03:33.684303Z
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc vm.power -on tester
Powering on VirtualMachine:1574... OK
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$ govc object.collect '/ha-datacenter/vm/tester' 'config.changeVersion'
config.changeVersion  string  2017-06-19T18:03:51.121467Z
vagrant@devbox:~/vicsmb/src/github.com/vmware/vic$

@hickeng
Copy link
Member Author

hickeng commented Jun 19, 2017

Ordering of operations:

power on dispatch:

2017-06-19T16:10:26.705Z info hostd[73E91B70] [Originator@6876 sub=Vimsvc.TaskManager opID=33777681 user=drone] Task Created : haTask-1571-vim.VirtualMachine.powerOn-251

Getting runtime state for update of Handle:

2017-06-19T16:10:26.997Z trivia hostd[73A9FB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/56a2eeef-3cafae6e-2bd3-000c29b00df4/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d.vmx opID=3377768d user=drone] Retrieved runtime information for vm: 7

power on success:

2017-06-19T16:10:27.064Z info hostd[73E91B70] [Originator@6876 sub=Vimsvc.TaskManager] Task Completed : haTask-1571-vim.VirtualMachine.powerOn-251 Status success

reconfigure dispatch:

2017-06-19T16:10:27.094Z info hostd[73A9FB70] [Originator@6876 sub=Vimsvc.TaskManager opID=33777692 user=drone] Task Created : haTask-1571-vim.VirtualMachine.reconfigure-263

start of volatile entry updates:

2017-06-19T16:10:27.133Z trivia hostd[73A9FB70] [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/56a2eeef-3cafae6e-2bd3-000c29b00df4/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d/5d29b562b2c49c03c18a78e2d6ae8e1a132e00441172260ad0de24f6e9a71d4d.vmx opID=33777692 user=drone] op "Set VMX volatile raw config entry" calling VigorOnlineRPC_ConfigParamsSetVolatileEntry

reconfigure success:

2017-06-19T16:10:28.784Z info hostd[73A9FB70] [Originator@6876 sub=Vimsvc.TaskManager opID=33777692 user=drone] Task Completed : haTask-1571-vim.VirtualMachine.reconfigure-263 Status success

@hickeng
Copy link
Member Author

hickeng commented Jun 19, 2017

Added #4922 as that should be addressed by the same workaround - modifying the reconfigure spec to only include changed keys.

@hickeng
Copy link
Member Author

hickeng commented Jun 20, 2017

PR is in WIP because has dependencies on other unmerged PRs. Moving to verify anyway for ci input.

@mdubya66 mdubya66 modified the milestones: Sprint 12, Sprint 11 Jun 20, 2017
hickeng added a commit that referenced this issue Jul 10, 2017
* Updates extraconfig logging config
* Fixes syslogd configuration and reduces some log verbosity
* Adds support for delta sets of extraconfig keys
* Stops overwriting exit code on CLI stop call
* Generate stop event after state refresh to get consistent state
* Adjusts the syslog queue size to deal with excessive overflows.
* Correct handle refreshFromHandle behaviour on the power off path.
* Disables the exit code check for migrated containers until #5653 is
addressed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/portlayer/execution kind/defect Behavior that is inconsistent with what's intended priority/p0 source/customer Reported by a customer, directly or via an intermediary
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants