Nightly Test 5-6-1- VSAN Simple - container does not exit (InvalidPowerState) #5803
Seen in 13-1 vMotion VCH Appliance 6.5 7/30/17 nightly report: |
From 5-1-VSAN PL log:
This seems a bit different from #5981 (comment) wherein the error msg is
However, in this ticket the error message is (InvalidPowerState):
|
From 13-1-vMotion-VCH PL log:
Here we observe the same error msg (invalid state) as the above. |
It seems that in both failures reported above, the containerVMs were in the Here is a comparison of expected behavior vs. failure behavior when performing
During a successful
During the failed
I guess this might be because we didn't handle the containerVM's status and exitcode appropriately if it is in the invalid state. |
Discussed with @emlin . There are two issues here:
|
Here is the vpxd log snippet (when trying to stop the container)
|
I can partially reproduce this issue by taking the following steps (same output behavior but not all portlayer error msgs):
Then I see that the containerVM's status is marked as The above steps reproduce the Here is the code snippet where we power off the containerVM
#5803 relates to the first case. #5981 relates to the second case. However, here I suspect that |
Update:
|
This could also be related to #5629 since the stop could cause the VM to shutdown and we detect the powerstate before we read the exit code. |
@chengwang86 yes tether has the responsibility for setting the |
@cgtexmex Can we do a |
@chengwang86 we could but we'd have to do the update first (get any values that were written by tether), ensure that the stop / exitcode are missing and then set and commit. We are clearly not doing that now and I can't recall the reasoning... |
According to my discussion with @cgtexmex , the following is the workflow on the PL code path for container stop:
We need to add more log information to the PL stop code path to figure out why the handle's power state is still |
Guys, there is a strong chance the issue I mentioned, #5629, is related to this. The tether wrote out the status but we're not reading that exit code. In #5629, the property change status checks for a powerstate change first, so if the VM powers off before the property change is checked, it will notice the power state is off and ignore the exit code. The fix for #5629 was merged, so we'll see tonight if some of these "stopped" and not "exit" issues start to go away. |
@cgtexmex I have an issue in progress, #5981, that has the same error in the portlayer. As I look at the code, I think you're right. I didn't read @chengwang86's last summary of your discussion before I dug into the code and came to exactly the same spot. I also figured out the hard way how this code works as described in the summary. |
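To make the ordering problem concrete, here is a small sketch that looks at the exit code before the power state, so a VM that is already off still reports its exit code. The `vmChange` type and `classify` function are hypothetical and are not the change that was merged for #5629.

```go
package main

import "fmt"

// vmChange is a hypothetical snapshot of what one property update reports.
type vmChange struct {
	PoweredOff bool
	ExitCode   *int // nil when tether has not written an exit code
}

// classify checks the exit code before the power state, so a VM that has
// already powered off still has its exit code reported.
func classify(c vmChange) string {
	switch {
	case c.ExitCode != nil:
		return fmt.Sprintf("exited(%d)", *c.ExitCode)
	case c.PoweredOff:
		return "stopped"
	default:
		return "running"
	}
}

func main() {
	code := 0
	fmt.Println(classify(vmChange{PoweredOff: true, ExitCode: &code}))
	fmt.Println(classify(vmChange{PoweredOff: true}))
}
```

If the two cases were swapped, the first call would report "stopped" and the exit code would be dropped, which matches the symptom discussed in this thread.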
From hostd:
It seems the VM's power state is 'VM_STATE_OFF', but when we do a refresh, we get the state as powered on. Not really sure why this is happening. Maybe @dougm has some ideas. We report in the portlayer logs:
The hostd logs don't seem to indicate any problems with reconfiguring. The VM seems to be powered off. |
The code that is preventing us from hitting WaitForResult() to get the exit code was added a month ago in #5445 to fix another edge case. I think we can do a few things:
The best fix is the first one. There might be another edge case that might make the second one problematic. |
Hit this on longevity last night as well: |
I've broken down the hostd log for this issue. The VM has powered off BEFORE VC thinks the kill program doesn't exist. After talking to @dougm a few days ago, I think we're shutting down before VC detects guest tools. It appears to call it anyway and is successful. VC returns 3016, and we attempt to perform a poweroff while the VM is already powered off. VIC then proceeds to do a refresh and gets the wrong state value. This is the central problem. Why are we getting an incorrect powerstate value when the VM is clearly powered off? Here are choice snippets from the logs:
There is nothing concurrent happening at all! I think this might be a bug in VC (that we might be able to work around). I think because we tried to kick off the kill before VC has acknowledged guest tools, it gets confused and returns incorrect state data. I'll look at our code again to make sure it's not us. @dougm mentioned he is going to add some code to check if VC has seen guest tools. He's doing it for us to call in our CI tests. I think we should also call the same code before we try to execute a toolbox function (e.g. start kill program). One last comment. We report concurrent access error in our portlayer logs. That is a made up error inside of the portlayer. That is not an error from VC. |
One last thought. I see in the portlayer log that when we do an update, the change version after the stop is the same change version from after we started the container. |
The above is the log from hostd after we attempt a poweroff operation. We only perform the poweroff because we believe the kill operation failed, but it actually succeeded. This supports the theory that since VC hasn't established the existence of the toolbox yet, successful operations (e.g. startguestprogram kill) fail to get registered and VC instead reports what it thinks is the toolbox not running. Then when the poweroff attempts to power off the machine from an already powered off state, it sees an invalid transition and attempts to correct its internal state machine. I believe it is around this time that we do a refresh and get an invalid state. |
If a container VM starts up and we stop it before VC detects the toolbox, startGuestProgram returns an error 3016 even if the toolbox executed and the VM was powered off. When this happens, we try to power off the VM, throwing vSphere into an invalid state transition, which it needs to recover from. During this time, we have seen refreshing properties return an invalid powerstate value, which prevents us from reading the exit code in portlayer's commit code. To prevent this, we now wait to see if the container VM powered off after startGuestProgram, regardless of whether it returns an error. Resolves vmware#5803
I put out a PR, #6077, that attempts to solve this problem back at the kill operation. Instead of only waiting for the power state if startGuestProgram returns nil, I now wait for the power state regardless. Hopefully, this will prevent the poweroff operation from executing when the VM is already off. |
Seen in 6.5:
5-6-1-VSAN-Simple.zip
Initial Thoughts: Is this an issue confirming insecure registries?