Guest shutdown support in endpointVM #6943
Conversation
Force-pushed 7b90176 to f5f6e1a
Force-pushed 7c9fb5e to 903e5df
General question: should all vic sessions be gone after uninstall? I'm assuming yes, and that this change should address those orphaned sessions...
Need to test against VC:
@cgtexmex has also noted
Force-pushed 75c40fe to dac4259
Force-pushed dac4259 to b992908
Debugging the leftover dynamic config client: tether.debug full shutdown trace
before delete:
after delete:
We can see in the tether.debug that we've called the client logout, and that there's only one from the
@dougm any ideas? We're calling session.Logout but it's not reliably taking effect.
@hickeng I don't know what vic-dynamic-config is, but maybe the vapi client used for the tags API needs to terminate its session?
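For concreteness, a minimal sketch of terminating both sessions, assuming a govmomi SOAP client plus a vapi REST client for the tags API (the function and variable names here are illustrative, not from this PR):

```go
import (
	"context"

	"github.com/vmware/govmomi/session"
	"github.com/vmware/govmomi/vapi/rest"
	"github.com/vmware/govmomi/vim25"
)

// logoutAll ends both the SOAP session and the separate vapi (REST) session;
// logging out of one does not terminate the other on the server.
func logoutAll(ctx context.Context, vc *vim25.Client, rc *rest.Client) {
	// end the SOAP session used for vmomi calls
	if err := session.NewManager(vc).Logout(ctx); err != nil {
		// log and continue; we still want to drop the REST session
	}
	// end the REST session used by the tags API
	_ = rc.Logout(ctx)
}
```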
This is mostly code I have limited exposure to, so the review focuses on more superficial things. None of these comments are blockers.
Fixes tether unit tests to run as non-root (again).
Should this be a separate commit?
cmd/tether/main_linux.go
Outdated
go func() {
	for s := range sigs {
		switch s {
		case syscall.SIGHUP:
			log.Infof("Reloading tether configuration")
			tthr.Reload()
		case syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGPWR:
Why include both SIGUSR1 and SIGUSR2 here? Is there a chance we'd ever want to use one for something other than "stop" in the future?
Added comments in the code for posterity.
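For reference here too, the gist of those comments as a sketch (the exact wording in the PR may differ):

```go
case syscall.SIGHUP:
	log.Infof("Reloading tether configuration")
	tthr.Reload()
// SIGUSR1, SIGUSR2 and SIGPWR all mean "stop" today; they are grouped in one
// clause so any of them can later be split out for a different purpose
// without changing the semantics of the others.
case syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGPWR:
	log.Infof("Stopping tether via signal %s", s.String())
	tthr.Stop()
```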
cmd/tether/main_linux.go
Outdated
@@ -123,14 +123,22 @@ func halt() {

 func startSignalHandler() {
 	sigs := make(chan os.Signal, 1)
-	signal.Notify(sigs, syscall.SIGHUP)
+	signal.Notify(sigs, syscall.SIGHUP, syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGPWR, syscall.SIGTERM, syscall.SIGINT)

 	go func() {
 		for s := range sigs {
 			switch s {
 			case syscall.SIGHUP:
 				log.Infof("Reloading tether configuration")
nit: could we add via signal %s to this log message for consistency? (Someone looking at the code can see that there's only one signal that generates this log line, but I think it's better not to assume everyone looking at the log messages is also reading the code.)
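Something like the following, with hypothetical wording:

```go
log.Infof("Reloading tether configuration via signal %s", s.String())
```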
cmd/vic-init/main_linux.go
Outdated
@@ -201,3 +254,28 @@ func defaultIP() string {

 	return toolbox.DefaultIP()
 }
+
+func startSignalHandler() {
Do we have a pattern for sharing base functionality between vic-init and tether? Consistency in the way we handle signals seems like something we'd want to preserve in the future (if for no other reason than developer sanity), and duplicating the code allows for unintentional drift.
There is a good chunk of shared function in lib/tether; however, the code in cmd/tether and cmd/vic-init is expressly that needed to support different behaviours.
What you're proposing is definitely desirable, but I think it necessitates the toolbox interface for system interaction that I noted above. This would be needed in order to avoid the package-local tthr skyhook that is used in this function.
I'll revisit this; the most that I'm likely to do prior to designing the system interaction portion of vcsim is to make this a lib/tether function that takes tthr as an argument, as sketched below.
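A sketch of that extraction, assuming the Tether interface in lib/tether exposes Reload and Stop (the method set and signature here are assumptions, not the final API):

```go
package tether

import (
	"os"
	"os/signal"
	"syscall"

	log "github.com/Sirupsen/logrus"
)

// StartSignalHandler installs the signal handling shared by cmd/tether and
// cmd/vic-init, taking the tether instance explicitly instead of relying on
// a package-local variable.
func StartSignalHandler(t Tether) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP, syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGPWR, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		for s := range sigs {
			switch s {
			case syscall.SIGHUP:
				log.Infof("Reloading tether configuration via signal %s", s.String())
				t.Reload()
			default:
				log.Infof("Stopping tether via signal %s", s.String())
				t.Stop()
			}
		}
	}()
}
```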
For now, can we at least add a comment in each place referring to the other? That might increase the chances of remembering to update one when we update the other.
// The EventManager cannot be destroyed like this as it isn't a ManagedEntity.
// TODO: we do need to ensure the EventHistoryCollector is destroyed, but that's a specific call
// and requires a govmomi change. At least it is lifecycle coupled with the session.
Can we file an issue for this TODO and cross-link it with this code change?
Fixed in PR vmware/govmomi#962 - comment just needs removing.
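For posterity, explicit destruction of an event history collector with govmomi looks roughly like this (a sketch; the filter and surrounding flow are placeholders):

```go
import (
	"context"

	"github.com/vmware/govmomi/event"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

func collectEvents(ctx context.Context, c *vim25.Client, filter types.EventFilterSpec) error {
	m := event.NewManager(c)
	collector, err := m.CreateCollectorForEvents(ctx, filter)
	if err != nil {
		return err
	}
	// destroy the EventHistoryCollector explicitly rather than leaving it to
	// be reaped when the session ends
	defer collector.Destroy(ctx)

	// ... read from the collector ...
	return nil
}
```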
lib/portlayer/portlayer.go
Outdated
// store.Finalize(ctx)
exec.Finalize(op)
// network.Finalize(ctx)
// logging.Finalize(ctx)
Why are some of these Finalize calls commented out? I don't see an explanation here, in the commit message, or in the PR summary.
Because at the moment the methods don't exist, as there's no finalization required for session handling in those packages.
I'll add the stub functions in.
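A sketch of such a stub, with the signature inferred from the calls in the diff (this is an assumption, not the committed code):

```go
package network

import "context"

// Finalize releases any session-coupled state held by this package. It is a
// no-op today because network requires no session finalization, but it keeps
// the portlayer shutdown path uniform across components.
func Finalize(ctx context.Context) error {
	return nil
}
```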
err := ts.ln.Close()

// so far I have not identified any routines that need to be
// explicitly stopped after the underlying connection is closed.
Is there a symptom someone might observe that would indicate that a routine would need to be explicitly stopped here? If so, can we document that?
We can add a basic doc note:
// if the vspc server fails to shut down in a timely manner then the clean endpointVM shutdown will fail.
// this failure to shut down cleanly can be observed in the "/var/log/vic/init.log" or
// "[datastore] endpointVM/tether.debug" log files as vic-init will report that it's waiting for the portlayer to exit.
@dougm
Force-pushed 967a696 to 3e8b2d3
Force-pushed 3e8b2d3 to 904c0d3
Force-pushed 72f1ab2 to ef8da5e
Tracking down how vicadmin gets the VCH name for presentation on the dashboard page. It uses
Force-pushed 85186e0 to b9c0a41
Force-pushed e429f9d to 73b4a4e
Force-pushed 44290d4 to 06ba697
@@ -532,7 +544,7 @@ func newSession(ctx context.Context, config *config.VirtualContainerHostConfigSp
 		User:        url.UserPassword(config.Username, config.Token),
 		Thumbprint:  config.TargetThumbprint,
 		Keepalive:   defaultSessionKeepAlive,
-		UserAgent:   version.UserAgent("vic-engine"),
+		UserAgent:   version.UserAgent("vic-dynamic-config"),
At least this portion of this PR should be included in any cherry-pick of #7887. Therefore, this may be worth submitting separately (in a way that more clearly references the previous commit).
Force-pushed 3f6f4c7 to d3338cb
This ensures that each vmomi client has a unique user-agent string so it is possible to link vmomi sessions back to the originating component within a VCH. The only user-agent that needed updating was the one used to pull project specific configuration from admiral/harbor. This follows on from PR vmware#7887
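For illustration, the field being set ultimately lands on govmomi's soap.Client, whose user-agent is what shows up in vSphere's session list; a sketch with an assumed helper name:

```go
import (
	"net/url"

	"github.com/vmware/govmomi/vim25/soap"
)

// newSOAPClient returns a client whose sessions can be traced back to a
// specific VCH component via the user-agent string.
func newSOAPClient(u *url.URL) *soap.Client {
	sc := soap.NewClient(u, false)
	sc.UserAgent = "vic-dynamic-config" // unique per component within the VCH
	return sc
}
```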
This restores use of the mechanisms allowing us to run tether tests as non-root. In this case it's simply ensuring we are checking whether there is a directory prefix that should be prepended.
The disk manager is intended to be a singleton that controls locking and use of the storage subsystems. It may function with soft failures if there are multiple instances of it, but it's not tested or intended to be used in that manner. This updates the container, image and volume logic to use a singleton disk manager that's held at the portlayer level instead of within the individual subsystems.
There are multiple places in the code where we have to call power off as a precaution if a clean shutdown path does not complete in the expected timeframe. This is inherently a race with the in-guest shutdown so we need to be able to detect if an error from the power off call is because the VM is _already_ powered off and can be squashed, or is an actual failure to power off and needs to be propagated. This commit adds a convenience method for performing that check and switches existing checks to use it.
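A sketch of such a check using govmomi's fault types (the helper name is illustrative; the commit's actual method may differ):

```go
import (
	"github.com/vmware/govmomi/task"
	"github.com/vmware/govmomi/vim25/types"
)

// squashPowerOffError returns nil when err only reports that the VM is
// already powered off - i.e. the in-guest shutdown won the race - and
// propagates any other failure unchanged.
func squashPowerOffError(err error) error {
	if terr, ok := err.(task.Error); ok {
		if ips, ok := terr.Fault().(*types.InvalidPowerState); ok &&
			ips.ExistingState == types.VirtualMachinePowerStatePoweredOff {
			return nil
		}
	}
	return err
}
```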
The interfaces within the endpointVM get renamed to their network roles and an entry is added to /etc/hosts so that it's easy for services to bind to specific network roles irrespective of actual network configuration. As components do not launch until the network is fully configured and initialized this works as expected, but fails for the pprof server embedded in vic-init that is performing the network initialization. This adds logic to block the pprof initialization until the expected name (client.localhost) resolves. This only impacts execution with debug level of >=2.
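The gate described can be as simple as polling the resolver; a sketch assuming the name from the commit message:

```go
import (
	"net"
	"time"
)

// waitForName blocks until name resolves, so the pprof listener only binds
// after the interface rename and /etc/hosts entry are in place.
func waitForName(name string) {
	for {
		if _, err := net.LookupHost(name); err == nil {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
}

// e.g. waitForName("client.localhost") before starting the pprof server
```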
Force-pushed d3338cb to 5e23b1c
A few minor comments, but no serious concerns.
cmd/docker/main.go
Outdated
@@ -122,12 +123,16 @@ func main() {
 	serveAPIWait := make(chan error)
 	go api.Wait(serveAPIWait)

+	// signal.Trap explicitly calls os.Exit so an exit logic has to be rolled in here
... so an exit logic ...
Should this be "so any exit logic"?
	tthr.Stop()
case syscall.SIGTERM, syscall.SIGINT:
	log.Infof("Stopping system in lieu of restart handling via signal %s", s.String())
	// TODO: update this to adjust power off handling for reboot
Should this reference a GitHub issue?
I'm not sure it's worth it. Tether needs a heavy redesign - this comment was purely a note for if/when we copy code into a new implementation.
I also do not have an issue for reworking tether - it's part of the larger design exercise to add support for extensions etc. that I'd consider part of a 2.0.
	log.Warn(err)
}

if debugLevel > 2 {
Do we need to document this change? (And the corresponding change to reboot, below.)
This is actually bringing it in line with other behaviours, where >1 introduces small behaviour changes and >2 introduces the more significant behavioural changes.
As for where we document this - yes, it does need documenting... as do all of the others.
vmware/vic-tasks#84
// this is a VERY basic throttle so that we don't DoS the remote when the client is NotAuthenticated.
// this should be removed/replaced once there is structured connection management in place to trigger re-authentication as required.
time.Sleep(2 * time.Second)
Do I recall you mentioning that we had a standardized way to do back-off?
The retry package - there are some good examples around in the code. See https://github.com/vmware/vic/blob/master/pkg/retry/retry.go
We do have the retry package, but we have knowledge about the keepalive triggering reauth (20s at this time) and don't expect to need or want exponential backoff. That package also assumes time-bounds (e.g. max retry time) whereas this is expected to be indefinite.
Basically, while I could probably force the backoff package to do basic throttling, I don't think it's a good fit here. We may want an actual throttle package; however, what we're really after here is structured connection management so this doesn't have to occur inline in product code at all.
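To make the distinction concrete: a fixed-interval throttle is all that is wanted here, versus the retry package's bounded backoff. A sketch (the 2s interval is from the diff; the rest is assumed):

```go
import "time"

// throttled retries fn indefinitely, at most once per interval - constant
// rate, no exponential growth, no overall deadline.
func throttled(interval time.Duration, fn func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		if err := fn(); err == nil {
			return
		}
		// e.g. still NotAuthenticated - wait for the next tick and retry
	}
}

// usage: throttled(2*time.Second, reauthenticate)
```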
network.Finalize(ctx)
logging.Finalize(ctx)
metrics.Finalize(op)
It seems likely that we'll forget to update this if we have a new component. (However, I don't have a specific suggestion for improving this.)
I have suggestions ;) An extension interface that has structured registration, initialization, finalization, etc. Too much for now I think.
 // Create view of VirtualMachine objects under the VCH's resource pool
-Config.ContainerView, err = mngr.CreateContainerView(ctx, pool.Reference(), []string{"VirtualMachine"}, true)
+Config.ContainerView, err = mngr.CreateContainerView(op, pool.Reference(), []string{"VirtualMachine"}, true)
Do we have a constant we could reference instead of VirtualMachine?
Not that I can find. @dougm?
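In the absence of an exported govmomi constant, a package-local one would at least remove the repeated string literal; a sketch:

```go
// vSphere managed object type name used when creating container views
const virtualMachineType = "VirtualMachine"
```

and then []string{virtualMachineType} at the call sites.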
lib/install/management/appliance.go
Outdated
	}
}

if err != nil || !tools {
The way this is currently written, I find myself wondering things like "In what cases will we hit this block?" (Where was err set? In what cases will/won't it be nil? Have we already logged the err?)
I don't have a specific suggestion for how to improve this method, but I don't think the current structure is the clearest way to communicate the intended flow.
I think it can be structured like this:
if tools {
    log message about trying to use GuestShutdown
    if success, return; otherwise log error message
}
log message about doing VM power off
do VM power off
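That suggestion as a Go sketch; the receiver, names and signatures are assumptions drawn from the surrounding diff, not the final code:

```go
import (
	"context"

	"github.com/vmware/govmomi/object"
)

func (d *Dispatcher) shutdown(ctx context.Context, vm *object.VirtualMachine, tools bool) error {
	if tools {
		d.op.Infof("Attempting clean shutdown of the endpointVM via guest tools")
		err := vm.ShutdownGuest(ctx)
		if err == nil {
			return nil
		}
		d.op.Debugf("Guest shutdown failed: %s", err)
	}

	d.op.Warnf("Powering off the endpointVM - sessions will be left open")
	task, err := vm.PowerOff(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```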
Force-pushed 5e23b1c to aa54573
The https://ci-vic.vmware.com/vmware/vic/19731/7 failure is due to a parallel-CI issue generating CA certs.
Force-pushed aa54573 to 2e0abb2
lib/install/management/appliance.go
Outdated
	d.op.Debugf("Guest shutdown failed: %s", err)
}

d.op.Warnf("Guest tools unavailable, resorting to power off - sessions will be left open")
I'm not sure Guest tools unavailable is accurate after the restructuring.
(Aside: I find the new structure much easier to read.)
I did not do a complete re-review, but a quick spot-check of areas where changes were requested only yielded the above issue about the log message.
This adds neat shutdown support in the endpointVM:
* cleans up Views, Managers and sessions in portlayer
* cleans up session in personality
* cleans up and purges all active sessions in vicadmin
Updates the way tether waits for sessions to exit. Updates vic-machine to use guest shutdown for endpointVM.
Force-pushed 2e0abb2 to effcb4c
This adds neat shutdown support in the endpointVM:
* Unifies diskmanager usage on a single instance for the storage component. This is unrelated to the rest of the PR but is the correct usage given locking assumptions and will reduce retries of vmomi operations.
* Updates vic-machine to use guest shutdown for polite shutdown first.
* Fixes tether unit tests to run as non-root (again).
Notes:
* Need to find an effective way to test this. I was planning on adding the name of the VCH to the client but that ran into the immovable object that is session.Session usage. I still think this would be a useful UX enhancement for both us and customers, and session.Session definitely needs attention.
* Currently there are still occasional dangling sessions from the personality admiral client. In this case client.Logout gets called but the logout does not occur. In the last full-ci run this happened twice:
* This is not ready to merge until the hardcoded filtering for 1.4.0-dev sessions is improved. The filter matches -dev, skipping the upgrade-leaked sessions. This can be removed once old versions are not expected to leak sessions.
Todo:
* Filtering uses -dev to omit sessions from GA code. This means it will not currently match RC builds, but given we're only using it for warning currently I'm ok with that.
* could not initialize port layer errors rather than log.Fatal