lxc start fails despite stopped state #13453
@tomponline thanks for brainstorming this issue earlier; I've collected some additional information to ease debugging. I discovered a couple more things that are relevant:

The following script reproduces the error while collecting `lxc monitor` logs. For me this fails reliably in about 5 seconds.

```sh
#!/bin/sh
NAME=me
LOG=./reproduced.log

# stop if started
lxc stop $NAME 2> /dev/null || true

while true; do
    # monitor in the background; the log is overwritten when the repro fails
    (lxc monitor $NAME --pretty > $LOG) &
    MONITOR_PID=$!

    echo "starting instance"
    lxc start $NAME

    # wait until dbus is available (required for shutdown to work)
    lxc exec $NAME -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'

    # shutdown from the host
    echo "shutting down the instance"
    lxc exec $NAME -- shutdown -H now

    while true; do
        status=$(lxc list $NAME -cs --format csv)
        if [ "$status" = STOPPED ]; then
            # this will fail sometimes
            lxc start $NAME
            rc="$?"
            if [ "$rc" -ne 0 ]; then
                echo "lxc start failed"
                rerun="$(lxc list -cs --format csv $NAME)"
                echo "lxc state immediately after failure: $rerun"
                exit "$rc"
            fi
            # no error; stop the monitor and the instance, then retry
            echo "did not repro, retrying"
            lxc stop $NAME
            kill $MONITOR_PID 2> /dev/null
            break
        fi
    done
done
```
And here is a monitor log collected using the script above.
Perfect, thanks. That confirms my suspicion that there is a tiny window of time where liblxc reports the container's state as stopped before it has notified LXD that the guest has self-stopped, which is what triggers LXD's stop cleanup operation (and it is that operation which then reports the container's status as running).
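For anyone wanting to watch the window directly, here's a minimal polling sketch (not from the original thread; it assumes an already-running instance named `me` and GNU date for sub-second timestamps):

```sh
#!/bin/sh
# Poll LXD's view of the instance state during a guest-initiated
# shutdown and timestamp every transition. In the race window the
# status can briefly read "Stopped" before the stop cleanup
# operation flips it back to "Running". Ctrl-C to stop.
NAME=me   # assumed instance name

lxc exec $NAME -- shutdown -H now

prev=""
while true; do
    # "status" comes from the /1.0/instances/<name>/state endpoint
    cur=$(lxc query /1.0/instances/$NAME/state | grep '"status":')
    if [ "$cur" != "$prev" ]; then
        echo "$(date +%T.%N) $cur"
        prev="$cur"
    fi
done
```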
Thanks for the script Brett. When shutting down a container, LXC sets the container state to STOPPED before LXD has finished reacting to the stop. If LXD's get_state request races with the shutdown, it can read STOPPED while LXD's own stop cleanup hasn't started yet. The LXC log for the container confirms the ordering.
The socket is closed partway through handling of the get_state command. If lxc could close the socket and wait for existing requests to complete before continuing with the container stop, then we wouldn't see this behavior. @mihalicyn might disagree, but I doubt that this is super straightforward/possible. My naive attempt with SO_LINGER didn't help.

I've done a little testing and this doesn't appear to impact VM instances.
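For completeness, the two views can be compared side by side with liblxc's own CLI. A sketch under a couple of assumptions (the `lxc-utils` tools are installed, and a snap-installed LXD keeps its containers under `/var/snap/lxd/common/lxd/containers`):

```sh
#!/bin/sh
# Ask liblxc and LXD for the same container's state. In the window
# described above, liblxc reports STOPPED while LXD's stop cleanup
# operation still reports the instance as RUNNING.
NAME=me                                       # assumed instance name
LXCPATH=/var/snap/lxd/common/lxd/containers   # assumed snap install path

lxc-info -P "$LXCPATH" -n "$NAME" -s   # liblxc's view
lxc list "$NAME" -cs --format csv      # LXD's view
```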
Thanks for digging into this @MggMuggins!
Do you think that this deserves an upstream bug report to lxc?
Digging around in lxc's source code, it looks like there are some timeouts that are set as well - I'm not sure if that may have an effect. I'd be curious to take a look if you have a public copy of your effort.
This doesn't work (I think) because SO_LINGER only prevents queued messages from being dropped; lxc hasn't queued a response to the client's request yet, so the race still exists. See canonical/lxd#13453 Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
When I say naive, I mean really naive: MggMuggins/lxc@0efd5c6

I considered a bug report but dismissed it, since lxc does report a consistent state transition on its own side; the stale read happens on LXD's end.
:-)
Sounds great, thanks for digging further. Please let me know how it goes either way.
Fixes canonical#13453 Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
I think my "fresh eyes" were just "poor memory eyes"... 😅. I got as far as implementing our ask of upstream.
But that doesn't resolve this; it just punts the race down the road a ways. Even if lxc waits for in-flight requests to complete before closing the command socket, a state query can still land just before the stop begins and return a result that is stale by the time the client acts on it.

Fundamentally, LXD (and clients) cannot make race-free decisions based on the state of instances, because LXD does not maintain a canonical source for instance state; it is a middle-man between liblxc and its clients. Without a bunch of design work I don't think it's feasible to truly fix this. However, checking for an ongoing operation after lxc returns significantly reduces the likelihood that a stale status is reported.

I suspect that my initial assessment WRT VMs was wrong; they are likely affected by a similar race.
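In the meantime, clients can defend themselves the same way: treat a failed start as possibly racing with stop cleanup and retry once no operation references the instance. A rough sketch (retry count and sleep interval are arbitrary; it assumes instance operations list the instance under their resources, which recent LXD does):

```sh
#!/bin/sh
# Retry `lxc start`, draining any in-flight operations between
# attempts. This doesn't remove the race; it just makes the window
# far less likely to bite.
NAME=me   # assumed instance name

for attempt in 1 2 3 4 5; do
    if lxc start "$NAME"; then
        exit 0
    fi
    echo "start attempt $attempt failed; waiting for operations to drain"
    # Wait until no operation's resources mention this instance.
    while lxc query "/1.0/operations?recursion=1" | grep -q "/1.0/instances/$NAME"; do
        sleep 0.2
    done
done
echo "giving up: $NAME still will not start" >&2
exit 1
```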
If an instance self-stops while `statusCode()` is waiting for `getLxcState()` to finish, `statusCode()` may return a stale instance state. This PR is a workaround for the use-case in #13453 and significantly reduces the likelihood that `statusCode` returns a stale status.

In an ideal world, LXD would maintain a canonical cluster-wide view of instance state. This would allow making race-free decisions based on whether an instance is running or not. For example:

- Project CPU/RAM limits could be enforced at instance start instead of at instance creation
- Volumes with content-type block could be attached to more than one instance without `security.shared`; instance start could fail if another instance with any shared block volumes is already running.
Thanks so much for working on this @MggMuggins and @tomponline! I'd like to have an idea of how frequently this race still occurs. In your testing, did the reproducer still trigger eventually with the fix applied?
I got 15-20 iterations with no race; for comparison, the script always reproduced it on the first try for me. I didn't see it again after the fix.
Fixes canonical#13453 Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com> (cherry picked from commit a7e88b0)
Required information
Issue description

When an instance has recently shut down and reports a state of `STOPPED`, an attempt to start the instance may fail with the error `The instance is already running`. I observe this on occasion while manually running commands, but the behavior never bothered me enough to report it. I've now discovered a bug in our test framework that prevents us from running certain tests, so I'm reporting it.

Steps to reproduce

I see this in an integration test that fails intermittently, and I can reproduce it locally by running the test in a loop. A trivial reproducer could be made that launches an instance, shuts it back down (our test uses guest-initiated shutdown), then waits for the `STOPPED` state before running `lxc start`; a sketch follows.
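A minimal sketch of such a reproducer (not part of the original report; it assumes an existing instance named `me` whose guest can shut itself down):

```sh
#!/bin/sh
# Shut the guest down from inside, wait for LXD to report STOPPED,
# then immediately start again. The final start occasionally fails
# with "The instance is already running".
NAME=me   # assumed instance name

lxc start $NAME
# wait for dbus so the in-guest shutdown works (as in the script above)
lxc exec $NAME -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'
lxc exec $NAME -- shutdown -H now   # guest-initiated shutdown
until [ "$(lxc list $NAME -cs --format csv)" = STOPPED ]; do
    sleep 0.1
done
lxc start $NAME   # may fail despite the STOPPED status above
```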
Information to attach
Note that messages about instance shutdown occur both before and after the start fails, yet the code that initiates the `lxc start` doesn't run until `STOPPED` is the reported instance state.