Shortly after an instance start failure, the instance is reported as RUNNING instead of STOPPED #14042
Comments
This seems expected: the container started OK, but then wasn't able to boot due to the process limit and stopped.
I think that is not true; the start operation fails.
Yeah, it's a bit tricky: while a start operation is running, LXD reports the container as running, so it is only reported as stopped once the start operation fails. This is similar to #13453 (although a different situation) in that the running status of an instance can change during the invocation of a single start/stop command.
I think it might be better to update the status of the instance at the end of an operation, contrary to the current implementation, which changes the status when the operation gets created. In my mind that should solve both issues. Another approach is to introduce "Stopping", "Starting", etc. statuses. At the beginning of the operation the status is set to such an in-flux status. Once the operation finishes, the status is set to "Stopped" or "Running" depending on whether the operation failed or succeeded. What do you think?
The "...ing" statuses aren't a new idea (see #10625) and have their own complexities.
One thing to note here is that there isn't a DB entry for instance status; it's always calculated live, so there is no such thing as "update the status". We can see the current logic here. Interestingly, this part seems to suggest that if the operationlock (not the same as the API operation) is in the state of starting, then the status is returned as stopped (and vice versa for stopping), which is similar to what you suggest. So perhaps the start operationlock is completing while the lxc container is effectively started... Will need to look into it in more detail.
Thanks for the code pointers; I reproduced this with debug logging. See below for details (I marked the lock lines). I think the operation lock works while the instance is being started. On the start error, the lock is released, and another lock is created shortly after to mark the instance as stopped. I think this 2nd lock is responsible for marking the instance as "running" (keeping in mind the logic you pointed at above). We can see the 2nd lock takes 1 second until it gets released; I assume this is exactly the time when the status is reported as running. This 2nd lock should probably have special semantics for "on instance start error handling", so the logic you pointed at can calculate the status correctly as "stopped" in this case. Likewise, on an instance stop error, the logic could get special semantics from the lock that handles the stop error, and thus report the instance as "running" if an error occurs on stop.
These lines are the key part in my view.
But also this line is important:
As originally thought, it's the lxc instance itself which stops, not LXD asking for it to stop, so it did "start" and then stopped later on.
And it's that instance-initiated stop that causes the stop operationlock to be created, which in turn causes...
So somehow we need to resolve the conflict between the instance apparently failing to start:
with the reality that it did start and got far enough to trigger the stop hook when it did finally stop:
What is triggering this API call? I think we could add a flag to this call saying it is part of error handling, so the lock gets that flag and the status calculation can react to it.
liblxc - but it can't tell why it's happening, or whether it's happening during startup or later.
I see, thanks for the explanation. Now I get why it is so tricky with the lower-level barrier in between. Another idea: we introduce a global list of instances with failures on start, stop, or the like. The error handler on stop/start/etc. puts an entry into that list. The entries are short-lived (say, they get removed automatically after 10 seconds). An entry in this list overrides the instance status calculation. The internal endpoint to stop or start clears the entry in that list.
After contemplating this for half a day: as far as the LXD-UI is concerned, there is a much simpler solution. We can defer the clearing of the cache, so the instance status is re-fetched a few seconds after an error. This is an async operation anyway, so it is not a problem if we refresh a bit later instead of immediately. We'll surface the error message to the user right away and refresh shortly after. With the delay, we avoid hitting this "reporting running while the instance is just being stopped" bug.
Required information
Issue description
When an instance is being started and the start operation fails, the instance is reported as RUNNING for a short amount of time. This is a problem for the UI, where we re-fetch the instance after the failure and the user will see the stopped instance in the wrong status.

Steps to reproduce

1. Set config.limits.processes to 2, so the instance fails to start.
2. Start the instance; the status reported shortly after the failure should be STOPPED. This is the case if we rerun lxc ls a second later, but shortly after the failure it should also be consistent.