In offline environments, when restarting after a power outage, containers with “dead” status are not recreated again. #7406
Comments
The problem you mention was generated on purpose, as an intermediate step to get Docker to change the container status from “exited” to “dead”. These are the steps to get Docker to mark a container as DEAD (this is the way we found to recreate the DEAD state, which originally occurred after a power outage).
# First, edgeAgent needs to notice that the container has problems. To do that, we run the following two lines:
docker exec -u root moduleZ rm -rf /
docker kill --signal=SIGKILL moduleZ # ==> This is the point at which the error you mentioned appears.
# With the following two commands we can change the container status from “exited” to “dead”.
chattr +i -R /var/lib/docker/containers/XXXX
docker rm moduleZ # ==> This command fails, and docker now marks the container as “dead”.
# Now docker will say that the container status is “dead”.
docker inspect --format="{{.State.Status}}" moduleZ
# We must then allow aziot-edged to delete the container in “dead” status. To do this, we run:
chattr -i -R /var/lib/docker/containers/XXXX
# If we check the aziot logs, we will see that the DELETE method is called. The real problem is that once the container is marked as “dead”, aziot-edged deletes it and it is never created again.
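To confirm the container state at each step of the reproduction above, a small helper along these lines may be useful. This is only a minimal sketch: the function names container_status, is_dead and list_dead_containers are hypothetical and are not part of IoT Edge. The only real commands used are docker inspect with a --format template and docker ps with the status filter.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of IoT Edge.

# Print the Docker state of one container (for example running, exited, dead).
container_status() {
    docker inspect --format '{{.State.Status}}' "$1"
}

# Return 0 if Docker currently reports the named container as dead.
is_dead() {
    [ "$(container_status "$1")" = "dead" ]
}

# Print the names of all containers that Docker currently marks as dead.
list_dead_containers() {
    docker ps --all --filter status=dead --format '{{.Names}}'
}
```

For example, running `is_dead moduleZ && echo "moduleZ is dead"` after the failed docker rm should print the message, and `list_dead_containers` shows whether any other module ended up in the same state.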
Hi, I'm @diegoortegav's colleague. It is worth mentioning that the exact same issue has been reproduced with newer versions of EdgeAgent and EdgeHub:
@gauravIoTEdge thanks for your help. As @diegoortegav mentioned, the Docker failure was forced intentionally in order to reproduce the DEAD status and observe EdgeAgent's behaviour, because we have seen it delete modules after a power outage in offline installations on the customer side. When this exact case is tested on an online edge device, EdgeAgent updates the deployment from the cloud, downloads (or asks aziot-edged to download) the removed module, and then runs it again. In offline mode, however, it does not recreate the container from the image that already exists among the local Docker images, which is the behaviour we expect so that offline operation can continue as it did before the outage. Thanks!
@diegoortegav @victorabb - please allow me to tell you what a pleasure this is! I see what you two are saying, and I learnt something new today. Sorry, I did not see this earlier, or I would have responded sooner. A bit about Edge Agent: it uses the Docker Engine API underneath and is just a passthrough, so this is less EdgeAgent behavior and more Docker daemon behavior. If you disagree, I'm happy to work with you (and also to ask internally in our team). I'm not sure a dead container can be restarted: docker/cli#502 (comment) This is from docker reference: So perhaps what you're asking might not be possible at the Docker level. The only suggestion I can make is to play with restart policies, but you probably tried that already?
Hi @gauravIoTEdge, thanks for your help. Yes, we have a restart policy on all our modules: We understand that EdgeAgent uses the Docker engine internally, but as you can see in @diegoortegav's logs: It is EdgeAgent + aziot-edged that decides to DELETE the module because of its dead status. I think that is fine, and it then leads to a docker rm of the module. The issue is that EdgeAgent has decided on its own to remove the Docker container, and in an offline scenario this means the container will not be recovered. In an online scenario, edgeAgent downloads the deployment again, sees that the desired status of the module is "running" and, as the module identity is not there, POSTs the identity, pulls the image again and starts the container. What we are asking is this: while offline, if the Docker container goes into the "dead" state for any reason, it is fine for EdgeAgent to remove it, as it won't work. But afterwards edgeAgent should run the container again from the original image, which is available in the local Docker image store. Can you check with the team whether this is feasible? It sounds very much like what is already done while online, except that in the offline scenario edgeAgent doesn't need to pull the original Docker image again; it just needs to do the docker run. Thanks for your support!
@victorabb - Currently, IoT Hub is the only control plane for edge devices, and I can confidently tell you that we are not currently pursuing offline deployments as a priority. We do need some interaction with IoT Hub for deployments. In theory, one could connect IoT Edge devices to an IoT Hub running on Azure Stack Hub for complete offline scenarios; however, that requires the use of an Azure Stack Hub, which can be cost-prohibitive. @konichi3 - for visibility and (potential) roadmap questions.
@gauravIoTEdge @konichi3 Independently of that, this issue can happen to anyone with an intermittent connection. If, during a disconnection period, you are unlucky enough that your containers become DEAD for any reason, the modules will be removed from the device and their services won't work until the connection is back, which can be after a long period. That's why we believe there should be a way to change the behaviour of edgeAgent when offline, so that it does not unilaterally remove the modules from the deployment. If it detects that a module is dead, it should only stop the container, remove the Docker container and run it again from the image. Thanks!
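The behaviour requested above can be written down as a small decision function. This is a sketch of the request only, not of what edgeAgent or aziot-edged actually do: the function names and the action labels are hypothetical, and the "wait for connectivity" branch is an assumption for the case where the image is not available locally. The only real command used is docker image inspect, which exits with a non-zero status when the image is not in the local store.

```shell
#!/bin/sh
# Sketch of the requested offline behaviour; NOT edgeAgent's actual logic.
# All function names and action labels here are hypothetical.

# Return 0 if the image is already present in the local Docker image store.
image_is_local() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Print the action proposed in this thread for one module.
#   $1 = container status as reported by Docker (running, exited, dead, ...)
#   $2 = image reference used by the module
requested_action() {
    status="$1"
    image="$2"
    if [ "$status" != "dead" ]; then
        # Only the "dead" state is discussed in this issue.
        echo "none"
    elif image_is_local "$image"; then
        # Requested: remove the dead container, then recreate it offline.
        echo "remove-and-recreate-from-local-image"
    else
        # Assumption: without a local image, recreation has to wait.
        echo "remove-and-wait-for-connectivity"
    fi
}
```

The point of the sketch is that the decision depends only on the container state and on whether the image is already on the device, not on whether IoT Hub is reachable.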
@victorabb - There's no disagreement with anything you're saying. Just to clarify, IoT Edge does support extended offline operations -- if your containers hadn't died, IoT Edge would have continued to function and support them with no issues. However, deployments don't happen to be part of those extended offline operations (I'm just stating where the product is today). What you are asking about would be a feature request - and that is for @konichi3 to answer. You're not wrong. You're correct in your assessment and observations. What you're suggesting is entirely reasonable. But that decision is not up to me (or other Engineers in the team).
Hello @konichi3 @gauravIoTEdge, have you had the opportunity to consider this feature request? Is there any update you can give us? Thanks
@varunpuranik - Any thoughts here? Their observations seemed reasonable to me, but I don't know enough to be definitive.
Expected Behavior
After a restart of the server (this one without internet connectivity), the edgeAgent (via aziot-edged) should destroy the modules with “dead” status and recreate them from scratch based on the docker image.
Current Behavior
After a restart of the server (this one without internet connectivity), the edgeAgent (via aziot-edged) destroys the modules with “dead” status but does not create them from scratch based on the docker image.
Steps to Reproduce
On a properly deployed device with iotedge and no internet connection.
Context (Environment)
Output of
iotedge check
Device Information
Runtime Versions
Logs
Important: in these logs, the module called moduleZ is the one with the “dead” status.
Note that in the aziot-edged logs, at 19:15:58, edgeAgent requests the deletion of moduleZ.
After aziot-edged destroys the container, moduleZ is not recreated (its docker image still exists).
We note that the following appears continuously in the logs:
And in the edgeAgent logs the following appears:
aziot-edged logs
aziot-edged.txt
aziot-identityd logs
aziot-identityd.txt
edge-agent logs
edgeAgent_log.txt
edge-hub logs
edgeHub_log.txt
Additional Information
It is important to us that this works in an environment without internet access, even if these devices are connected only once a year.