In offline environments, when restarting after a power outage, containers with “dead” status are not recreated again. #7406

Open
diegoortegav opened this issue Dec 11, 2024 · 10 comments
Comments

@diegoortegav

diegoortegav commented Dec 11, 2024

Expected Behavior

After a restart of the server (which has no internet connectivity), the edgeAgent (via aziot-edged) should destroy the modules with “dead” status and recreate them from scratch from the Docker image.

Current Behavior

After a restart of the server (which has no internet connectivity), the edgeAgent (via aziot-edged) destroys the modules with “dead” status but does not recreate them from the Docker image.

Steps to Reproduce

On a properly deployed device running iotedge and without an internet connection:

  1. Bring one or more containers to the “dead” status.
  2. Reboot the device (a quick way to verify the resulting state is sketched below).
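
A rough way to check the state after the reboot, assuming a module named moduleZ (the name used in the logs below); these are plain Docker/iotedge CLI calls, not an official repro script:

# list containers that Docker left in the "dead" state after the outage/reboot
docker ps -a --filter "status=dead"

# see which modules edgeAgent still manages and what it logged about them
iotedge list
iotedge logs edgeAgent --tail 100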

Context (Environment)

  • No internet connectivity (offline).
  • Containers in “dead” status (after a power outage).

Output of iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
× aziot-identity-service package is up-to-date - Error
    could not query https://aka.ms/azure-iotedge-latest-versions for latest available version
        caused by: could not query https://aka.ms/azure-iotedge-latest-versions for latest available version
        caused by: error trying to connect: unsuccessful tunnel (HTTP/1.1 403 For)
        caused by: unsuccessful tunnel (HTTP/1.1 403 For)
‼ host time is close to reference time - Warning
    Could not query NTP server
        caused by: Could not query NTP server
        caused by: could not receive NTP server response: Resource temporarily unavailable (os error 11)
        caused by: Resource temporarily unavailable (os error 11)
√ production readiness: identity certificates expiry - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK


Connectivity checks (aziot-identity-service)
--------------------------------------------
× host can connect to and perform TLS handshake with iothub AMQP port - Error
    Failed to do TLS Handshake, Connection Attempt Timed out in 70 Seconds
        caused by: Failed to do TLS Handshake, Connection Attempt Timed out in 70 Seconds
        caused by: deadline has elapsed
× host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Error
    Could not connect to ... : could not complete TLS handshake
        caused by: Could not connect to ... : could not complete TLS handshake
        caused by: unsuccessful tunnel (HTTP/1.1 403 For)
× host can connect to and perform TLS handshake with iothub MQTT port - Error
    Failed to do TLS Handshake, Connection Attempt Timed out in 70 Seconds
        caused by: Failed to do TLS Handshake, Connection Attempt Timed out in 70 Seconds
        caused by: deadline has elapsed
× host can connect to and perform TLS handshake with DPS endpoint - Error
    Could not connect to global.azure-devices-provisioning.net : could not complete TLS handshake
        caused by: Could not connect to global.azure-devices-provisioning.net : could not complete TLS handshake
        caused by: unsuccessful tunnel (HTTP/1.1 403 For)


Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
× aziot-edge package is up-to-date - Error
    Error while fetching latest versions of edge components: could not send HTTP request
        caused by: Error while fetching latest versions of edge components: could not send HTTP request
        caused by: error trying to connect: unsuccessful tunnel (HTTP/1.1 403 For)
        caused by: unsuccessful tunnel (HTTP/1.1 403 For)
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
× Agent image is valid and can be pulled from upstream - Error
    Failed to get edge Agent image
        caused by: Failed to get edge Agent image
        caused by: docker returned exit status: 1, stderr = Error response from daemon: Get "https://xxx/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
‼ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - Warning
    The proxy setting for IoT Edge Agent "http://x.x.x.x:3128", IoT Edge Daemon "http://127.0.0.1:3128", IoT Identity Daemon "http://127.0.0.1:3128", and Moby "http://127.0.0.1:3128" may need to be identical.
        caused by: The proxy setting for IoT Edge Agent "http://x.x.x.x:3128", IoT Edge Daemon "http://127.0.0.1:3128", IoT Identity Daemon "http://127.0.0.1:3128", and Moby "http://127.0.0.1:3128" may need to be identical.


Connectivity checks
-------------------
× container on the default network can connect to upstream AMQP port - Error
    Container on the default network could not connect to ...:5671
        caused by: Container on the default network could not connect to ...:5671
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (Operation timed out)
× container on the default network can connect to upstream HTTPS / WebSockets port - Error
    Container on the default network could not connect to ...:443
        caused by: Container on the default network could not connect to ...:443
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (The proxy tunnel request to proxy 'http://x.x.x.x:3128/' failed with status code '403'.")
× container on the default network can connect to upstream MQTT port - Error
    Container on the default network could not connect to ...:8883
        caused by: Container on the default network could not connect to ...:8883
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (Operation timed out)
× container on the IoT Edge module network can connect to upstream AMQP port - Error
    Container on the azure-iot-edge network could not connect to ...:5671
        caused by: Container on the azure-iot-edge network could not connect to ...:5671
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (Operation timed out)
× container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - Error
    Container on the azure-iot-edge network could not connect to ...:443
        caused by: Container on the azure-iot-edge network could not connect to ...:443
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (The proxy tunnel request to proxy 'http://x.x.x.x:3128/' failed with status code '503'.")
× container on the IoT Edge module network can connect to upstream MQTT port - Error
    Container on the azure-iot-edge network could not connect to ...:8883
        caused by: Container on the azure-iot-edge network could not connect to ...:8883
        caused by: docker returned exit status: 1, stderr = One or more errors occurred. (Operation timed out)
24 check(s) succeeded.
2 check(s) raised warnings.
13 check(s) raised errors.

Device Information

  • Host OS: Ubuntu 22.04
  • Architecture: amd64
  • Container OS: Linux (Alpine & Debian)

Runtime Versions

  • aziot-edged: 1.5.13
  • Edge Agent: 1.4.43
  • Edge Hub: 1.4.43
  • Docker/Moby: 27.0.3-1, build 7d4bcd863a4c863e650eed02a550dfeb98560b83

Logs

Important: in these logs, the module called moduleZ is the one with the “dead” status.

Note that in the aziot-edged logs at 19:15:58, edgeAgent requests the deletion of moduleZ:

2024-12-11T19:15:58Z [INFO] - <-- DELETE /modules/moduleZ?api-version=2022-08-03 {"host": "mgmt.sock:80", "connection": "close"}
2024-12-11T19:15:58Z [INFO] - Removing module moduleZ...
2024-12-11T19:15:58Z [INFO] - --> 204 {}
2024-12-11T19:15:58Z [INFO] - Removing listener for module moduleZ
2024-12-11T19:15:58Z [INFO] - Stopping listener for module moduleZ

After aziot-edged destroys the container, moduleZ is not recreated (its Docker image still exists locally).

We note that the following appears continuously in the logs:

2024-12-11T19:16:08Z [INFO] - <-- PUT /identities/moduleZ?api-version=2022-08-03 {"accept": "application/json", "host": "mgmt.sock:80", "connection": "close", "content-type": "application/json", "content-length": "59"}
2024-12-11T19:18:11Z [INFO] - --> 500 {"content-type": "application/json"}

And the edgeAgent logs show:

19:24:36.060 +00:00 [INF] - Unable to process module moduleZ add or update as the module identity could not be obtained
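
A hedged way to confirm this state from the device (moduleZ is the example module name; substitute your own module and image):

# the module's Docker image is still present locally
docker image ls | grep moduleZ
# but no container exists for it any more, and edgeAgent does not recreate it
docker ps -a --filter "name=moduleZ"
iotedge list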
aziot-edged logs

aziot-edged.txt

aziot-identityd logs

aziot-identityd.txt

edge-agent logs

edgeAgent_log.txt

edge-hub logs

edgeHub_log.txt

Additional Information

It is important to us that IoT Edge works in environments without internet access; these devices may only be connected once a year.

@gauravIoTEdge
Contributor

The problem is with your container:

[screenshot]

Please google and fix your 400 and things should resolve.

@diegoortegav
Author

> The problem is with your container:
>
> [screenshot]
>
> Please google and fix your 400 and things should resolve.

The problem you mention was generated on purpose, as an intermediate step to get Docker to change the container status from “exited” to “dead”.

[screenshot]

Steps to follow to get a container marked by Docker with the DEAD status (this is the way we found to reproduce the DEAD state, which was originally produced by a power outage):

# First we need edgeAgent to realize that our container has problems; to do so, run the following two commands:
docker exec -u root moduleZ rm -rf /
docker kill --signal=SIGKILL moduleZ  # ==> AT THIS POINT the error you mentioned appears.

# With the following two commands we can change the container status from “exited” to “dead”.
chattr +i -R /var/lib/docker/containers/XXXX
docker rm moduleZ    # ==> This command fails, and Docker now marks the container as “dead”.

# Now Docker will report that the container status is “dead”.
docker inspect --format="{{.State.Status}}" moduleZ

# We must allow aziot-edged to delete the container in the “dead” status; to do this, run:
chattr -i -R /var/lib/docker/containers/XXXX

# If we check the aziot logs, we will see the DELETE method being called.

The real problem is that when the container is marked with the “dead” status, aziot-edged deletes the container, and it is never created again.

At the end of the aziot-edged logs you can see this:

[screenshot]

And the edgeAgent logs show:

[screenshot]

@victorabb

Hi, I'm a colleague of @diegoortegav.

It is worth mentioning that the exact same issue has been reproduced with newer versions of EdgeAgent and EdgeHub:

  • aziot-edged: 1.5.13
  • Edge Agent: 1.5.15
  • Edge Hub: 1.5.15
  • Docker/Moby: 27.0.3-1, build 7d4bcd863a4c863e650eed02a550dfeb98560b83

@gauravIoTEdge thanks for your help. As @diegoortegav mentioned, the Docker failure was forced intentionally so we could reproduce the DEAD status and observe the behaviour of EdgeAgent, since we had seen it delete modules after a power outage in offline installations on the customer side.

When this exact same case is tested on an online edge device, EdgeAgent updates the deployment from the cloud, downloads (or asks aziot-edged to download) the removed module, and reruns it. But in offline mode it does not recreate the container from the image that already exists locally, which is the behaviour we would expect so that the offline operations running before the outage can continue.

Thanks!

@gauravIoTEdge
Contributor

@diegoortegav @victorabb - please allow me to tell you what a pleasure this is! I see what you two are saying. I learnt something new today. Sorry, I did not see this, or I would've responded sooner.

A bit about Edge Agent: It uses the Docker Engine API underneath. It's just a passthrough. So it's less EdgeAgent behavior, and more Docker daemon. Now, if you disagree, I'm happy to work with you (and also ask internally in our team).

I'm not sure if a dead container can be restarted: docker/cli#502 (comment)

This is from the Docker reference:

[screenshot]

So perhaps what you're asking might not be possible at the Docker level. The only suggestion I can make is to play with restart policies, but you probably tried that already?

@victorabb

Hi @gauravIoTEdge, thanks for your help.

Yes, we have the restart policy set on all our modules:
"restartPolicy": "always",

We understand that EdgeAgent uses the Docker engine internally, but as you can see in @diegoortegav's logs:

[screenshot]

It is EdgeAgent + aziot-edged that decides to DELETE the module due to its dead status, which I think is fine, and this then leads to a docker rm of the module.

The issue is that EdgeAgent has unilaterally taken the decision to remove the Docker container, and in an offline scenario this means the container will never be recovered.

In an online scenario, edgeAgent will download the deployment again, see that the desired status of the module is "running", and, since the module identity is not there, it will POST the identity, pull the image again, and start the container.

What we are asking is this: while offline, if the Docker container goes into the "dead" state for any reason, it is fine for EdgeAgent to remove it, as it won't work anyway. But afterwards edgeAgent should run the container again from the original image that is still available in the local Docker image store.

Can you check with the team whether this is feasible? It sounds very much like what is already done while online, except that in the offline scenario edgeAgent would not need to pull the original Docker image again, just run it (roughly the docker run sketched below).
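
Just to illustrate what we mean by "run the docker run" (the image reference and options below are placeholders, not our real deployment values; a container started this way is outside edgeAgent's supervision and has no module identity wiring, so this is only a sketch of the idea, not a workaround we have validated):

# recreate the module container directly from the locally cached image
docker run -d --name moduleZ --network azure-iot-edge myregistry.example.com/moduleZ:1.0

What we are asking for is effectively edgeAgent doing the equivalent of this from its last known desired configuration, without needing to reach the cloud or pull the image again.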

Thanks for your support!

@gauravIoTEdge
Contributor

@victorabb - Currently, IoT Hub is the only control plane for edge devices, and I can confidently tell you that we are not currently pursuing offline deployments as a priority. We do need some interaction with IoT Hub for deployments.

In theory, one could connect IoT Edge devices to an IoT Hub running on Azure Stack Hub for completely offline scenarios; however, that requires an Azure Stack Hub, which can be prohibitively expensive.

@konichi3 - for visibility and (potential) roadmap questions.

@victorabb

@gauravIoTEdge @konichi3
According to the documentation (https://learn.microsoft.com/en-us/azure/iot-edge/offline-capabilities?view=iotedge-1.5):

> Azure IoT Edge supports extended offline operations on your IoT Edge devices and enables offline operations on downstream devices too. As long as an IoT Edge device has had one opportunity to connect to IoT Hub, that device and any downstream devices can continue to function with intermittent or no internet connection.

But independently of that, this issue can happen to anyone with an intermittent connection: if during the disconnection period you are unlucky enough that your containers become DEAD for any reason, the modules will be removed from the device, and those services won't work anymore until the connection is back, which can be after a long period.

That's why we believe the behaviour of edgeAgent while offline should be changed so that it does not unilaterally remove modules from the deployment. If it detects that modules are dead, it should only stop the container, remove the Docker container, and run it again from the image.

Thanks!

@gauravIoTEdge
Contributor

@victorabb - There's no disagreement with anything you're saying.

Just to clarify, IoT Edge does support extended offline operations -- if your containers hadn't died, IoT Edge would have continued to function and support them with no issues. However, deployments don't happen to be part of those extended offline operations (I'm just stating where the product is today).

What you are asking about would be a feature request - and that is for @konichi3 to answer.

You're not wrong. You're correct in your assessment and observations. What you're suggesting is entirely reasonable. But that decision is not up to me (or other Engineers in the team).

@victorabb

Hello @konichi3 @gauravIoTEdge, have you had the opportunity to consider this feature request? Is there any update you can give us?

Thanks

@gauravIoTEdge
Contributor

@varunpuranik - Any thoughts here? Their observations seemed reasonable to me, but I don't know enough to be definitive.
