Agent-Device Connection Robustness #538
BrianJKoopman started this conversation in Ideas
Overview
The problem of maintaining a robust connection between agents and their corresponding devices is one we haven't really addressed across all agents. Currently, most agents do not handle a network interruption well. In most cases, this breaks the TCP connection, causing a process in the agent to crash, but the agent remains online.
The current solution is to restart the process, either directly or by restarting the entire agent.
We haven't made any general recommendations for how to handle connection dropouts. As a result, a handful of agents each implement their own way of recovering the connection.
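Two solutions come to mind:
1. HostManager Reconnection: bring the agent down when the connection is lost and let the HostManager restart it.
2. Dynamically Reconnecting: recognize the connection drop and reconnect to the device without shutting the agent down.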
My preference is to implement 2. But I will briefly describe a partial implementation of 1. (Feel free to skip this section.)
HostManager Reconnection
The best example I have for how this can be accomplished is in the Lakeshore 372 Agent. This Agent will attempt to establish a connection to the 372 in the init_lakeshore task. If the connection cannot be made (the 372 is offline), it will stop the reactor, bringing down the agent.
socs/socs/agents/lakeshore372/agent.py, lines 204 to 213 in 147149e
This does not handle dropouts post-initialization, but you could imagine doing a similar thing when the connection breaks during normal operation.
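To make the shutdown-on-failure pattern concrete, here is a minimal sketch. It is not the Lakeshore 372 Agent code: `DeviceAgentSketch`, `init_device`, and the `stop_agent` callback are hypothetical names, and in a Twisted-based agent I'd assume `stop_agent` would wrap something like `reactor.callFromThread(reactor.stop)`.

```python
import socket


class DeviceAgentSketch:
    """Hypothetical agent that shuts itself down if the device is unreachable."""

    def __init__(self, host, port, stop_agent):
        self.host = host
        self.port = port
        # Assumption: a callable that brings the agent down, e.g. a wrapper
        # around reactor.callFromThread(reactor.stop) in a Twisted agent.
        self.stop_agent = stop_agent
        self.sock = None

    def init_device(self, timeout=2.0):
        """Try to connect once; return (ok, message) like a task would."""
        try:
            self.sock = socket.create_connection(
                (self.host, self.port), timeout=timeout)
        except OSError as e:
            # OSError covers refused connections, timeouts, unreachable hosts.
            print(f"Could not connect to {self.host}:{self.port} ({e}). Exiting.")
            self.stop_agent()
            return False, "Device initialization failed"
        return True, "Device initialized"
```

The same `except` branch could be reused around calls made during normal operation, which is the "similar thing" for post-initialization dropouts.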
This solution has historically been requested by users, as it brings down the agent and Docker container, which is perhaps the most obvious sign that something is wrong. If the agent stays online with a crashed process, it is much less obvious that data isn't being collected.
Dynamically Reconnecting
Rather than performing a hard shutdown of the agent and letting the HostManager restart things, this option recognizes a connection drop, and dynamically reconnects to the device if possible. This can be handled at the agent level or the driver level. Most existing implementations are at the agent level, but I'd like to advocate for moving this lower -- to the drivers.
For completeness, we'll describe the two implementations.
Reconnecting in the Agent
Reconnecting in the Agent is the most common solution currently implemented. It involves catching exceptions raised by any command that interacts with the device, then rerunning an initialization function or task before letting the process loop again. Some examples of this are:
This ends up looking like:
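A minimal sketch of that pattern, under stated assumptions: this is not taken from any existing agent, and `FakeDevice`, `FakeAgent`, `acq`, and `get_temperature` are hypothetical names standing in for a real driver and process.

```python
import time


class FakeDevice:
    """Hypothetical driver stand-in; raises once to simulate a dropout."""

    def __init__(self, fail_first=False):
        self.fail_next = fail_first

    def get_temperature(self):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError("connection lost")
        return 4.2


class FakeAgent:
    """Hypothetical agent holding the driver and an initialize step."""

    def __init__(self):
        self.device = FakeDevice(fail_first=True)
        self.init_calls = 0

    def initialize(self):
        # Re-create the driver object, which reopens the connection.
        self.init_calls += 1
        self.device = FakeDevice()
        return True, "Initialized"


def acq(agent, n_samples, retry_wait=0.0):
    """Collect n_samples readings, re-initializing after any dropout."""
    samples = []
    while len(samples) < n_samples:
        try:
            samples.append(agent.device.get_temperature())
        except ConnectionError as e:
            print(f"Lost connection ({e}); re-running initialize")
            time.sleep(retry_wait)
            agent.initialize()
    return samples
```

Note that every call touching the hardware has to sit inside the `try`, which is what makes this approach grow awkward in longer processes.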
while the 'initialize' task looks like:
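Again as a sketch rather than real agent code: `AgentSketch`, the `connect` factory, and the `force` parameter are hypothetical names. The idea is that the task can be re-run to rebuild the connection, and reports failure instead of raising.

```python
class AgentSketch:
    """Hypothetical agent whose initialize task can be re-run to reconnect."""

    def __init__(self, connect):
        # Assumption: `connect` is a callable that opens the connection and
        # returns a driver object, raising ConnectionError on failure.
        self.connect = connect
        self.device = None
        self.initialized = False

    def initialize(self, force=False):
        """(Re)create the driver object; return (ok, message)."""
        if self.initialized and not force:
            return True, "Already initialized"
        try:
            self.device = self.connect()
        except ConnectionError as e:
            self.initialized = False
            return False, f"Could not connect: {e}"
        self.initialized = True
        return True, "Initialized"
```

A process recovering from a dropout would call `initialize(force=True)`, since the agent still believes it is initialized at that point.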
I don't believe there are any examples of this being implemented on a task, but that would probably end up looking something like:
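One way this could look, as a sketch only: `FakeHeater`, `FakeAgent`, `set_heater_task`, and the `max_attempts` parameter are all hypothetical names. The task retries a bounded number of times so that it still returns if the device stays offline.

```python
import time


class FakeHeater:
    """Hypothetical driver stand-in; fails a set number of times."""

    def __init__(self, failures=0):
        self.failures = failures
        self.value = None

    def set_heater(self, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection lost")
        self.value = value


class FakeAgent:
    """Hypothetical agent holding the driver and an initialize step."""

    def __init__(self, failures):
        self.device = FakeHeater(failures)

    def initialize(self):
        # Reconnect by re-creating the driver. The remaining failure count
        # is carried over so repeated dropouts can be simulated.
        self.device = FakeHeater(self.device.failures)


def set_heater_task(agent, value, max_attempts=3, retry_wait=0.0):
    """Run one command, re-initializing and retrying on a dropout."""
    for attempt in range(1, max_attempts + 1):
        try:
            agent.device.set_heater(value)
            return True, f"Heater set to {value} on attempt {attempt}"
        except ConnectionError:
            time.sleep(retry_wait)
            agent.initialize()
    return False, f"Could not reach device after {max_attempts} attempts"
```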
Reconnecting in the Drivers
Reconnecting in the drivers avoids including reconnect logic (and especially the try/except blocks) in the Agent code. This is especially relevant for longer tasks/processes that make many function calls that interact with the hardware.
I'm going to get into a more detailed example of how I think this should work, but wanted to point out part of the inspiration comes from this open PR.
With the caveat that this isn't functional nor very elegant and is part pseudo-code -- this is what I imagine this looking like. Starting with the agent:
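As a stand-in illustration of the agent side (a sketch with hypothetical names `SelfHealingDriver` and `acq`, not the proposed implementation), the point is simply that the process loop carries no reconnect logic and no try/except once the driver recovers on its own:

```python
class SelfHealingDriver:
    """Hypothetical driver stand-in that recovers from dropouts internally."""

    def __init__(self, fail_on=(2,)):
        self.fail_on = set(fail_on)  # call numbers that hit a dropout
        self.calls = 0
        self.reconnects = 0

    def get_temperature(self):
        self.calls += 1
        if self.calls in self.fail_on:
            # A real driver would reopen its socket and resend here.
            self.reconnects += 1
        return 4.2


def acq(device, n_samples):
    """Agent-side loop: no reconnect logic and no try/except."""
    return [device.get_temperature() for _ in range(n_samples)]
```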
The drivers then have the reconnect logic. I think we could feasibly provide some base class that handles the reconnection logic, but we need to try it out to know how well that will work in practice.
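A rough sketch of what such a base class might look like, using only the standard library. `ReconnectingTCP`, `query`, `max_attempts`, and `retry_wait` are hypothetical names, not an existing socs or ocs API.

```python
import socket
import time


class ReconnectingTCP:
    """Hypothetical base class: a TCP client that reconnects on failure."""

    def __init__(self, host, port, timeout=5.0, max_attempts=3, retry_wait=1.0):
        self.host = host
        self.port = port
        self.timeout = timeout
        self.max_attempts = max_attempts
        self.retry_wait = retry_wait
        self.sock = None  # opened lazily on first use

    def _connect(self):
        self.sock = socket.create_connection(
            (self.host, self.port), timeout=self.timeout)

    def _close(self):
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def query(self, msg, bufsize=4096):
        """Send `msg` and return the reply, reconnecting if the link is down."""
        last_error = None
        for _ in range(self.max_attempts):
            try:
                if self.sock is None:
                    self._connect()
                self.sock.sendall(msg)
                reply = self.sock.recv(bufsize)
                if not reply:
                    # An empty read means the peer closed the connection.
                    raise ConnectionError("connection closed by peer")
                return reply
            except OSError as e:
                # ConnectionError and socket timeouts are OSError subclasses.
                last_error = e
                self._close()
                time.sleep(self.retry_wait)
        raise ConnectionError(
            f"no response after {self.max_attempts} attempts") from last_error
```

A device driver would subclass this and build its commands on `query`. One caveat with this design: a command is resent after a failure, which is only safe if the command is harmless to repeat.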
Other Connection Types
This discussion very much focuses on TCP. I have to give serial some more thought, but it can probably be handled in a fairly similar way. Then we need to dive into the other types like SNMP, but I believe we might already have some sort of reconnection there.
Thoughts on this design welcome! It sounded like the right direction when talking on the phone the other day. This will need to happen in coordination with simonsobs/ocs#357 so that we can tell when the connection has degraded and alert on it. It'll be great to start implementing this, though, and to form a general recommendation for how these connections can be made robust.