Interrupt agent within pytest fixture if it crashes during testing #401
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This adds an interrupt to the agent in the agent runner pytest fixture in the event that the agent crashes and hangs during testing. It will also display the
stdout
/stderr
from the agent when this happens.To do this we also refactor most of the contents of
create_agent_runner_fixture()
into theAgentRunner
class so we can keep track of some things with some class attributes.I tried to present a clean git history. I tested at each commit here the functionality of the "early exit" path and once this timeout was added, the timeout against the test that I knew was hanging as described below. I also ran all ocs and socs tests from each
main
branch to make sure I wasn't breaking anything.(I've worked through this a couple times now, it was kind of tricky to debug and make sure I wasn't breaking anything. A few things I want to mention that were subtle:
stdout
/stderr
printing didn't really seem to work for me without the first commit here disabling output buffering. So thatpython -u
is important.proc.stdout.read()
method withinraise_subprocess()
to getstdout
, but would run into an I/O error trying to read from a closed file descriptor. There's a warning belowPopen.stdin
about strictly usingcommunicate()
to interact with the process -- for a different reason, but still, one of the iterations of my work on this I restructured things to get rid of the existingstdout.read()
calls. I might still do this, but in a future PR.)Motivation and Context
Resolves #398.
For some more context, we were seeing hanging agents cause all testing to hang. In GitHub Actions this means eventually hitting the 6 hour timeout. This is difficult to debug, especially since the agent logs aren't visible. This is a step in the direction of making agent logs visible, while hopefully preventing more of these agent hangs.
How Has This Been Tested?
Through lots of manual testing.
Testing the "Early Exit" Functionality
I wanted to make sure I preserved the existing functionality. I tested the "early exit" path by adding an invalid parameter to the
ocs/tests/default.yaml
for theFakeDataAgent
called--please-crash
. This causes the agent to crash immediately on startup, and we see a test failure and the agent output. Here's an example:Testing the Timeout on a Crashed Agent
I also tested in this socs branch, which is just a checkout of the working branch at the time that #398 was logged. This produces a crash in the agent on several tests, both of which we should discuss. The first is a crash that this PR can handle, here's example output:
We see the agent output, and the extra Exception that we raised related to the agent timeout. This is the functionality this PR is trying to add. Great!
Issues with Errors in Deferreds
However, there's another situation in this branch that I don't know how to handle currently, and that's when there is an unhandled error in a deferred within the task being tested. This happens in the
test_hwp_rotation_get_state
test. If we run:then we see the test start:
However, in the agent logs to disk we see (starting at the crash -- the rest is just normal agent startup and command execution for this task):
This has to do with the
Deferred
fromaction.deferred
:https://github.com/simonsobs/socs/blob/cf89f1f32405f350d6f1bf557c80602363cc22b9/socs/agents/hwp_pid/agent.py#L315-L317
I can add an
Errback
and this always exits. (It very rarely exits cleanly without, just to mention.) I just don't know how to exit the agent in this event at the moment. Open to suggestions.Agent Not Terminating
I don't really know how to reproduce the behavior of the agent not exiting on SIGINT, so I didn't test this. I didn't really touch that code though, so probably ok? Let me know if you know how to test this.
Types of changes
Checklist: