Skip to content
This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

refactor agent_events handler #261

Merged
merged 22 commits into from
Nov 11, 2020
Merged

Conversation

bmc-msft
Copy link
Contributor

@bmc-msft bmc-msft commented Nov 4, 2020

Summary of the Pull Request

This PR refactors the agent_events function handler.

While debugging an agent that appeared to stay in the "init" state, we identified that a 'node state' event that should have set the state to 'free' was seen at the service, but somehow not saved. It isn't clear why the update was not saved. This PR attempts to refactor the agent_events handler in a handful of ways to make it easier to trace.

Note, it may be useful to review the individual commits, as GitHub is calling agent_events.py a "new file", rather than a renamed file.

Info on Pull Request

  1. Moves most of the actual state handling into onefuzzlib/agent_events.py. Testing the function code is unfortunately complex. This makes it such that we can call the individual methods from a repl shell or unit testing without the burdon of Azure Functions handlers
  2. Simplifies processing NodeEventEnvelope by directly operating on the objects, rather than potentially casting into NodeEvents
  3. Moves to explicit error handling rather than exceptions.
  4. For everything that is a Union of complex tasks, moved to separate methods for the underling context (example, on_worker_event now calls on_worker_event_running and on_worker_event_done)
  5. Always log in each of the primary on_state_update branches. There were multiple that we didn't.
  6. Since nodes now only send state updates on transition, we always save the node state as well as log the transition details in on_state_update.
  7. Unless the node is marked for deletion, always save the state in when the NodeState is init.
  8. The on_worker_event no longer updates node state. This must come from on_state_update now

Validation Steps Performed

  1. Standard integration tests.
  2. Have a scaleset with 3 nodes. Submit a libfuzzer template jobs. Once it's fully scheduled, submit another libfuzzer template job with a duration of 1 hour. Stop the first one (onefuzz jobs delete JOBID). Verify the second job eventually gets fully scheduled. Verify after 1 hour, the job stops and all the nodes get reimaged.

TODO

  • Re-add task level updating node state. Apparently, the agent isn't sending the state updates as we expect when a task state update also happens. While that should be addressed, that's out of scope for this PR. This PR needs to re-add that functionality back for the time being.

@bmc-msft bmc-msft merged commit ca209eb into microsoft:main Nov 11, 2020
@bmc-msft bmc-msft deleted the agent-events-cleanup branch November 11, 2020 23:28
@ghost ghost locked as resolved and limited conversation to collaborators Apr 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants