refactor agent_events handler #261

bmc-msft · 2020-11-04T21:22:13Z

Summary of the Pull Request

This PR refactors the agent_events function handler.

While debugging an agent that appeared to stay in the "init" state, we found that a 'node state' event that should have set the state to 'free' reached the service but was somehow not saved. It isn't clear why the update was not saved. This PR refactors the agent_events handler in a handful of ways to make it easier to trace.

Note, it may be useful to review the individual commits, as GitHub is calling agent_events.py a "new file", rather than a renamed file.

Info on Pull Request

Moves most of the actual state handling into onefuzzlib/agent_events.py. Testing the function code is unfortunately complex; this move lets us call the individual methods from a REPL or from unit tests without the burden of the Azure Functions handlers.
Simplifies processing NodeEventEnvelope by operating directly on the objects, rather than potentially casting into NodeEvents.
Moves to explicit error handling rather than exceptions.
For everything that is a Union of complex tasks, moved to separate methods for the underlying context (for example, on_worker_event now calls on_worker_event_running and on_worker_event_done).
Always log in each of the primary on_state_update branches. Previously, several branches did not log.
Since nodes now only send state updates on transition, we always save the node state as well as log the transition details in on_state_update.
Unless the node is marked for deletion, always save the state when the NodeState is init.
on_worker_event no longer updates node state. This must now come from on_state_update.
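
As a rough illustration of two of the items above (explicit error values instead of exceptions, and one method per member of a Union), here is a minimal sketch. It is not the code from this PR: the `Error` and event dataclasses, the `Result` alias, the `TASKS` dictionary, the error code string, and the handler signatures are all simplified stand-ins assumed for the example. Only the handler names on_worker_event, on_worker_event_running, and on_worker_event_done come from the description.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List, TypeVar, Union

T = TypeVar("T")


@dataclass
class Error:
    # Hypothetical error value that handlers return instead of raising.
    code: str
    errors: List[str]


# Hypothetical alias: a handler returns either its value or an Error.
Result = Union[T, Error]


@dataclass
class WorkerRunningEvent:
    # Simplified stand-in for the "task is running" member of the Union.
    task_id: str


@dataclass
class WorkerDoneEvent:
    # Simplified stand-in for the "task is done" member of the Union.
    task_id: str
    exit_code: int


WorkerEvent = Union[WorkerRunningEvent, WorkerDoneEvent]

# In-memory stand-in for the task store (assumption for this sketch).
TASKS: Dict[str, str] = {}


def on_worker_event_running(machine_id: str, event: WorkerRunningEvent) -> Result[str]:
    if event.task_id not in TASKS:
        return Error(code="INVALID_REQUEST", errors=[f"unknown task: {event.task_id}"])
    TASKS[event.task_id] = "running"
    logging.info("task running. machine_id:%s task_id:%s", machine_id, event.task_id)
    return TASKS[event.task_id]


def on_worker_event_done(machine_id: str, event: WorkerDoneEvent) -> Result[str]:
    if event.task_id not in TASKS:
        return Error(code="INVALID_REQUEST", errors=[f"unknown task: {event.task_id}"])
    TASKS[event.task_id] = "stopped"
    logging.info(
        "task done. machine_id:%s task_id:%s exit_code:%d",
        machine_id,
        event.task_id,
        event.exit_code,
    )
    return TASKS[event.task_id]


def on_worker_event(machine_id: str, event: WorkerEvent) -> Result[str]:
    # Dispatch on the concrete type; each branch has its own small method.
    if isinstance(event, WorkerRunningEvent):
        return on_worker_event_running(machine_id, event)
    if isinstance(event, WorkerDoneEvent):
        return on_worker_event_done(machine_id, event)
    return Error(code="INVALID_REQUEST", errors=[f"unsupported event: {type(event).__name__}"])


if __name__ == "__main__":
    TASKS["task-1"] = "scheduled"
    result = on_worker_event("node-1", WorkerRunningEvent(task_id="task-1"))
    if isinstance(result, Error):
        logging.error("unable to process event: %s", result)
```

Because failures come back as values, the caller checks `isinstance(result, Error)` at each step and can log the failure before deciding what to return, which fits the tracing goal described in the summary.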

Validation Steps Performed

Standard integration tests.
Have a scaleset with 3 nodes. Submit a libfuzzer template job. Once it's fully scheduled, submit another libfuzzer template job with a duration of 1 hour. Stop the first one (onefuzz jobs delete JOBID). Verify the second job eventually gets fully scheduled. Verify that after 1 hour, the job stops and all the nodes get reimaged.

TODO

Re-add task-level updating of node state. Apparently, the agent isn't sending the state updates as we expect when a task state update also happens. While that should be addressed, it is out of scope for this PR. This PR needs to re-add that functionality for the time being.

src/api-service/__app__/onefuzzlib/agent_events.py

src/api-service/__app__/agent_events/__init__.py

src/pytypes/onefuzztypes/models.py

demoray and others added 7 commits November 4, 2020 15:58

start refactoring

0733583

rename existing agent_events code

e20a0ae

add new agent_events endpoint

388ed8f

linting

2a603f4

add error logging on process failure

68748da

make error handling check Error and add more logging

1b65ee0

Merge branch 'main' into agent-events-cleanup

644ec4e

chkeita reviewed Nov 4, 2020

src/api-service/__app__/onefuzzlib/agent_events.py (outdated, resolved)

chkeita approved these changes Nov 4, 2020

ranweiler approved these changes Nov 5, 2020

demoray added 4 commits November 5, 2020 10:54

move to using Result[T]

a236a4a

explain why we ignore reimage_requested

bb514fc

flip branch condition and use info on success, not error

cd97077

move to error generation, not exception

7848e78

ranweiler reviewed Nov 5, 2020

src/pytypes/onefuzztypes/models.py (outdated, resolved)

demoray and others added 10 commits November 5, 2020 12:06

rename to OkType

1ef2d78

Merge branch 'main' into agent-events-cleanup

ecf811f

set node to "busy" on task start

abadf1c

Merge branch 'main' into agent-events-cleanup

2aaf419

Merge branch 'main' into agent-events-cleanup

8ec6fb8

update from microsoft#273

6ce4b37

Merge branch 'main' into agent-events-cleanup

48f7f5f

add more logging, and only call on_start if we're starting

95d9eaa

Merge branch 'main' into agent-events-cleanup

cef0f2d

Merge remote-tracking branch 'upstream/main' into agent-events-cleanup

5c565ca

Merge branch 'main' into agent-events-cleanup

798c911

bmc-msft merged commit ca209eb into microsoft:main Nov 11, 2020

bmc-msft deleted the agent-events-cleanup branch November 11, 2020 23:28

ghost locked as resolved and limited conversation to collaborators Apr 17, 2021