WIP Worker state transition refactor #4772
Closed
This is still very much in flux, but so far I have most tests running, at least locally.
This refactors the worker state machine so that it follows an execution model similar to the scheduler's: each transition calculates recommendations and messages, and we perform the recommended transitions until we converge and no further recommendations remain.
The overall theme of this change is to be less forgiving in edge cases, log more, and raise more often. If something unexpected happens, we do not fail silently.
All connected transitions are also linked by a transaction_id, which is generated at the top of the chain and propagated through. While I haven't put this into the logs consistently, it is already added to the transition log, so one can easily follow the reason for a given transition. (This can already be calculated from the recommendations, but in logs the ID is helpful.)
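To illustrate the execution model and the transaction ID described above, here is a minimal sketch, not the code in this PR. The `ToyWorker` class, its state names, the handler methods, and the `txn-N` ID format are all hypothetical and chosen only for illustration. Each handler returns a pair of recommendations and messages, the loop runs until no recommendations remain, every log entry carries the same ID, and an unknown transition raises instead of being ignored.

```python
import itertools
from collections import deque


class ToyWorker:
    """Hypothetical stand-in for a worker; not the real distributed.Worker."""

    def __init__(self):
        self.states = {}  # key -> current state
        self.log = []  # (key, start, finish, transaction_id)
        self._ids = itertools.count()
        # Hypothetical transition table: (start, finish) -> handler
        self._table = {
            ("waiting", "ready"): self._waiting_to_ready,
            ("ready", "executing"): self._ready_to_executing,
            ("executing", "memory"): self._executing_to_memory,
        }

    # Each handler returns (recommendations, messages).
    def _waiting_to_ready(self, key):
        return {key: "executing"}, []

    def _ready_to_executing(self, key):
        return {key: "memory"}, []

    def _executing_to_memory(self, key):
        return {}, [{"op": "task-finished", "key": key}]

    def transitions(self, recommendations, transaction_id=None):
        # The ID is generated once at the top of the chain and reused below.
        if transaction_id is None:
            transaction_id = f"txn-{next(self._ids)}"
        pending = deque(recommendations.items())
        messages = []
        while pending:  # run until there are no further recommendations
            key, finish = pending.popleft()
            start = self.states[key]
            if start == finish:
                continue
            try:
                handler = self._table[start, finish]
            except KeyError:
                # Be strict: raise instead of failing silently.
                raise RuntimeError(
                    f"Impossible transition {start!r} -> {finish!r} for {key!r}"
                ) from None
            recs, msgs = handler(key)
            self.states[key] = finish
            self.log.append((key, start, finish, transaction_id))
            pending.extend(recs.items())
            messages.extend(msgs)
        return messages
```

Recommending `{"x": "ready"}` for a task in state `waiting` drives it through `ready` and `executing` to `memory` in one call, and all three log entries share one transaction ID.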
I currently have a strong suspicion that this state machine can be described by only exit and enter actions (hence the few sporadic _transition_enter_{} methods), but I haven't converged on this yet.
One major point about this PR is that I get rid of release_key and distinguish between delete and forget actions. This helps with keeping state like a suspicious counter and helps with understanding what is actually going on.
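One possible reading of the delete/forget distinction is sketched below. This is an assumption, not the PR's implementation: the sketch assumes that delete drops only the stored result while keeping the task's bookkeeping (including a suspicious counter), and that forget removes the bookkeeping as well. `TaskRecord`, `ToyStore`, and `suspicious_count` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task bookkeeping."""

    key: str
    suspicious_count: int = 0


class ToyStore:
    """Hypothetical container; assumes delete keeps bookkeeping, forget drops it."""

    def __init__(self):
        self.tasks = {}  # key -> TaskRecord
        self.data = {}  # key -> computed result

    def delete(self, key):
        # Drop the stored result only; the TaskRecord survives.
        self.data.pop(key, None)

    def forget(self, key):
        # Drop the result and the bookkeeping.
        self.delete(key)
        self.tasks.pop(key, None)
```

Under this assumption, a single catch-all release action would have to pick one of the two behaviours, whereas two separate actions make it explicit when a counter is meant to survive.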
At the very least the following items are to be finished before this can be considered reviewable