Stability issues of sub-systems #685

Open
fsaintjacques opened this issue Oct 9, 2024 · 0 comments

fsaintjacques commented Oct 9, 2024

The fact that sub-system failures trigger a full process shutdown is an architectural decision that produces instability. We've encountered multiple scenarios where this caused the process to shut itself down:

[attachment: delete-workspace-shutdown]

I believe that such sub-systems should just log the errors, increment a metric, and move on.
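A minimal sketch of what that policy could look like, using invented names (`subsystem`, `supervise`, `subsystem_errors_total`) rather than OTF's actual types: a supervisor logs the failure, bumps a counter, and restarts the sub-system instead of returning the error and tearing the whole process down.

```go
package main

import (
	"context"
	"errors"
	"expvar"
	"log/slog"
	"time"
)

// errorsTotal counts sub-system errors; exported via /debug/vars.
var errorsTotal = expvar.NewInt("subsystem_errors_total")

// subsystem is any long-running component, e.g. the scheduler or a VCS
// event handler.
type subsystem interface {
	Name() string
	Run(ctx context.Context) error
}

// supervise logs and counts a sub-system failure and restarts it, rather
// than returning the error to the caller (which would shut the process down).
func supervise(ctx context.Context, s subsystem) {
	for {
		if err := s.Run(ctx); err != nil {
			errorsTotal.Add(1)
			slog.Error("sub-system failed; restarting", "name", s.Name(), "err", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // brief pause before restarting
		}
	}
}

// flakySubsystem simulates a component that always fails.
type flakySubsystem struct{}

func (flakySubsystem) Name() string                  { return "demo" }
func (flakySubsystem) Run(ctx context.Context) error { return errors.New("simulated failure") }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	supervise(ctx, flakySubsystem{})
}
```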

leg100 added a commit that referenced this issue Oct 18, 2024
The scheduler ensures only one run on each workspace is permitted to be
running (the exception to this rule is "plan-only" runs, which we won't
discuss here). A run begins in the `pending` state. It cannot transition
to the next state, `plan_queued`, until:

(a) any other pending runs created before it have finished.
(b) its workspace is unlocked.

Once both conditions are true, the scheduler does the following:

(a) locks the workspace 
(b) updates the status of the run to `plan_queued`
(c) sets the run as the "current run" of the workspace
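
For illustration only, here is a rough sketch of that scheduling rule in
Go; the types and field names are invented and do not reflect OTF's
actual implementation:

```go
package main

import "fmt"

type runStatus string

const (
	pending    runStatus = "pending"
	planQueued runStatus = "plan_queued"
)

type run struct {
	id     string
	status runStatus
}

type workspace struct {
	locked     bool
	currentRun string
	queue      []*run // pending runs, oldest first
}

// schedule promotes the oldest pending run once it is at the head of the
// queue and the workspace is unlocked.
func (ws *workspace) schedule() {
	if ws.locked || len(ws.queue) == 0 {
		return
	}
	next := ws.queue[0]
	if next.status != pending {
		return
	}
	ws.locked = true         // (a) lock the workspace
	next.status = planQueued // (b) advance the run to plan_queued
	ws.currentRun = next.id  // (c) record it as the workspace's current run
	ws.queue = ws.queue[1:]
}

func main() {
	ws := &workspace{queue: []*run{{id: "run-1", status: pending}}}
	ws.schedule()
	fmt.Println(ws.locked, ws.currentRun) // true run-1
}
```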

The scheduler is also responsible for unlocking the workspace once a run
completes. One of the bugs identified by the user is a race condition
that occurs whenever a run finishes and its workspace is immediately
deleted (not an uncommon scenario when you're testing changes on an
"ephemeral" workspace): the scheduler receives the "run completed"
event, but by this time the workspace has often already been deleted.
The scheduler, unaware of that, tries to unlock the workspace and
receives an error. This shouldn't be an issue: the error is "workspace
not found", and the scheduler should understand that this means the
workspace has since been deleted, that no action need be taken, and
that it should move on. But instead it wrongly interprets it as a
transient error, and backs off and retries. The fix here is clear.
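
A sketch of what that fix could look like, with made-up names
(`unlock`, `errWorkspaceNotFound`) standing in for the real service
calls: a "not found" error on unlock is treated as a benign outcome
rather than a transient error to retry.

```go
package main

import (
	"errors"
	"fmt"
)

var errWorkspaceNotFound = errors.New("workspace not found")

// unlock stands in for the service call the scheduler makes; here it
// simulates the race in which the workspace was already deleted.
func unlock(workspaceID string) error {
	return errWorkspaceNotFound
}

// handleRunCompleted unlocks the run's workspace, treating "not found"
// as a benign outcome rather than a transient error to retry.
func handleRunCompleted(workspaceID string) error {
	err := unlock(workspaceID)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errWorkspaceNotFound):
		// Workspace deleted after the run finished; nothing to unlock.
		return nil
	default:
		return err // other errors are still surfaced (and may be retried)
	}
}

func main() {
	fmt.Println(handleRunCompleted("ws-123")) // <nil>
}
```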

Another race condition occurs when the "run completed" event is received
*after* the "workspace deleted" event. The scheduler processes the
latter event and deletes its cached workspace accordingly. It then
receives the "run completed" event, tries to look up the workspace in
its cache, and cannot find it. It reports this as an error to the user,
and moves on. The "fix" here is either to accept this as an entirely
reasonable race condition and suppress the error message, or to make a
change to ensure events are processed in order. In this case I've opted
for the former.
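
Again purely as an illustration, with invented names: the cache miss is
treated as an expected consequence of out-of-order events rather than
an error worth reporting.

```go
package main

import "log/slog"

// scheduler holds a cache of workspaces keyed by ID, mirroring the
// cached state described above.
type scheduler struct {
	workspaces map[string]struct{}
}

func (s *scheduler) onRunCompleted(workspaceID string) {
	if _, ok := s.workspaces[workspaceID]; !ok {
		// The workspace was already evicted by an earlier "workspace
		// deleted" event; the events merely arrived out of order, so
		// return without reporting an error.
		return
	}
	slog.Info("unlocking workspace", "workspace", workspaceID)
	// ...unlock and update the cached workspace here...
}

func main() {
	s := &scheduler{workspaces: map[string]struct{}{}}
	s.onRunCompleted("ws-123") // cache miss: silently ignored
}
```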

#685
leg100 added a commit that referenced this issue Oct 18, 2024
#685 reports the GitHub limit of 1000 commit status updates being
routinely hit.

This PR does two things to avoid hitting this limit:
* ensures the same status update is not sent more than once, thanks to
the use of a cache.
* removes the `running` abstract VCS status, which has no GitHub
equivalent and is not actually used anywhere else in OTF.
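
A rough sketch of the deduplication idea, with types invented for
illustration (they do not mirror OTF's internal VCS client): remember
the last status sent per commit and skip identical updates.

```go
package main

import (
	"fmt"
	"sync"
)

// statusUpdate is the minimal shape of a commit status update.
type statusUpdate struct {
	SHA    string // commit the status applies to
	Status string // e.g. "pending", "success", "error"
}

// statusCache remembers the last status sent per commit so identical
// updates can be skipped.
type statusCache struct {
	mu   sync.Mutex
	last map[string]string
}

// shouldSend reports whether this update differs from the last one sent
// for the same commit, recording it if so.
func (c *statusCache) shouldSend(u statusUpdate) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.last[u.SHA] == u.Status {
		return false // identical update already sent; skip it
	}
	c.last[u.SHA] = u.Status
	return true
}

func main() {
	cache := &statusCache{last: map[string]string{}}
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "pending"})) // true
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "pending"})) // false: duplicate suppressed
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "success"})) // true
}
```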