Stability issues of sub-systems #685

Open
fsaintjacques opened this issue Oct 9, 2024 · 0 comments

fsaintjacques commented Oct 9, 2024

The fact that sub-system failures trigger a full process shutdown is an architectural decision that produces instability. We've encountered multiple scenarios where this caused the process to shut itself down:

[attachment: delete-workspace-shutdown]

I believe that such sub-systems should just log the errors, increment a metric, and move on.
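A minimal sketch of what that policy could look like, using invented names (`subsystem`, `supervise`, `subsystem_errors_total`) rather than OTF's actual types: a supervisor logs the failure, bumps a counter, and restarts the sub-system instead of returning the error and tearing the whole process down.

```go
package main

import (
	"context"
	"errors"
	"expvar"
	"log/slog"
	"time"
)

// errorsTotal counts sub-system errors; exported via /debug/vars.
var errorsTotal = expvar.NewInt("subsystem_errors_total")

// subsystem is any long-running component, e.g. the scheduler or a VCS
// event handler.
type subsystem interface {
	Name() string
	Run(ctx context.Context) error
}

// supervise logs and counts a sub-system failure and restarts it, rather
// than returning the error to the caller (which would shut the process down).
func supervise(ctx context.Context, s subsystem) {
	for {
		if err := s.Run(ctx); err != nil {
			errorsTotal.Add(1)
			slog.Error("sub-system failed; restarting", "name", s.Name(), "err", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // brief pause before restarting
		}
	}
}

// flakySubsystem simulates a component that always fails.
type flakySubsystem struct{}

func (flakySubsystem) Name() string                  { return "demo" }
func (flakySubsystem) Run(ctx context.Context) error { return errors.New("simulated failure") }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	supervise(ctx, flakySubsystem{})
}
```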

leg100 added a commit that referenced this issue Oct 18, 2024
The scheduler ensures only one run on each workspace is permitted to be
running (the exception to this rule is "plan-only" runs, which we won't
discuss here). A run begins in the `pending` state. It cannot transition
to the next state, `plan_queued`, until:

(a) any other pending runs created before it have finished.
(b) its workspace is unlocked.

Once both conditions are true, the scheduler does the following:

(a) locks the workspace 
(b) updates the status of the run to `plan_queued`
(c) sets the run as the "current run" of the workspace
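
For illustration only, here is a rough sketch of that scheduling rule in
Go; the types and field names are invented and do not reflect OTF's
actual implementation:

```go
package main

import "fmt"

type runStatus string

const (
	pending    runStatus = "pending"
	planQueued runStatus = "plan_queued"
)

type run struct {
	id     string
	status runStatus
}

type workspace struct {
	locked     bool
	currentRun string
	queue      []*run // pending runs, oldest first
}

// schedule promotes the oldest pending run once it is at the head of the
// queue and the workspace is unlocked.
func (ws *workspace) schedule() {
	if ws.locked || len(ws.queue) == 0 {
		return
	}
	next := ws.queue[0]
	if next.status != pending {
		return
	}
	ws.locked = true         // (a) lock the workspace
	next.status = planQueued // (b) advance the run to plan_queued
	ws.currentRun = next.id  // (c) record it as the workspace's current run
	ws.queue = ws.queue[1:]
}

func main() {
	ws := &workspace{queue: []*run{{id: "run-1", status: pending}}}
	ws.schedule()
	fmt.Println(ws.locked, ws.currentRun) // true run-1
}
```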

The scheduler is also responsible for unlocking the workspace once a run
completes. One of the bugs identified by the user is a race condition
that occurs whenever a run finishes and its workspace is immediately
deleted (not an uncommon scenario when you're testing changes on an
"ephemeral" workspace): the scheduler receives the "run completed"
event, but by this time the workspace has often already been deleted.
The scheduler, unaware of that, tries to unlock the workspace and
receives an error. This shouldn't be an issue: the error is "workspace
not found", and the scheduler should understand that this means the
workspace has since been deleted, that no action need be taken, and
that it should move on. But instead it wrongly interprets it as a
transient error, and backs off and retries. The fix here is clear.
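
A sketch of what that fix could look like, with made-up names
(`unlock`, `errWorkspaceNotFound`) standing in for the real service
calls: a "not found" error on unlock is treated as a benign outcome
rather than a transient error to retry.

```go
package main

import (
	"errors"
	"fmt"
)

var errWorkspaceNotFound = errors.New("workspace not found")

// unlock stands in for the service call the scheduler makes; here it
// simulates the race in which the workspace was already deleted.
func unlock(workspaceID string) error {
	return errWorkspaceNotFound
}

// handleRunCompleted unlocks the run's workspace, treating "not found"
// as a benign outcome rather than a transient error to retry.
func handleRunCompleted(workspaceID string) error {
	err := unlock(workspaceID)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errWorkspaceNotFound):
		// Workspace deleted after the run finished; nothing to unlock.
		return nil
	default:
		return err // other errors are still surfaced (and may be retried)
	}
}

func main() {
	fmt.Println(handleRunCompleted("ws-123")) // <nil>
}
```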

Another race condition occurs when the "run completed" event is received
*after* the "workspace deleted" event. The scheduler processes the
latter event and deletes its cached workspace accordingly. It then
receives the "run completed" event, tries to look up the workspace in
its cache, and cannot find it. It reports this as an error to the user,
and moves on. The "fix" here is either to accept this as an entirely
reasonable race condition and suppress the error message, or to make a
change to ensure events are processed in order. In this case I've opted
for the former.
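
Again purely as an illustration, with invented names: the cache miss is
treated as an expected consequence of out-of-order events rather than
an error worth reporting.

```go
package main

import "log/slog"

// scheduler holds a cache of workspaces keyed by ID, mirroring the
// cached state described above.
type scheduler struct {
	workspaces map[string]struct{}
}

func (s *scheduler) onRunCompleted(workspaceID string) {
	if _, ok := s.workspaces[workspaceID]; !ok {
		// The workspace was already evicted by an earlier "workspace
		// deleted" event; the events merely arrived out of order, so
		// return without reporting an error.
		return
	}
	slog.Info("unlocking workspace", "workspace", workspaceID)
	// ...unlock and update the cached workspace here...
}

func main() {
	s := &scheduler{workspaces: map[string]struct{}{}}
	s.onRunCompleted("ws-123") // cache miss: silently ignored
}
```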

#685
leg100 added a commit that referenced this issue Oct 18, 2024
#685 reports the GitHub limit of 1000 commit status updates being
routinely hit.

This PR does two things to avoid hitting this limit:
* ensures the same status update is not sent more than once, thanks to
the use of a cache.
* removes the `running` abstract VCS status, which has no GitHub
equivalent and is not actually used anywhere else in OTF.
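
A rough sketch of the deduplication idea, with types invented for
illustration (they do not mirror OTF's internal VCS client): remember
the last status sent per commit and skip identical updates.

```go
package main

import (
	"fmt"
	"sync"
)

// statusUpdate is the minimal shape of a commit status update.
type statusUpdate struct {
	SHA    string // commit the status applies to
	Status string // e.g. "pending", "success", "error"
}

// statusCache remembers the last status sent per commit so identical
// updates can be skipped.
type statusCache struct {
	mu   sync.Mutex
	last map[string]string
}

// shouldSend reports whether this update differs from the last one sent
// for the same commit, recording it if so.
func (c *statusCache) shouldSend(u statusUpdate) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.last[u.SHA] == u.Status {
		return false // identical update already sent; skip it
	}
	c.last[u.SHA] = u.Status
	return true
}

func main() {
	cache := &statusCache{last: map[string]string{}}
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "pending"})) // true
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "pending"})) // false: duplicate suppressed
	fmt.Println(cache.shouldSend(statusUpdate{"abc123", "success"})) // true
}
```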