Fix race conditions in backend lock #6007

jandubois · 2023-11-21T06:58:47Z

The current implementation of the backend lock can still lead to race conditions, lockups, and crashes (e.g. #5849, #5846, #5695). This change is supposed to fix that.

The lock file

We will use the existence of the lockfile $APPHOME/backend.lock as the only source of truth about the desired backend state:

The file exists: backend should shut down and the VM is considered locked (no backend operations are allowed until the lock is removed).
The file doesn't exist: backend should start and the VM is considered unlocked.

Note that the locked state is no longer independently settable, but is derived from the desired backend state.

Application startup

The application starts the command server API.
When the command server is running, the app sends itself a PUT check_backend_lock command.
The check_backend_lock will check if the lock file exists
The UI must not be shown/enabled before check_backend_lock is executed. There must be no way for the user to start the backend before the lock state is being checked.
If the lock file doesn't exist, it will dismiss any modal locking dialog and start up the backend if it isn't already running.
If the lock file exists, it will shutdown the backend if it is already running. Then it will display a modal locking dialog explaining that the backend is locked. It will also disable any menu commands that might alter the backend state.

rdctl locking/unlocking the backend

rdctl will first create/delete the lock file, and then call the PUT check_backend_lock API. This guarantees that the backend either doesn't start up, or will be shut down.

This way there should be no scenario in which the app could miss the notification that the lock state has changed.

Lock reason

In order to communicate the reason why the backend is locked we'll change the backend.lock file from an empty file to a JSON file¹. For now the file will include a single reason field that can be displayed in the modal lock dialog.

rdctl changes

The changes to rdctl are minimal:

PUT backend_state will be renamed to PUT check_backend_lock and the parameters will be removed.
The reason code will be written to the backend.lock file.

This means the new algorithm is already testable with the current rdctl implementation if PUT backend_state is treated the same as PUT check_backend_lock, and an empty lock file is treated as a generic reason.

It is unclear if we want to rename the file to backend-lock.json. ↩

The text was updated successfully, but these errors were encountered:

jandubois added kind/story Work item that is linked from a kind/epic area/snapshots labels Nov 21, 2023

jandubois added this to the 1.12 milestone Nov 21, 2023

jandubois assigned rak-phillip Nov 21, 2023

gaktive unassigned rak-phillip Dec 4, 2023

gaktive removed this from the 1.12 milestone Dec 4, 2023

jandubois mentioned this issue Dec 7, 2023

Backend-state event should contain details about operation in progress #5683

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race conditions in backend lock #6007

Fix race conditions in backend lock #6007

jandubois commented Nov 21, 2023 •

edited

Loading

Fix race conditions in backend lock #6007

Fix race conditions in backend lock #6007

Comments

jandubois commented Nov 21, 2023 • edited Loading

The lock file

Application startup

rdctl locking/unlocking the backend

Lock reason

rdctl changes

Footnotes

jandubois commented Nov 21, 2023 •

edited

Loading