Add persistent db module and recovery logic #1229

cloudnoize · 2024-12-23T21:53:43Z

Introduce a new persistent DB module.
Create tables and logic to persist the data needed to recover the Aggregator after a restart.
Use the database from the Aggregator.
Introduce recovery logic to use the persisted data to recover the Aggregator state as it was before the crash.
Fix for collaborator to send correct tensor key.
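
For readers skimming the description, here is a minimal sketch of the persist-then-recover idea, assuming a SQLite file as the backing store. The table name, columns, and function names below are hypothetical illustrations and are not taken from openfl/databases/persistent_db.py.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the store and ensure the hypothetical table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS round_state ("
        " round_number INTEGER NOT NULL,"
        " tensor_name TEXT NOT NULL,"
        " payload BLOB NOT NULL,"
        " PRIMARY KEY (round_number, tensor_name))"
    )
    return conn


def save_tensor(conn, round_number, tensor_name, payload):
    """Persist one serialized tensor, overwriting any earlier copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO round_state VALUES (?, ?, ?)",
            (round_number, tensor_name, payload),
        )


def recover_latest_round(conn):
    """Return (round_number, {tensor_name: payload}), or None if nothing was saved."""
    row = conn.execute("SELECT MAX(round_number) FROM round_state").fetchone()
    if row[0] is None:
        return None
    tensors = dict(
        conn.execute(
            "SELECT tensor_name, payload FROM round_state WHERE round_number = ?",
            (row[0],),
        )
    )
    return row[0], tensors
```

On restart, a component following this pattern would call `recover_latest_round` first and fall back to a fresh start when it returns `None`.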

teoparvanov

LGTM, @cloudnoize, based on a first read-through! I have a couple of questions/comments:

openfl/component/aggregator/aggregator.py

openfl/component/collaborator/collaborator.py

openfl/databases/persistent_db.py

openfl/component/aggregator/aggregator.py

openfl/databases/persistent_db.py

openfl/component/aggregator/aggregator.py

psfoley · 2025-01-06T22:13:25Z

openfl/component/aggregator/aggregator.py

+        self.lock = Lock()
+        self.use_delta_updates = use_delta_updates
+
+        if self._recover():


Again, this should be determined by the addition of an experiment_checkpoint configuration in plan.yaml:

i.e.

aggregator:
  ...
  experiment_checkpoint: True

Suggested change

-        if self._recover():
+        if experiment_checkpoint:
+            self._recover()

In addition, there needs to be some kind of check that the previously serialized state matches the current plan. This can be accomplished by storing the federation_uuid, which is determined by taking a hash of the plan.yaml file. Changing a plan in the aggregator's current workspace, then restoring state with the older configuration could result in unexpected behavior.

Is it valid to change the plan on a running experiment?
What should be done in case of a mismatch? Drop the data and start from scratch? Does it require notifying the collaborators?

@psfoley can we discuss it and implement it in a separate PR?
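
The plan-matching check raised in this thread could look roughly like the sketch below. It assumes the identifier is a SHA-256 digest of the raw plan.yaml bytes; the hash algorithm and both function names are hypothetical, and what to do on a mismatch is left to the caller because that question is still open here.

```python
import hashlib
from pathlib import Path


def plan_digest(plan_path):
    """Return a hex digest of the raw plan file bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(Path(plan_path).read_bytes()).hexdigest()


def can_restore(saved_digest, plan_path):
    """True only if state was saved under a byte-identical plan file."""
    return saved_digest is not None and saved_digest == plan_digest(plan_path)
```

The digest would be stored alongside the persisted state at save time and compared before any recovery is attempted.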

psfoley

Nice work, @cloudnoize. This is a step in the right direction, and will certainly improve user experience to permit experiment restarts after failure.

My comments for the immediate PR can be grouped into the following feedback:

Additional logs should be limited, and set to debug or condensed into a single line when state is being saved or restored.
There should be a plan configuration to enable / disable the persistent DB, as there are space implications of storing all of the aggregator's state.
There should be stronger checks in place to verify the previously saved state matches the restarted experiment (see individual comments for details).

As an aside - In a separate PR before OpenFL 1.8 is released I think there would be benefit in modifying the existing TensorDB / PersistentDB behind a common interface to store all state information. This should lead to cleaner in-memory / persistent DB options in the future.
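
One possible shape for such a common interface is sketched below. `StateStore`, `InMemoryStore`, and `SqliteStore` are hypothetical names used only for illustration; they are not existing OpenFL classes, and the real TensorDB API is considerably richer than a key/value pair.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class StateStore(ABC):
    """Hypothetical common interface for in-memory and persistent state."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore(StateStore):
    """Keeps state in a dict; everything is lost when the process exits."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class SqliteStore(StateStore):
    """Keeps state in a SQLite file so it survives a restart."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value BLOB)"
        )

    def put(self, key, value):
        with self._conn:  # commit on success, roll back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO state VALUES (?, ?)", (key, value)
            )

    def get(self, key):
        row = self._conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]
```

With both backends behind one interface, the choice between them could be driven by a single plan setting rather than by separate code paths.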

kta-intel

This is looking good overall, thanks @cloudnoize! In addition to the other reviews, I have some comments below.

openfl/component/collaborator/collaborator.py

openfl/databases/persistent_db.py

openfl/component/aggregator/aggregator.py

teoparvanov

Looks great, @cloudnoize - this change will make extended OpenFL experiments much more resilient!

teoparvanov · 2025-01-09T10:20:32Z

openfl/component/aggregator/aggregator.py

+        persist_checkpoint=True,
+        persistent_db_path=None


If persistent_db_path is None by default, then I suggest setting persist_checkpoint=False by default accordingly.

PS: As part of the bump into OpenFL-Security, you should also set the persist_checkpoint param in the defaults there:
https://github.com/intel-innersource/frameworks.ai.openfl.openfl-security/blob/develop/client/openfl_security_workspace/workspace/plan/defaults/aggregator.yaml

The None DB path means "use the current workdir", i.e. the workspace; it is not meant to hint that the feature is disabled.
I added it in preparation for secure-fl, where we want to set it to a dedicated path on the encrypted filesystem.
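
The intended semantics of a None path can be sketched as follows. This assumes persistent_db_path names a directory rather than a file; the helper name and the default file name are hypothetical.

```python
from pathlib import Path


def resolve_db_path(persistent_db_path=None, filename="persistent.db"):
    """Return the DB file location; None means the current working directory."""
    base = Path.cwd() if persistent_db_path is None else Path(persistent_db_path)
    return base / filename
```

Under this reading, whether the feature is on or off is decided by persist_checkpoint alone, and the path only controls where the file lives.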

openfl-workspace/workspace/plan/defaults/aggregator.yaml

cloudnoize changed the title ~~Add persistent db module and recovery logic~~ [WIP]Add persistent db module and recovery logic Dec 23, 2024

cloudnoize force-pushed the elerer/persistent_db branch 6 times, most recently from 727bcf2 to 3d86102 Compare December 30, 2024 09:41

cloudnoize changed the title ~~[WIP]Add persistent db module and recovery logic~~ Add persistent db module and recovery logic Jan 6, 2025

Add persistent db module and recovery logic

c1eaa0b

cloudnoize force-pushed the elerer/persistent_db branch from 3d86102 to c1eaa0b Compare January 6, 2025 08:50

teoparvanov reviewed Jan 6, 2025

psfoley reviewed Jan 6, 2025

psfoley reviewed Jan 6, 2025

psfoley reviewed Jan 6, 2025

kta-intel reviewed Jan 6, 2025

cloudnoize force-pushed the elerer/persistent_db branch from ec832f1 to 9b9a65a Compare January 7, 2025 15:18

psfoley reviewed Jan 8, 2025

Address code review comments

bfe159b

cloudnoize force-pushed the elerer/persistent_db branch from 9b9a65a to bfe159b Compare January 8, 2025 09:18

Adding persist_checkpoint flag to the plan

b6dd6e5

cloudnoize force-pushed the elerer/persistent_db branch from b74a7e1 to 39870a4 Compare January 9, 2025 09:05

Handling next round model tensors

d336c1d

cloudnoize force-pushed the elerer/persistent_db branch from 39870a4 to d336c1d Compare January 9, 2025 09:08

teoparvanov approved these changes Jan 9, 2025
