Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add persistent db module and recovery logic #1229

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

cloudnoize
Copy link

@cloudnoize cloudnoize commented Dec 23, 2024

  • Introduce a new persistent DB module.
  • Create tables and logic to persist the data needed to recovery the Aggregator after restart.
  • Use the database from the Aggregator.
  • Introduce recovery logic to use the persisted data to recover the Aggregator state as it was before the crash.
  • Fix for collaborator to send correct tensor key.

@cloudnoize cloudnoize changed the title Add persistent db module and recovery logic [WIP]Add persistent db module and recovery logic Dec 23, 2024
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch 6 times, most recently from 727bcf2 to 3d86102 Compare December 30, 2024 09:41
@cloudnoize cloudnoize changed the title [WIP]Add persistent db module and recovery logic Add persistent db module and recovery logic Jan 6, 2025
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch from 3d86102 to c1eaa0b Compare January 6, 2025 08:50
Copy link
Collaborator

@teoparvanov teoparvanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @cloudnoize, based on a first read-through! I have a couple of questions/comments:

openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
openfl/component/collaborator/collaborator.py Outdated Show resolved Hide resolved
openfl/databases/persistent_db.py Outdated Show resolved Hide resolved
openfl/databases/persistent_db.py Outdated Show resolved Hide resolved
openfl/databases/persistent_db.py Outdated Show resolved Hide resolved
openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
openfl/component/aggregator/aggregator.py Outdated Show resolved Hide resolved
self.lock = Lock()
self.use_delta_updates = use_delta_updates

if self._recover():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this should be determined by the addition of a experiment_checkpoint configuration in plan.yaml:

i.e.

aggregator:
...
    experiment_checkpoint: True
Suggested change
if self._recover():
if experiment_checkpoint:
self._recover()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, there needs to be some kind of check that the previously serialized state matches the current plan. This can be accomplished by storing the federation_uuid, which is determined by taking a hash of the plan.yaml file. Changing a plan in the aggregator's current workspace, then restoring state with the older configuration could result in unexpected behavior.

Copy link
Author

@cloudnoize cloudnoize Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it valid to change plan on a running experiment?
what shall be done in case of a mismatch? drop data and start from scratch? does it require notifying the collaborators?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psfoley can we discuss it and implement it in a separate PR?

Copy link
Contributor

@psfoley psfoley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @cloudnoize. This is a step in the right direction, and will certainly improve user experience to permit experiment restarts after failure.

My comments for the immediate PR can be grouped into the following feedback:

  1. Additional logs should be limited, and set to debug or condensed into a single line when state is being saved or restored.
  2. There should be a plan configuration to enable / disable the persistent DB as there are space implications of storing all of the aggregator's state
  3. There should be stronger checks in place to verify the previously saved state matches the restarted experiment (see individual comments for details).

As an aside - In a separate PR before OpenFL 1.8 is released I think there would be benefit in modifying the existing TensorDB / PersistentDB behind a common interface to store all state information. This should lead to cleaner in-memory / persistent DB options in the future.

Copy link
Collaborator

@kta-intel kta-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good overall, thanks @cloudnoize! In additional to the other reviews, I have some comments below

openfl/component/collaborator/collaborator.py Outdated Show resolved Hide resolved
openfl/component/collaborator/collaborator.py Outdated Show resolved Hide resolved
openfl/databases/persistent_db.py Outdated Show resolved Hide resolved
openfl/databases/persistent_db.py Outdated Show resolved Hide resolved
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch from ec832f1 to 9b9a65a Compare January 7, 2025 15:18
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch from 9b9a65a to bfe159b Compare January 8, 2025 09:18
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch from b74a7e1 to 39870a4 Compare January 9, 2025 09:05
@cloudnoize cloudnoize force-pushed the elerer/persistent_db branch from 39870a4 to d336c1d Compare January 9, 2025 09:08
Copy link
Collaborator

@teoparvanov teoparvanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, @cloudnoize - this change will make extended OpenFL experiments much more resilient!

Comment on lines +87 to +88
persist_checkpoint=True,
persistent_db_path=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If persistent_db_path is None by default, then I suggest setting persist_checkpoint=False by default accordingly.

PS: As part of the bump into OpenFL-Security, you should also set the persist_checkpoint param in the defaults there:
https://github.com/intel-innersource/frameworks.ai.openfl.openfl-security/blob/develop/client/openfl_security_workspace/workspace/plan/defaults/aggregator.yaml

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the None db path is to use the current workdir i.e. workspace, not to hint that the feature is disabled,
I added it as preparation for the secure-fl where we want to set it to a dedicated path for the encrypted fs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants