Add persistent db module and recovery logic #1229
base: develop
Conversation
cloudnoize commented Dec 23, 2024 (edited)
- Introduce a new persistent DB module.
- Create tables and logic to persist the data needed to recover the Aggregator after a restart (a rough sketch of what such a store could look like follows this list).
- Use the database from the Aggregator.
- Introduce recovery logic that uses the persisted data to restore the Aggregator state to what it was before the crash.
- Fix the collaborator to send the correct tensor key.
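For context, here is a minimal sketch of how a persistent store for aggregator state could be laid out. The table names, columns, and class name are illustrative only, not the PR's actual schema; see the module in the diff for the real implementation.

```python
# Illustrative only: a minimal SQLite-backed store for aggregator state.
# Table and column names are hypothetical, not the PR's actual schema.
import pickle
import sqlite3


class PersistentStateSketch:
    def __init__(self, db_path="aggregator_state.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tensors "
            "(tensor_key TEXT PRIMARY KEY, nparray BLOB)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(key TEXT PRIMARY KEY, value TEXT)"
        )
        self.conn.commit()

    def save_tensor(self, tensor_key, nparray):
        # Serialize the array so it can be restored after a restart.
        self.conn.execute(
            "INSERT OR REPLACE INTO tensors VALUES (?, ?)",
            (tensor_key, pickle.dumps(nparray)),
        )
        self.conn.commit()

    def save_round(self, round_number):
        # Record the last completed round so recovery knows where to resume.
        self.conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES ('round', ?)",
            (str(round_number),),
        )
        self.conn.commit()
```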
LGTM, @cloudnoize, based on a first read-through! I have a couple of questions/comments:
```python
self.lock = Lock()
self.use_delta_updates = use_delta_updates

if self._recover():
```
Again, this should be determined by the addition of an experiment_checkpoint configuration in plan.yaml, i.e.:

```yaml
aggregator:
  ...
  experiment_checkpoint: True
```
Suggested change:

```diff
-if self._recover():
+if experiment_checkpoint:
+    self._recover()
```
In addition, there needs to be some kind of check that the previously serialized state matches the current plan. This can be accomplished by storing the federation_uuid, which is determined by taking a hash of the plan.yaml file. Changing a plan in the aggregator's current workspace and then restoring state saved under the older configuration could result in unexpected behavior.
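One possible shape for such a check, as a hedged sketch: the helper names, the file path, and the choice of hash below are assumptions for illustration, not OpenFL's actual implementation.

```python
import hashlib
from pathlib import Path


def plan_hash(plan_path="plan/plan.yaml"):
    # Hash the plan file; any edit to the plan yields a different digest.
    return hashlib.sha384(Path(plan_path).read_bytes()).hexdigest()


def can_restore(persisted_hash, plan_path="plan/plan.yaml"):
    # Only restore persisted state when it was produced by the same plan.
    return persisted_hash == plan_hash(plan_path)
```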
Is it valid to change the plan on a running experiment?
What should be done in case of a mismatch? Drop the data and start from scratch? Does it require notifying the collaborators?
@psfoley can we discuss it and implement it in a separate PR?
Nice work, @cloudnoize. This is a step in the right direction and will certainly improve the user experience by permitting experiment restarts after failure.
My comments for the immediate PR can be grouped into the following feedback:
- Additional logs should be limited, and set to debug or condensed into a single line when state is being saved or restored.
- There should be a plan configuration to enable/disable the persistent DB, as there are space implications of storing all of the aggregator's state.
- There should be stronger checks in place to verify that the previously saved state matches the restarted experiment (see individual comments for details).
As an aside - In a separate PR before OpenFL 1.8 is released I think there would be benefit in modifying the existing TensorDB / PersistentDB behind a common interface to store all state information. This should lead to cleaner in-memory / persistent DB options in the future.
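As a rough illustration of that aside, the in-memory TensorDB and the new persistent DB could sit behind a contract along these lines. The class and method names here are hypothetical, not existing OpenFL classes:

```python
from abc import ABC, abstractmethod


class StateStore(ABC):
    """Hypothetical common interface for in-memory and persistent aggregator stores."""

    @abstractmethod
    def cache_tensor(self, tensor_key, nparray):
        """Store a tensor under its key."""

    @abstractmethod
    def get_tensor(self, tensor_key):
        """Return the tensor for a key, or None if absent."""

    @abstractmethod
    def checkpoint(self, round_number):
        """Flush whatever is needed to resume from this round (a no-op for a purely in-memory store)."""
```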
This is looking good overall, thanks @cloudnoize! In addition to the other reviews, I have some comments below.
Looks great, @cloudnoize - this change will make extended OpenFL experiments much more resilient!
```python
persist_checkpoint=True,
persistent_db_path=None
```
If persistent_db_path is None by default, then I suggest setting persist_checkpoint=False by default accordingly.
PS: As part of the bump into OpenFL-Security, you should also set the persist_checkpoint param in the defaults there:
https://github.com/intel-innersource/frameworks.ai.openfl.openfl-security/blob/develop/client/openfl_security_workspace/workspace/plan/defaults/aggregator.yaml
The None db path means "use the current workdir", i.e. the workspace; it is not meant to hint that the feature is disabled.
I added it as preparation for secure-fl, where we want to set it to a dedicated path for the encrypted fs.
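To make those semantics concrete, here is a small hedged sketch (the function and the fallback filename are illustrative, not the PR's code): persist_checkpoint is the on/off switch, while a None path falls back to the workspace directory rather than disabling persistence.

```python
import os


def resolve_db_path(persist_checkpoint, persistent_db_path=None):
    # persist_checkpoint, not the path, controls whether the feature is on.
    if not persist_checkpoint:
        return None
    # None means "use the current working directory (the workspace)".
    base_dir = persistent_db_path or os.getcwd()
    return os.path.join(base_dir, "persistent.db")  # filename is hypothetical
```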