[Ready for review] Make Clone-only resume more efficient. #1719
Some smallish nits but otherwise looks good.
metaflow/runtime.py
    )

    # Check if we should be the resume leader (maybe from previous attempt)
    if ds.has_metadata("_resume_leader", add_attempt=False):
This is a bit tricky: there is a possible race where the resume leader is elected but never writes out the metadata, in which case no one ends up being the resume leader. We may want to add a comment about this here, but I think we can ignore it for now since it really shouldn't happen.
I think it is fine here because this code is under "task_id_exists_already=True", which means one of two things:
- We are in the first attempt: since task_id_exists_already is true, we cannot be the leader anyway.
- We are in the second attempt, and we could be the leader based on writes from the previous attempt (no race here).
Talked offline; added code to retry (to avoid the race condition).
Seems to be working as expected based on my limited testing; should be good to go.
Current clone-only resume has a few flaws we need to solve:
To tackle this, we decided to elect a resume leader (whichever task registers the _parameters task first) and also to remember the resume leader's path. All non-leaders need to wait for the leader to finish resuming (which is essentially copying the datastore pointer to a new place).
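The election described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `InMemoryRegistry`, `register_parameters_task`, `mark_resume_done`, `wait_for_leader`, and `resume` are hypothetical names, and an in-process lock stands in for whatever atomic "first registration wins" guarantee the metadata service provides.

```python
import threading


class InMemoryRegistry:
    """Hypothetical stand-in for the metadata service (not Metaflow's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._leader = None
        self._done = threading.Event()

    def register_parameters_task(self, candidate):
        """Return True only for the first caller, which becomes the leader."""
        with self._lock:
            if self._leader is None:
                self._leader = candidate
                return True
            return False

    def mark_resume_done(self):
        """Called by the leader once cloning has finished."""
        self._done.set()

    def wait_for_leader(self, timeout=5.0):
        """Block until the leader signals completion (or the timeout expires)."""
        return self._done.wait(timeout)


def resume(registry, name, cloned):
    """Run one resume process; returns the role it ended up with."""
    if registry.register_parameters_task(name):
        # Leader: simulate copying the datastore pointer to the new place.
        cloned.append(name)
        registry.mark_resume_done()
        return "leader"
    # Non-leader: wait for the leader to finish resuming.
    registry.wait_for_leader()
    return "follower"


def run_demo(n=4):
    """Start n concurrent resume processes and report their roles."""
    registry = InMemoryRegistry()
    cloned, roles = [], {}

    def worker(i):
        roles[i] = resume(registry, f"resume-{i}", cloned)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return roles, cloned
```

Calling `run_demo(4)` should yield exactly one leader and a single clone, regardless of which thread wins the registration.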
In the case of a failed clone, we check both whether the task_id is registered and whether the task is complete, so that we still clone from the old run when the task_id is already registered but the task never completed.
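One way to read that check is as a small decision function. This is a sketch of the rule as described here, not the actual implementation; `should_clone` and its parameters are hypothetical names.

```python
def should_clone(task_id_registered: bool, task_complete: bool) -> bool:
    """Decide whether to clone the task from the old run.

    Assumption: cloning is skipped only when the task_id is already
    registered AND the task finished; a registered-but-incomplete task
    (e.g. left over from a failed clone) is cloned again.
    """
    return not (task_id_registered and task_complete)
```

Under this reading, only the registered-and-complete case skips the clone; every other combination falls back to cloning from the old run.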
To test this, one can try this flow.