Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ready for review] Make Clone-only resume more efficient. #1719

Merged
merged 9 commits into from
Mar 13, 2024

Conversation

darinyu
Copy link
Collaborator

@darinyu darinyu commented Feb 6, 2024

Current clone-only resume has a few flaws we need to solve:

  1. Each task to be resume will create a resume runtime and they all want to resume the previously completed execution tree in topological order. This resulted in unnecessary competition between resume tasks.
  2. If one resume wins the race (registered new task id) but never completed the cloning, the future attempts will never run and the resume will never finish.

To tackle this, we decided to elect a resume leader (whichever tasks register _parameters task first) and also remembers the resume leader path. All non-leader will need to wait for the leader to finish resuming (which essentially is copying the datastore pointer to new place).

In the case of a failed clone, we will check whether task_id is registered and task is complete to ensure that we will still clone the old run when task_id is already registered but never complete.

To test this, one can try this flow.

python flow.py run
python flow.py resume

@darinyu darinyu changed the title Make Clone-only resume more efficient. [Ready for review] Make Clone-only resume more efficient. Feb 7, 2024
Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some smallish nits but otherwise looks good.

metaflow/runtime.py Outdated Show resolved Hide resolved
metaflow/cli.py Outdated Show resolved Hide resolved
metaflow/runtime.py Show resolved Hide resolved
)

# Check if we should be the resume leader (maybe from previous attempt)
if ds.has_metadata("_resume_leader", add_attempt=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky here -- there is a possible race where the resume leader is elected but never writes out the metadata which would mean that then no one is the resume leader. We may want to comment on this here but I think we can ignore it for now as it really shouldn't happen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine here because it is under "task_id_exists_already=True", which means

  1. We are in first attempt, since task_id_exists_already is true, we can not be leader anyways.
  2. We are in second attempt, and we could be leader from writes in previous attempt (no race here)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked offline, added code to retry (to avoid race condition)

metaflow/runtime.py Outdated Show resolved Hide resolved
metaflow/runtime.py Outdated Show resolved Hide resolved
metaflow/task.py Show resolved Hide resolved
@romain-intel romain-intel requested a review from savingoyal March 6, 2024 07:53
Copy link
Collaborator

@saikonen saikonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be working as expected based on my limited testing, should be good to go.

@romain-intel romain-intel merged commit 98f3f2a into master Mar 13, 2024
33 of 34 checks passed
@romain-intel romain-intel deleted the test_resume branch March 13, 2024 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants