Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

Merged
merged 1 commit into from
Mar 13, 2024

Conversation

darinyu
Copy link
Collaborator

@darinyu darinyu commented Feb 8, 2024

No description provided.

@darinyu darinyu changed the title retrieve entire graph then clone [WIP] retrieve entire graph then clone Feb 9, 2024
@darinyu darinyu force-pushed the test_resume_experiment branch 3 times, most recently from 346cfe0 to 702573a Compare February 27, 2024 23:11
@darinyu darinyu changed the title [WIP] retrieve entire graph then clone [Ready for review] Retrieve original run tasks and clone successful ones directly Feb 27, 2024
Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks fine for the most part. My main comment/concern is that I think it could be expanded to cover the regular resume as well to speed that one up too. Right now, this only speeds up the reentrant resume if I am reading this right. WIth OB's approval of course but I feel that we would also benefit even for the regular resume (which is also a legit use case if not the most pressing one).

@@ -802,7 +802,10 @@ def resume(
write_run_id(run_id_file, runtime.run_id)
runtime.print_workflow_info()
runtime.persist_constants()
runtime.execute()
if clone_only:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to change the behavior only for our internal use case of clone first and then execute but it feels like this mechanism would be able to work even for a regular resume by first cloning and then continuing execution?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked offline, lets make this PR focusing on clone-only behavior first.

metaflow/runtime.py Show resolved Hide resolved
metaflow/task.py Outdated Show resolved Hide resolved
@darinyu darinyu force-pushed the test_resume branch 2 times, most recently from 781605a to 3a77db1 Compare February 29, 2024 18:50
@darinyu darinyu force-pushed the test_resume_experiment branch from 702573a to 16baeb1 Compare February 29, 2024 18:50
@darinyu darinyu force-pushed the test_resume_experiment branch from 16baeb1 to c97026c Compare March 11, 2024 23:19
Copy link
Collaborator

@saikonen saikonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing context on --clone-wait-only use case, but apart from the minor changes this should be good to go.

Comment on lines 17 to 18
print(
f"Cloning task from {flow_name}/{clone_run_id}/{step_name}/{task_id} to {flow_name}/{run_id}/{step_name}/{task_id}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an intentional print or for debugging purposes? The stdout for this seems to differ from regular MFLOG lines.

also as this is called as part of core, the f-string will break for old python so at least change that to older style if the output is intentional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few other occurences of f-strings need changes as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated print and f-string

@darinyu darinyu force-pushed the test_resume_experiment branch from 4f9e3c9 to f842b52 Compare March 13, 2024 16:12
@romain-intel
Copy link
Contributor

For reference and posterity: the way we use resume is in one of two ways:

  • the usual: "clone all tasks and continue executing in runtime.py"
  • a distributed manner where a scheduler (argo/airflow) "resumes" the run (the execution frontier of tasks is relaunched by the scheduler and then all those tasks try to "resume") and then each task will execute its own thing and the scheduler will then continue. This is effectively a resume but without a runtime.py to drive it.

This PR is the second in the line to improve that second behavior:

  • the first elected a leader to do the cloning as opposed to doing it in a distributed fashion as we used to do
  • this one makes it also a lot faster to clone the tasks (ie: the job of the leader is simplified).

Base automatically changed from test_resume to master March 13, 2024 16:25
saikonen
saikonen previously approved these changes Mar 13, 2024
Copy link
Collaborator

@saikonen saikonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a rebase due to target branch getting merged, but otherwise looks good to go.

add more print

profile time and threads

workers with clone only command

include origin_ds_set and s3 batch write

clean up pr for review

remove wait-only flag because it is no longer used

address comments

fix OSS resume
@darinyu darinyu force-pushed the test_resume_experiment branch from f842b52 to 8d4619e Compare March 13, 2024 18:36
@wangchy27 wangchy27 merged commit 7e265ac into master Mar 13, 2024
30 of 34 checks passed
@wangchy27 wangchy27 deleted the test_resume_experiment branch March 13, 2024 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants