[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

darinyu · 2024-02-08T22:17:37Z

No description provided.

romain-intel

I think this looks fine for the most part. My main comment/concern is that I think it could be expanded to cover the regular resume as well to speed that one up too. Right now, this only speeds up the reentrant resume if I am reading this right. WIth OB's approval of course but I feel that we would also benefit even for the regular resume (which is also a legit use case if not the most pressing one).

romain-intel · 2024-02-29T08:48:01Z

metaflow/cli.py

@@ -802,7 +802,10 @@ def resume(
    write_run_id(run_id_file, runtime.run_id)
    runtime.print_workflow_info()
    runtime.persist_constants()
-    runtime.execute()
+    if clone_only:


This seems to change the behavior only for our internal use case of clone first and then execute but it feels like this mechanism would be able to work even for a regular resume by first cloning and then continuing execution?

talked offline, lets make this PR focusing on clone-only behavior first.

metaflow/runtime.py

metaflow/task.py

saikonen

missing context on --clone-wait-only use case, but apart from the minor changes this should be good to go.

saikonen · 2024-03-13T14:55:34Z

metaflow/clone_util.py

+    print(
+        f"Cloning task from {flow_name}/{clone_run_id}/{step_name}/{task_id} to {flow_name}/{run_id}/{step_name}/{task_id}"


Is this an intentional print or for debugging purposes? The stdout for this seems to differ from regular MFLOG lines.

also as this is called as part of core, the f-string will break for old python so at least change that to older style if the output is intentional.

a few other occurences of f-strings need changes as well.

Updated print and f-string

romain-intel · 2024-03-13T16:23:50Z

For reference and posterity: the way we use resume is in one of two ways:

the usual: "clone all tasks and continue executing in runtime.py"
a distributed manner where a scheduler (argo/airflow) "resumes" the run (the execution frontier of tasks is relaunched by the scheduler and then all those tasks try to "resume") and then each task will execute its own thing and the scheduler will then continue. This is effectively a resume but without a runtime.py to drive it.

This PR is the second in the line to improve that second behavior:

the first elected a leader to do the cloning as opposed to doing it in a distributed fashion as we used to do
this one makes it also a lot faster to clone the tasks (ie: the job of the leader is simplified).

saikonen

needs a rebase due to target branch getting merged, but otherwise looks good to go.

add more print profile time and threads workers with clone only command include origin_ds_set and s3 batch write clean up pr for review remove wait-only flag because it is no longer used address comments fix OSS resume

darinyu changed the title ~~retrieve entire graph then clone~~ [WIP] retrieve entire graph then clone Feb 9, 2024

darinyu force-pushed the test_resume_experiment branch 3 times, most recently from 346cfe0 to 702573a Compare February 27, 2024 23:11

darinyu changed the title ~~[WIP] retrieve entire graph then clone~~ [Ready for review] Retrieve original run tasks and clone successful ones directly Feb 27, 2024

romain-intel reviewed Feb 29, 2024

View reviewed changes

darinyu force-pushed the test_resume branch 2 times, most recently from 781605a to 3a77db1 Compare February 29, 2024 18:50

darinyu force-pushed the test_resume_experiment branch from 702573a to 16baeb1 Compare February 29, 2024 18:50

darinyu force-pushed the test_resume_experiment branch from 16baeb1 to c97026c Compare March 11, 2024 23:19

saikonen requested changes Mar 13, 2024

View reviewed changes

darinyu force-pushed the test_resume_experiment branch from 4f9e3c9 to f842b52 Compare March 13, 2024 16:12

Base automatically changed from test_resume to master March 13, 2024 16:25

saikonen previously approved these changes Mar 13, 2024

View reviewed changes

retrieve entire graph then clone

8d4619e

add more print profile time and threads workers with clone only command include origin_ds_set and s3 batch write clean up pr for review remove wait-only flag because it is no longer used address comments fix OSS resume

darinyu dismissed saikonen’s stale review via 8d4619e March 13, 2024 18:36

darinyu force-pushed the test_resume_experiment branch from f842b52 to 8d4619e Compare March 13, 2024 18:36

wangchy27 merged commit 7e265ac into master Mar 13, 2024
30 of 34 checks passed

wangchy27 deleted the test_resume_experiment branch March 13, 2024 21:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

darinyu commented Feb 8, 2024

romain-intel left a comment

romain-intel Feb 29, 2024

darinyu Mar 11, 2024

saikonen left a comment

saikonen Mar 13, 2024

saikonen Mar 13, 2024

darinyu Mar 13, 2024

romain-intel commented Mar 13, 2024

saikonen left a comment

		print(
		f"Cloning task from {flow_name}/{clone_run_id}/{step_name}/{task_id} to {flow_name}/{run_id}/{step_name}/{task_id}"

[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

[Ready for review] Retrieve original run tasks and clone successful ones directly #1728

Conversation

darinyu commented Feb 8, 2024

romain-intel left a comment

Choose a reason for hiding this comment

romain-intel Feb 29, 2024

Choose a reason for hiding this comment

darinyu Mar 11, 2024

Choose a reason for hiding this comment

saikonen left a comment

Choose a reason for hiding this comment

saikonen Mar 13, 2024

Choose a reason for hiding this comment

saikonen Mar 13, 2024

Choose a reason for hiding this comment

darinyu Mar 13, 2024

Choose a reason for hiding this comment

romain-intel commented Mar 13, 2024

saikonen left a comment

Choose a reason for hiding this comment