
[backend] Artifacts of Components in ParallelFor / Sub-DAGs Overwritten by Concurrent Iterations #10186

Closed
TristanGreathouse opened this issue Oct 31, 2023 · 6 comments
Labels
area/backend kind/bug lifecycle/stale The issue / pull request is stale, any activities remove this label.

@TristanGreathouse

Environment

  • How do you deploy Kubeflow Pipelines (KFP)?
    Using AWS.

  • KFP version:
    1.8-rc5

  • KFP SDK version:

kfp                        2.3.0                  
kfp-kubernetes             1.0.0                  
kfp-pipeline-spec          0.2.2                  
kfp-server-api             2.0.2

Steps to reproduce

The problem occurs when we run the same component across ParallelFor iterations OR sub-DAGs. I discovered the error after observing unexpected behavior when running nested ParallelFor loops, although it originates from the sub-DAG structure itself and is not specific to ParallelFor loops.

What happens is that each component overwrites the artifacts of all other components created from the same component template / function, because the auto-generated S3 path is identical across ParallelFor iterations / sub-DAGs. The path is s3://{bucket}/{pipeline_name}/{run_id}/{component_name}/{output_path_name}. This means that if you run the same component in each iteration of a ParallelFor, or across duplicate sub-DAGs, it overwrites its own output in S3 (at least, that's my theory).
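The collision can be sketched in a few lines of Python (the path template is taken from the observed behavior above; the helper name and arguments are illustrative, not the real driver code):

```python
# Hypothetical sketch of the V2 artifact-path template described above;
# the function name and signature are illustrative, not real driver code.
def v2_artifact_uri(bucket, pipeline_name, run_id, component_name, output_name):
    return f"s3://{bucket}/{pipeline_name}/{run_id}/{component_name}/{output_name}"

# Two ParallelFor iterations of the same component in the same run resolve
# to the identical key, so whichever task finishes last clobbers the rest.
uri_iter_a = v2_artifact_uri("beta-kf-1-8-test", "compile-pipeline", "run-123", "parent", "output_path")
uri_iter_b = v2_artifact_uri("beta-kf-1-8-test", "compile-pipeline", "run-123", "parent", "output_path")
assert uri_iter_a == uri_iter_b  # collision: both iterations share one S3 object
```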

The deceptive part is that there is no explicit failure, only unexpected results stemming from the overwritten artifacts. Some examples of how this can happen are as follows. My general test case is simply creating a dataframe with a column filled with a passed string value; you can then check in the input/output artifact preview tab whether the artifact has the correct column. For example, in the screenshot below, the value passed for this particular iteration of a ParallelFor is b, but the value in the displayed output artifact is a.

[Screenshot (Oct 30, 2023): the artifact preview shows value a although b was passed for this iteration.]

SUB-DAGS

The pipeline code to reproduce the problem with sub-DAGs is as follows:

from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(packages_to_install=['pandas'])
def parent(val: str, output_path: Output[Dataset]):
    import pandas as pd
    data = pd.DataFrame({"a": [val] * 3, "b": [1,2,3]})
    print("data:\n", data, flush=True)
    data.to_csv(output_path.path, index=False)

@dsl.pipeline(pipeline_root="s3://beta-kf-1-8-test")
def sub_dag(val: str):
    parent_component = parent(val=val)
    parent_component.set_caching_options(False)
    parent_component.set_env_variable(name="AWS_REGION", value="us-east-1")

@dsl.pipeline(pipeline_root="s3://beta-kf-1-8-test")
def compile_pipeline(val_1: str = "a", val_2: str = "b"):
    sub_dag(val=val_1)
    sub_dag(val=val_2)

ParallelFor

The pipeline code to reproduce the problem with ParallelFor is as follows.

from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(packages_to_install=['pandas'])
def parent(val: str, output_path: Output[Dataset]):
    import pandas as pd
    data = pd.DataFrame({"a": [val] * 3, "b": [1,2,3]})
    data.to_csv(output_path.path, index=False)

@dsl.pipeline(pipeline_root="s3://beta-kf-1-8-test")
def compile_pipeline(l: list = ["a", "b", "c"]):
    with dsl.ParallelFor(l) as val:
        parent_component = parent(val=val)
        parent_component.set_caching_options(False)
        parent_component.set_env_variable(name="AWS_REGION", value="us-east-1")

Duplicate Components at the Same Level
I did test and confirm that duplicating the same component at the same level of a DAG does not run into this issue, as the auto-generated output paths were unique. The pipeline code for this test is here:

from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(packages_to_install=['pandas'])
def parent(val: str, output_path: Output[Dataset]):
    import pandas as pd
    data = pd.DataFrame({"a": [val] * 3, "b": [1,2,3]})
    data.to_csv(output_path.path, index=False)

@dsl.pipeline(pipeline_root="s3://beta-kf-1-8-test")
def compile_pipeline(val_1: str = "a", val_2: str = "b"):
    parent_component = parent(val=val_1)
    parent_component.set_caching_options(False)
    parent_component.set_env_variable(name="AWS_REGION", value="us-east-1")

    parent_component_2 = parent(val=val_2)
    parent_component_2.set_caching_options(False)
    parent_component_2.set_env_variable(name="AWS_REGION", value="us-east-1")

This issue essentially makes each iteration of a ParallelFor / sub-DAG non-unique and not dependent on the specific inputs provided, and makes the artifacts generated from it non-deterministic. I believe it may be related to the fix introduced in this PR by @chensun from last week.

Expected result

The ParallelFor / sub-dag structures should not overwrite other components executed in concurrent ParallelFor iterations / sub-dags. We should be able to refer to the output artifact of a component within an iteration of a ParallelFor or sub-dag and retrieve the artifact that was generated within the same iteration (rather than a parallel one that has different args and thus different outputs).


Impacted by this bug? Give it a 👍.

@TristanGreathouse TristanGreathouse changed the title [backend] <Bug Name> [backend] Artifacts of Components in ParallelFor / Sub-DAGs Overwritten by Concurrent Iterations Oct 31, 2023
@TristanGreathouse
Author

@chensun we've been trying to figure this one out ourselves and have some ideas, but are also unclear on how to test them properly.

The root of the problem is that the output path in V2 is named:

s3://{bucket}/{pipeline_name}/{run_id}/{component_name}/{output_path_name} 

which is not unique across sub-dags. In V1 it was:

s3://{bucket}/artifacts/{workflow_id}/{pod_id}/{output_path_name}

which is unique due to the pod ID being present. Thus, we are trying to figure out whether it's possible to simply update the artifact path to include the pod name, as that should in theory fix our issue.

The function generateOutputURI in driver.go seems to be where the URI is generated. We propose changing:

return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, artifactName))

to:

return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, component.EnvPodName, artifactName))

This could fix the issue, as it would make the output URIs unique across sub-DAGs.
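The effect of the proposed change can be sketched in Python (the real fix would live in generateOutputURI in the Go driver; the helper and pod names here are illustrative):

```python
import posixpath

# Illustrative sketch of the proposed driver change: joining the per-task
# pod name into the path makes the URI unique for each iteration.
def output_uri(root, task_name, pod_name, artifact_name):
    return root.rstrip("/") + "/" + posixpath.join(task_name, pod_name, artifact_name)

# Each ParallelFor iteration runs in its own pod, so the pod names differ.
u1 = output_uri("s3://beta-kf-1-8-test/pipe/run-123", "parent", "parent-pod-aaaaa", "output_path")
u2 = output_uri("s3://beta-kf-1-8-test/pipe/run-123", "parent", "parent-pod-bbbbb", "output_path")
assert u1 != u2  # distinct pod names per iteration, so no more collisions
```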

The problem is that this fix will, I believe, require the driver and launcher images to be rebuilt; but in the argo.go compiler these images are hard-coded, and there is a TODO assigned to you to update that logic. They are also not included in the list of Docker images built by the Makefile.

Could you assist us with the fix and/or advise us on how to properly build and incorporate the necessary images?

CC: @catapulta @sachdevayash1910

@sachdevayash1910

Following up on the comment above, we have the following questions about building the backend if we were to go ahead and do so. Any input on this would help:

  • Does building the backend involve just running make on the Makefile and building only the launcher and driver images?
  • Let's say we do build the launcher and driver images. Is there just one place where they are hardcoded, and will changing it break anything?
  • It doesn't look like we would need to build the cache images (deployer). Does that sound right?
  • Same question for metadata-writer.
  • To summarize: for the change we are looking to make, we will need to build apiserver, persistence-agent, cache-server, scheduledworkflow, viewercontroller, visualization, launcher, and driver.
  • Do we need to set the KFP_VERSION env variable explicitly somewhere so that the visualization-server image tag gets updated here?

@chensun
Member

chensun commented Nov 2, 2023

Thank you @TristanGreathouse and @sachdevayash1910 for digging into the issue. The proposed fix looks good to me.

Does building the backend involve just running make on the Makefile and building only the launcher and driver images?

For driver/launcher changes, the minimum build includes the driver/launcher images plus the apiserver image, because the driver/launcher image references are currently hard-coded in the apiserver. So the steps are:

  1. build driver/launcher with your change
  2. update the hardcoded paths
  3. build the apiserver image.

For your test build, you can use this Makefile to build the driver and launcher images. Alternatively, you can use their Dockerfiles (Dockerfile.driver & Dockerfile.launcher), which takes much longer as it involves license checking and notice-file generation.
Once you've verified your fix works, you can send a PR without changing the hardcoded image paths, and I can help build driver/launcher, push them to gcr.io/ml-pipelines, and update the hardcoded paths.

Let's say we do build the launcher and driver images. Is there just one place where they are hardcoded, and will changing it break anything?

Yes, only one place:

driverImage: "gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707",
launcherImage: "gcr.io/ml-pipeline/kfp-launcher@sha256:50151a8615c8d6907aa627902dce50a2619fd231f25d1e5c2a72737a2ea4001e",

It doesn't look like we would need to build the cache images (deployer). Does that sound right?

Correct, you don't need them. Cache server and cache deployer are for the v1 caching feature only, while your change only applies to v2; the driver is where we handle caching for v2 pipelines.

Same question for metadata-writer

metadata-writer is also for v1 only, you don't need this.

To summarize: for the change we are looking to make, we will need to build apiserver, persistence-agent, cache-server, scheduledworkflow, viewercontroller, visualization, launcher, and driver.

  • apiserver: definitely yes, see answers above
  • persistence-agent: maybe; it depends on your change and whether there's related logic or assumptions in the persistence agent
  • cache-server: no, see answers above
  • scheduledworkflow: likely no; need to check whether it assumes the path format, which I don't recall off the top of my head
  • viewercontroller: no, this is irrelevant.
  • visualization: no, irrelevant
  • launcher: yes
  • driver: yes

Do we need to set KFP_VERSION env variable explicitly somewhere such that the visualization-server image tag gets updated here

No. This is irrelevant to your change.


github-actions bot commented Feb 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Feb 1, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@b4sus
Contributor

b4sus commented Sep 24, 2024

Hey, we started to also hit this issue using ParallelFor. Could we reopen this issue?

I also submitted a simple PR (#11243) which I think will fix the issue (tested on our cluster).

It is very similar to the suggested solution: it adds a random string to the output URI path.
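The random-suffix idea can be sketched like this (the helper name is hypothetical and this is not the PR's actual code, just an illustration of the approach):

```python
import secrets

# Illustrative sketch of the random-suffix approach: append a short random
# string to the output URI so concurrent iterations never share an object key.
# (Hypothetical helper, not the actual code from PR #11243.)
def unique_output_uri(base_uri):
    return f"{base_uri.rstrip('/')}/{secrets.token_hex(4)}"

u1 = unique_output_uri("s3://bucket/pipe/run-123/parent/output_path")
u2 = unique_output_uri("s3://bucket/pipe/run-123/parent/output_path")
assert u1 != u2  # random suffixes make the keys distinct
```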

google-oss-prow bot pushed a commit that referenced this issue Oct 3, 2024
@HumairAK HumairAK added this to the KFP 2.4.0 milestone Oct 8, 2024