PATH issues due to incomplete conda environment activation on AWS Batch #734

Closed
pikulmar opened this issue Oct 4, 2021 · 10 comments

Comments

@pikulmar

pikulmar commented Oct 4, 2021

The AWS Batch jobs created by Metaflow invoke Python directly by path, e.g. metaflow_TrainingFlow_linux-64_ad59afd22c0b2ee67176dd7dc845cf35ff5599f9/bin/python, which does not add the environment's bin directory to the PATH variable. As a result, Python packages that rely on external tools such as Node.js fail, even though those tools are installed in the Conda environment (metaflow_TrainingFlow_linux-64_ad59afd22c0b2ee67176dd7dc845cf35ff5599f9 in this example).

A work-around for this problem is to add

os.environ["PATH"] = str(Path(sys.executable).parent) + ":" + os.environ["PATH"]

to the flow step that runs on AWS Batch. However, a cleaner solution would be for Metaflow to run conda activate 64_ad59afd22c0b2ee67176dd7dc845cf35ff5599f9 before starting python in the AWS Batch task definition.
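For reference, a self-contained version of the workaround could look like the sketch below. The helper name `prepend_interpreter_dir` is hypothetical (it is not part of Metaflow), and the sketch assumes that the environment's executables live in the same directory as `sys.executable`, which is the usual Conda layout on Linux.

```python
import os
import sys
from pathlib import Path


def prepend_interpreter_dir(environ=os.environ, executable=sys.executable):
    """Move the directory holding the Python interpreter to the front of PATH.

    Hypothetical helper: assumes tools installed in the Conda environment
    (e.g. node) sit next to the interpreter in <env>/bin.
    """
    bin_dir = str(Path(executable).parent)
    # Drop empty entries and any existing copy of bin_dir so repeated calls
    # do not keep growing PATH.
    entries = [
        p for p in environ.get("PATH", "").split(os.pathsep)
        if p and p != bin_dir
    ]
    environ["PATH"] = os.pathsep.join([bin_dir] + entries)
    return bin_dir
```

Calling `prepend_interpreter_dir()` at the top of the affected step, before anything spawns a subprocess, should have the same effect as the one-liner above while being safe to call more than once.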

@savingoyal
Collaborator

@pikulmar We are already setting the PATH here. What is the environment that you are trying to set up? I can try to reproduce the issue on my end.

@pikulmar
Author

pikulmar commented Oct 4, 2021

I will try to compile a minimum example reproducing the problem.

@savingoyal
Collaborator

You can also print os.environ.get('PATH') to confirm whether the path is being set correctly in your example.
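A slightly more detailed check than printing PATH is sketched below: it reports whether the interpreter's directory is on PATH and where a given tool resolves. The function name `path_report` is hypothetical, and `node` as the executable name is an assumption about what the nodejs Conda package installs.

```python
import os
import shutil
import sys
from pathlib import Path


def path_report(tool="node", environ=os.environ, executable=sys.executable):
    """Summarize whether the interpreter's directory and a tool are reachable via PATH."""
    bin_dir = str(Path(executable).parent)
    path_value = environ.get("PATH", "")
    return {
        "interpreter_dir": bin_dir,
        # True only if the environment's bin directory is a PATH entry.
        "interpreter_dir_on_path": bin_dir in path_value.split(os.pathsep),
        # Full path of the tool as resolved through PATH, or None if not found.
        "tool_location": shutil.which(tool, path=path_value),
    }
```

Running `print(path_report())` inside the step that executes on AWS Batch shows at a glance whether the Conda environment's bin directory made it onto PATH.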

@pikulmar
Author

pikulmar commented Oct 4, 2021

@savingoyal I just realized that the code you referenced above is not present in the Metaflow version in which we encountered the problem (2.2.10), so it is possible that this issue has already been fixed. I will test with a newer Metaflow release.

@savingoyal
Collaborator

Umm, that shouldn't be the case. In 2.2.10 we have the same snippet: https://github.com/Netflix/metaflow/blob/2.2.10/metaflow/plugins/conda/conda_step_decorator.py#L256

@pikulmar
Author

pikulmar commented Oct 4, 2021

For 2.2.10, I use:

import os

from metaflow import FlowSpec, step, conda, conda_base, batch


@conda_base(libraries={...})
class TestFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.use_node)


    @batch
    @conda(libraries={"nodejs": ">=16.0.0"})
    @step
    def use_node(self):
        print(os.environ.get('PATH'))
        self.next(self.end)


    @step
    def end(self):
        pass


if __name__ == "__main__":
    TestFlow()

and get:

$ pipenv run python flow.py --environment=conda run
Metaflow 2.2.10 executing TestFlow for user:marek
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
2021-10-04 14:46:24.529 Workflow starting (run-id 2012):
2021-10-04 14:46:26.532 [2012/start/17649 (pid 1423942)] Task is starting.
2021-10-04 14:46:33.063 [2012/start/17649 (pid 1423942)] Task finished successfully.
2021-10-04 14:46:36.149 [2012/use_node/17650 (pid 1424030)] Task is starting.
2021-10-04 14:46:37.501 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task is starting (status SUBMITTED)...
2021-10-04 14:46:41.922 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task is starting (status RUNNABLE)...
2021-10-04 14:46:43.036 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task is starting (status STARTING)...
2021-10-04 14:46:46.527 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task is starting (status RUNNING)...
2021-10-04 14:46:44.845 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Setting up task environment.
2021-10-04 14:46:54.757 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Downloading code package...
2021-10-04 14:46:55.470 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Code package downloaded.
2021-10-04 14:46:55.482 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task is starting.
2021-10-04 14:46:55.485 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Bootstrapping environment...
2021-10-04 14:47:37.022 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Environment bootstrapped.
2021-10-04 14:47:39.539 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2021-10-04 14:47:51.858 [2012/use_node/17650 (pid 1424030)] [eb54364a-b5ae-46be-8aa7-6c8bab43a706] Task finished with exit code 0.
2021-10-04 14:47:53.546 [2012/use_node/17650 (pid 1424030)] Task finished successfully.
2021-10-04 14:47:55.648 [2012/end/17651 (pid 1424322)] Task is starting.
2021-10-04 14:48:02.768 [2012/end/17651 (pid 1424322)] Task finished successfully.
2021-10-04 14:48:03.024 Done!

Above, the conda_base decorator is needed to install tooling for accessing our infrastructure.

@pikulmar
Author

pikulmar commented Oct 4, 2021

With Metaflow 2.3.6, I observe the same behavior:

$ pipenv run python flow.py --environment=conda run
Metaflow 2.3.6 executing TestFlow for user:marek
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2021-10-04 16:24:42.458 Creating local datastore in current directory (/home/marek/src/metaflow-bug-report/.metaflow)
Bootstrapping conda environment...(this could take a few minutes)
2021-10-04 16:25:59.195 Workflow starting (run-id 2019):
2021-10-04 16:26:02.296 [2019/start/17677 (pid 1455276)] Task is starting.
2021-10-04 16:26:10.347 [2019/start/17677 (pid 1455276)] Task finished successfully.
2021-10-04 16:26:13.163 [2019/use_node/17678 (pid 1455413)] Task is starting.
2021-10-04 16:26:14.448 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status SUBMITTED)...
2021-10-04 16:26:15.550 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNABLE)...
2021-10-04 16:26:45.738 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNABLE)...
2021-10-04 16:27:15.754 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNABLE)...
2021-10-04 16:27:45.929 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNABLE)...
2021-10-04 16:28:16.224 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNABLE)...
2021-10-04 16:28:25.090 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status STARTING)...
2021-10-04 16:28:55.132 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status STARTING)...
2021-10-04 16:29:03.819 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting (status RUNNING)...
2021-10-04 16:29:01.998 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Setting up task environment.
2021-10-04 16:29:12.067 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Downloading code package...
2021-10-04 16:29:12.837 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Code package downloaded.
2021-10-04 16:29:12.851 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task is starting.
2021-10-04 16:29:13.799 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Bootstrapping environment...
2021-10-04 16:29:48.793 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Environment bootstrapped.
2021-10-04 16:29:50.243 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2021-10-04 16:30:01.219 [2019/use_node/17678 (pid 1455413)] [a52d7195-0a41-4971-ba1f-100a4f1129bf] Task finished with exit code 0.
2021-10-04 16:30:02.642 [2019/use_node/17678 (pid 1455413)] Task finished successfully.
2021-10-04 16:30:05.156 [2019/end/17679 (pid 1456394)] Task is starting.
2021-10-04 16:30:12.998 [2019/end/17679 (pid 1456394)] Task finished successfully.
2021-10-04 16:30:13.625 Done!

Note, however, that I cannot test vanilla Metaflow due to how access to our infrastructure is managed. Perhaps it would be good to re-run the above tests using an "official" Metaflow-on-AWS deployment like the Sandboxes (to which I have no access).

@savingoyal
Collaborator

@pikulmar I just created PR #735 which should address this issue.

@savingoyal
Collaborator

Metaflow 2.4.1 addresses this issue.
