
[Feature Request]: Make Python container venv location configurable #29663

Open · 1 of 16 tasks
phoerious opened this issue Dec 7, 2023 · 10 comments

Comments

@phoerious (Contributor)
What would you like to happen?

In #16658, I added the feature that the Python SDK harness boot binary installs all application dependencies into a temporary venv to make containers reusable.

At the moment, this is hard-coded to /opt/apache/beam-venv and falls back to the default env if that fails or if RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT is set. Now my own requirements have shifted and I would like to make this path configurable.

In a first iteration of my PR back then, I used the value of the -semiPersistDir flag, which defaults to /tmp. Unfortunately, this caused some Dataflow tests to fail (iirc), so we decided to use /opt/apache/beam-venv, which is writable by the beam user in the upstream Python SDK container, but may not be so in other environments.

I think using /tmp would make sense, but making it configurable would be better. Better still would be if the runner implementation could set this automatically; for instance, the Spark runner would set it to spark.local.dir.

I could submit a PR to make this configurable with a flag (either -semiPersistDir or an additional flag if that still doesn't work). However, I wouldn't know how to obtain this value automatically from the job runner. Perhaps someone else has an idea?
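For illustration, here is a rough sketch of what the flag approach could look like on the boot-binary side (Go). This is not existing Beam code: the --venv_dir flag and the helper name are hypothetical, and the flag spellings are only approximations of what is described above. It only shows the idea of an explicit location falling back to the semi-persistent dir, with the existing escape hatches kept:

```go
// Hypothetical sketch, not actual Beam boot.go code: a configurable venv
// location that falls back from an explicit flag to the semi-persistent dir,
// and skips the extra venv entirely when the existing env var is set or when
// the chosen directory is not usable.
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

var (
	semiPersistDir = flag.String("semi_persist_dir", "/tmp", "Local semi-persistent directory.")
	// Hypothetical flag; today the location is hard-coded to /opt/apache/beam-venv.
	venvDir = flag.String("venv_dir", "", "Directory in which to create the per-job venv.")
)

// resolveVenvDir picks the directory for the temporary venv, or reports that
// the container's default environment should be used instead.
func resolveVenvDir() (string, bool) {
	if os.Getenv("RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT") != "" {
		return "", false // skip the extra venv, use the container's default env
	}
	dir := *venvDir
	if dir == "" {
		dir = *semiPersistDir // or keep /opt/apache/beam-venv as the default
	}
	target := filepath.Join(dir, "beam-venv")
	// Probe that the directory is actually writable before committing to it.
	if err := os.MkdirAll(target, 0o755); err != nil {
		return "", false // fall back to the default environment
	}
	return target, true
}

func main() {
	flag.Parse()
	if dir, ok := resolveVenvDir(); ok {
		fmt.Println("creating venv in", dir)
	} else {
		fmt.Println("using the container's default environment")
	}
}
```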

cc @tvalentyn

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn (Contributor)

semiPersistDir should be set by the runner when the runner launches the SDK harness container. It might be configurable already, based on some references in the codebase:

systemProperty "semiPersistDir", config.semiPersistDir

Would setting semiPersistDir to spark.local.dir work for all Spark runner users, so that they don't have to worry about this knob?

Given that creating the venv in the semi-persistent dir doesn't work for Dataflow, we could detect Dataflow as a special case and, if so, set RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT. The extra venv wrapping doesn't make much sense for Dataflow anyway.

@phoerious (Contributor, Author)

spark.local.dir is pretty much your writable temporary scratch space: that's where all the temporary output files and shuffle spills land. If you run Spark in a container, it's usually a semi-persistent storage mount, such as a Kubernetes emptyDir. I think Flink has something similar.

@tvalentyn (Contributor)

It sounds reasonable to use spark.local.dir for Spark runner.

Unfortunately, this caused some Dataflow tests to fail (iirc)
...
it's usually a semi-persistent storage mount, such as a Kubernetes emptyDir

Note that the root cause of Dataflow's issue was that the semi-persistent dir was backed by the /var directory, which is often mounted as a noexec partition, presumably as an additional security measure.

Per https://stackoverflow.com/questions/46525081/does-kubernetes-mount-an-emtpydir-volume-on-the-host , an emptyDir might also live under /var. I wonder if you might run into the same friction.

@phoerious (Contributor, Author) commented Dec 12, 2023

Inside the container, it's usually /var/data/SOME-ID. On the host, it lives somewhere under /var/lib/kubelet by default. In non-Kubernetes deployments, spark.local.dir defaults to /tmp.

You cannot set mount flags for emptyDirs, so according to the docs, you cannot set noexec: https://cloud.google.com/architecture/best-practices-for-building-containers#file_system_security
So I guess the only friction we could run into is if the host has /var mounted that way, although I wonder whether that wouldn't break other things.

@tvalentyn (Contributor)

the only friction we could see is if the host has /var mounted that way

That's what I am worried about. It might depend on the host OS. Dataflow uses Container-Optimized OS, which had this issue back when I last looked at it.

@phoerious (Contributor, Author)

There is probably no solution that works everywhere. You also cannot expect /opt/apache to be writable on anything but the standard SDK container.
Does Dataflow even support running Spark jobs?

@tvalentyn (Contributor) commented Dec 12, 2023

This is not a Dataflow concern; I am just wondering whether you are going to hit the same problem on your Spark cluster as we did with Dataflow, when we created the venv in a folder inside a Docker volume that was mapped onto a noexec partition on the host VM.

@tvalentyn (Contributor)

On the host, it lives somewhere under /var/lib/kubelet by default.

I suppose you could check whether this is a noexec partition in your deployment.

@phoerious (Contributor, Author)

I know for sure it isn't noexec in my deployment. The question is whether it is for most other deployments out there.

@tvalentyn (Contributor)

Is it possible to check whether a folder allows executing files from inside the Docker container, or can this only be checked on the host VM?
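For what it's worth, a minimal sketch of such a probe (Go; not Beam code, and the directory is just an example) that could run inside the container: write a tiny script into the directory in question, mark it executable, and try to run it. On a noexec mount the exec attempt fails with a permission error even though the file mode is correct.

```go
// Minimal sketch (not Beam code): probe whether a directory allows executing
// files, which fails on noexec mounts even when file permissions are correct.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// canExecIn writes a trivial script into dir, marks it executable and runs it.
// On a noexec mount the run fails with a permission error.
func canExecIn(dir string) bool {
	probe := filepath.Join(dir, "beam-exec-probe.sh")
	if err := os.WriteFile(probe, []byte("#!/bin/sh\nexit 0\n"), 0o755); err != nil {
		return false
	}
	defer os.Remove(probe)
	return exec.Command(probe).Run() == nil
}

func main() {
	dir := "/tmp" // e.g. the semi-persistent directory under test
	fmt.Printf("exec allowed in %s: %v\n", dir, canExecIn(dir))
}
```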
