Do not bloat spark image with ENV variables #2081

Merged
12 changes: 5 additions & 7 deletions images/pyspark-notebook/Dockerfile
@@ -34,20 +34,18 @@ ARG scala_version
# But it seems to be slower, that's why we use the recommended site for download
ARG spark_download_url="https://dlcdn.apache.org/spark/"

# Configure Spark
ENV SPARK_VERSION="${spark_version}" \
    HADOOP_VERSION="${hadoop_version}" \
    SCALA_VERSION="${scala_version}" \
    SPARK_DOWNLOAD_URL="${spark_download_url}"
Comment on lines -38 to -41
Member Author


These environment variable names are not standard; we made them up, so I don't think we should set them in the images.


ENV SPARK_HOME=/usr/local/spark
ENV PATH="${PATH}:${SPARK_HOME}/bin"
ENV SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info"

COPY setup_spark.py /opt/setup-scripts/

# Setup Spark
RUN /opt/setup-scripts/setup_spark.py
RUN SPARK_VERSION="${spark_version}" \
    HADOOP_VERSION="${hadoop_version}" \
    SCALA_VERSION="${scala_version}" \
    SPARK_DOWNLOAD_URL="${spark_download_url}" \
    /opt/setup-scripts/setup_spark.py
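
The change above passes the variables only to the single command that runs the setup script, instead of persisting them in the image with `ENV`. The same scoping can be sketched in Python as a rough illustration: variables handed to one child process are visible to that process and leave the parent environment untouched. The helper `run_with_scoped_env` and the variable name `DEMO_SPARK_VERSION` are hypothetical and are not part of this repository.

```python
import os
import subprocess
import sys


def run_with_scoped_env(overrides: dict[str, str], variable: str) -> str:
    """Run a child Python process that prints one environment variable.

    The overrides are merged into a copy of the current environment, so the
    parent process's own environment is left untouched.
    """
    child_env = {**os.environ, **overrides}
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ[{variable!r}])"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# DEMO_SPARK_VERSION is a hypothetical name used only for this illustration.
seen_by_child = run_with_scoped_env(
    {"DEMO_SPARK_VERSION": "3.5.0"}, "DEMO_SPARK_VERSION"
)
print(seen_by_child)
# The parent environment was never modified by the call above.
print("DEMO_SPARK_VERSION" in os.environ)
```

In the Dockerfile the shell does the equivalent work: the `NAME=value command` prefix sets the variables for that one command, so nothing extra ends up in the environment of containers started from the image.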

# Configure IPython system-wide
COPY ipython_kernel_config.py "/etc/ipython/"
18 changes: 0 additions & 18 deletions tagging/taggers.py
@@ -12,18 +12,6 @@ def _get_program_version(container: Container, program: str) -> str:
    return DockerRunner.run_simple_command(container, cmd=f"{program} --version")


def _get_env_variable(container: Container, variable: str) -> str:
    env = DockerRunner.run_simple_command(
        container,
        cmd="env",
        print_result=False,
    ).split()
    for env_entry in env:
        if env_entry.startswith(variable):
            return env_entry[len(variable) + 1 :]
    raise KeyError(variable)


def _get_pip_package_version(container: Container, package: str) -> str:
    PIP_VERSION_PREFIX = "Version: "

@@ -136,12 +124,6 @@ def tag_value(container: Container) -> str:
        return "spark-" + version_line.split(" ")[-1]


class HadoopVersionTagger(TaggerInterface):
    @staticmethod
    def tag_value(container: Container) -> str:
        return "hadoop-" + _get_env_variable(container, "HADOOP_VERSION")
Member Author


This tagger is pretty useless, because the tag it produces just says "3". And in custom builds, people obviously know their Hadoop version, so they are not interested in tagging it.

I also think that getting versions from environment variables is not the best solution; programs should be able to report their versions themselves.
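
As a rough sketch of that idea, a tag can be derived from whatever a program prints for its version rather than from an environment variable. The helper `parse_version` and the sample output string below are hypothetical illustrations; they are not the repository's taggers, and real programs format their version output differently.

```python
import re


def parse_version(output: str) -> str:
    """Return the first dotted version number found in a program's output."""
    match = re.search(r"\d+(?:\.\d+)+", output)
    if match is None:
        raise ValueError(f"no version found in: {output!r}")
    return match.group(0)


# Hypothetical sample of what a version command might print.
sample_output = "tool version 3.5.0 (build 1)"
print("tool-" + parse_version(sample_output))
```

A regular expression is used here only to keep the sketch tolerant of surrounding text; taking the last word of a known version line works just as well when the output format is fixed.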



class JavaVersionTagger(TaggerInterface):
    @staticmethod
    def tag_value(container: Container) -> str: