Interactive docker image #709
base: main
Conversation
I mainly looked at the dotnet-interactive/Dockerfile. I left a few suggestions/questions that may help simplify the Dockerfile.
@@ -0,0 +1,46 @@
FROM jupyter/base-notebook:ubuntu-18.04

ARG NB_USER=jovyan
What is the usage in which this ARG would be specified when building? I am not seeing how this would work if a different user were specified; by this I mean I don't see the user being defined/added. The base image defines the "jovyan" user, so why would you want to define something different? The same applies to NB_UID.
ARG NB_USER=jovyan
ARG NB_UID=1000
ARG DOTNET_CORE_VERSION=3.1
ARG DEBIAN_FRONTEND=noninteractive
Is this necessary? The base image already defines ENV DEBIAN_FRONTEND=noninteractive.
ARG DEBIAN_FRONTEND=noninteractive

ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION
ENV USER ${NB_USER}
Related to a previous comment, are these USER-related ENVs necessary, since they are defined in the base image?
DOTNET_RUNNING_IN_CONTAINER=true \
DOTNET_USE_POLLING_FILE_WATCHER=true \
NUGET_XMLDOC_MODE=skip \
DOTNET_TRY_CLI_TELEMETRY_OPTOUT=true
Since this is being proposed as part of the .NET project, telemetry should remain enabled.

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
What is requiring all of these dependencies? Ones that are already provided by the base image don't seem necessary to include.
&& apt-get install -y apt-transport-https \
&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get autoremove -y --purge \
Does this remove anything?
&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get autoremove -y --purge \
&& apt-get clean \
This shouldn't be necessary because of the next line.
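For context, the pattern this review is steering toward is the usual single-RUN apt sequence, where update, install, and cleanup share one layer so the cached package lists never persist in the image. A minimal sketch (package names are illustrative, not the PR's exact set):

```dockerfile
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
```

Because the `rm -rf /var/lib/apt/lists/*` runs in the same instruction, a separate `apt-get clean` adds little.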
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& chmod 755 /usr/local/bin/start-spark-debug.sh \
&& chown -R ${NB_UID} ${HOME} \
&& cd ${HOME}/nb
This doesn't seem necessary.

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
Per the Dockerfile best practices, it helps readability to break this apart, one package per line, and alphabetize them.
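Applied to the install step above, the suggested style would look roughly like this (a sketch only; the exact package set is for the PR author to decide):

```dockerfile
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        apt-utils \
        ca-certificates \
        dialog \
        openjdk-8-jdk \
        software-properties-common \
        supervisor \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
```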
@MichaelSimons, thank you very much for reviewing the Dockerfile and your valuable comments. I've just updated the PR to reflect the changes required for dotnet.spark version 1.0.0. This should also include an updated version of the Dockerfile, addressing your comments.
@indy-3rdman, Thanks for addressing my comments. I took another look and had a few more suggestions and questions.
@@ -21,24 +14,30 @@ USER root

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
    libc6 libgcc1 libgssapi-krb5-2 libicu60 libssl1.1 libstdc++6 zlib1g \
    apt-utils \
What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.
This should be cleaned up now. Java obviously is required by Spark.
libgssapi-krb5-2 \
libicu60 \
libssl1.1 \
libstdc++6 zlib1g \
Multiple packages are listed together; they should be split apart so that zlib1g is not overlooked.
Should be on a separate line as well now.
libstdc++6 zlib1g \
openjdk-8-jdk \
software-properties-common \
unzip \
&& wget -q --show-progress --progress=bar:force:noscroll https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
Consider: I've typically seen Dockerfiles avoid using --show-progress, as it does have a perf impact.
That raises an interesting point about the purpose of the Dockerfile(s). As far as I am aware, the focus at the moment is to enable a user to build the image(s) themselves, instead of automating the image build process. That's why I thought it would be useful to show the download progress. Now, for small downloads that doesn't really matter, and I have therefore removed it. However, I have added the following line to the dotnet-spark/Dockerfile
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
as the spark download can take a while. Does that make sense?

RUN mkdir -p /dotnet/HelloSpark \
    && mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
    && wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
This should at some point perform checksum validation. @JeremyLikness, are checksums published for the Spark artifacts?
RUN mkdir -p /dotnet/HelloSpark \
    && mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
    && wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
    && tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
Have you considered extracting the tarball to the /dotnet folder, instead of extracting it to the working dir and then immediately mv-ing it?
Should be changed in the latest commit.
&& wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& mv Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION} /dotnet/ \
&& cp /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.exe \
I'm curious why this is necessary?
Sorry, this is just a leftover from another file. Removed.
&& chmod 755 /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker \
&& rm Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz

COPY HelloSpark /dotnet/HelloSpark
I see the HelloSpark project is used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode. Is it necessary to have the project in the resulting image? It feels a little strange to have a sample project in a "base" image.
Moved from the dotnet-spark-base to the dotnet-spark Dockerfile. Additionally, this is now removed after the jar file has been copied.

USER root
ENV DAEMON_RUN=true
Have you considered using the multi-line ENV format? I find that it can help the readability of the Dockerfiles, in that it helps the reader easily scan them. It makes it more obvious that these are all ENV instructions.
ENV DAEMON_RUN=true \
DOTNETBACKEND_PORT=5567 \
HADOOP_VERSION=2.7 \
JUPYTER_ENABLE_LAB=true \
SPARK_VERSION=$SPARK_VERSION \
SPARK_HOME=/spark \
PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"
Updated the Dockerfiles.
ARG DOTNET_SPARK_VERSION=0.12.1
ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION
ENV DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}
ARG SPARK_VERSION=3.0.1
Having version numbers like this hard coded gives me pause. Is this done so that the Dockerfile as it is checked in is buildable without having to specify any args? The problem that introduces is a maintenance burden of keeping it up-to-date.
This is related to my earlier point about the purpose of the Dockerfile(s). The intention was to have a build-able Dockerfile even if the build script is not used. I agree with your observation about maintenance. Maybe @rapoth has a view on that.
@@ -8,12 +8,17 @@ set -o nounset # abort on unbound variable
set -o pipefail # don't hide errors within pipes

readonly image_repository='3rdman'
readonly supported_apache_spark_versions=("2.3.3" "2.3.4" "2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6")
readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")
readonly supported_apache_spark_versions=(
This may be a question for the Spark team. Thoughts on how to keep this version list, and the other versions included in this script, up-to-date? It feels like there should be long-term plans for getting this updated "automatically" as part of the release process. Without this they will become stale and/or be a maintenance burden.
Agreed
ENV PATH="${PATH}:${HOME}/.dotnet/tools"

ENV DOTNET_RUNNING_IN_CONTAINER=true \
ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \
Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. cross dependencies).
&& rm -rf /dotnet/HelloSpark \
&& cd / \
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
&& wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
You can extract to the spark directory with a single instruction, which would eliminate the need for mv.
I assume you mean to use tar with --directory. But wouldn't that require that the directory already exists? In that case I'd have to add a mkdir first.
You're correct, I missed what was happening here. Please ignore my comment.
RUN cd /dotnet/HelloSpark \
    && dotnet build \
    && cp /dotnet/HelloSpark/bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
    && rm -rf /dotnet/HelloSpark \
The unfortunate consequence of this pattern is that HelloSpark remains in the image as a result of obtaining it via COPY. This is not desirable. Is there a way this can be generated during the Docker build, or can it be a published tarball so that it can get copied and deleted within a single Dockerfile instruction?
Thanks again @MichaelSimons for your great feedback! Just creating a dummy project during the build process now.
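The dummy-project approach could look roughly like the following (a sketch only; it assumes the .NET SDK is already installed in the image, and that the Microsoft.Spark NuGet package places the matching microsoft-spark-*.jar in the build output — the project name DummySpark is hypothetical):

```dockerfile
RUN cd /tmp \
    && dotnet new console -o DummySpark \
    && cd DummySpark \
    && dotnet add package Microsoft.Spark --version ${DOTNET_SPARK_VERSION} \
    && dotnet build \
    && cp bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
    && rm -rf /tmp/DummySpark
```

Because the project is created, built, harvested, and deleted within a single RUN instruction, nothing from it persists in any image layer.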
This PR contains a build script, Dockerfile(s), README.md and supporting files to create a docker image that can run .NET for Apache Spark in an interactive Jupyter notebook.
An initial description for the interactive image, along with the folder structure can be found here: https://github.com/indy-3rdman/spark/tree/interactive-docker-image/docker/images/interactive