
Add Scala as --build-arg #1757

Merged 21 commits, Jul 27, 2022
Conversation

@bjornjorgensen (Contributor) commented Jul 26, 2022

Describe your changes

Add a choice of Scala version.

Issue ticket if applicable

[BUG] pyspark-notebook no longer builds for Spark 3.1.3, 3.1.2, 2.4.8 - Fix: #1756

Checklist (especially for first-time contributors)

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests
  • I will try not to use force-push to make the review process easier for reviewers
  • I have updated the documentation for significant changes

@mathbunnyru mathbunnyru marked this pull request as draft July 26, 2022 18:12
@mathbunnyru (Member)
Converted this to a draft - it won't affect you, but no one will merge this accidentally :)

@bjornjorgensen (Contributor, Author)
@mathbunnyru Thank you :)
[WIP] Work in progress...

I have not tested this yet, but I hope we can offer a choice of Scala version.

@bjornjorgensen (Contributor, Author) commented Jul 26, 2022

Wooo.. we have a runner ;)

docker build --pull --rm -f "Dockerfile" -t dockerstacks:latest .
Sending build context to Docker daemon  5.632kB
Step 1/24 : ARG OWNER=jupyter
Step 2/24 : ARG BASE_CONTAINER=$OWNER/scipy-notebook
Step 3/24 : FROM $BASE_CONTAINER

(...)

Step 15/24 : RUN if [ -z "$scala_version" ] ; then     wget -q "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" &&     echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - &&     tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" ;  else     wget -q "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" &&     echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" | sha512sum -c - &&     tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ;  fi
 ---> Running in acf46391dc2f
spark-3.3.0-bin-hadoop3.tgz: OK

docker build --pull --rm -f "Dockerfile" -t dockerstacks:latest . --build-arg scala_version="2.13"
Sending build context to Docker daemon  5.632kB
Step 1/24 : ARG OWNER=jupyter
Step 2/24 : ARG BASE_CONTAINER=$OWNER/scipy-notebook
Step 3/24 : FROM $BASE_CONTAINER

(..)

Step 15/24 : RUN if [ -z "$scala_version" ] ; then     wget -q "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" &&     echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - &&     tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" ;  else     wget -q "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" &&     echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" | sha512sum -c - &&     tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ;  fi
 ---> Running in b1b26b44bebd
spark-3.3.0-bin-hadoop3-scala2.13.tgz: FAILED
sha512sum: WARNING: 1 computed checksum did NOT match



@mathbunnyru (Member) left a comment:
I made some suggestions.

Also, please remove the docs saying old Spark versions won't build properly (because they should work now).

(Six review threads on pyspark-notebook/Dockerfile, outdated; all resolved.)
@bjornjorgensen (Contributor, Author)
Tested with the following and they are OK:

docker build --rm --force-rm -t jupyter/pyspark-notebook:spark-3.3.0 . \
  --build-arg spark_version=3.3.0 \
  --build-arg hadoop_version=3 \
  --build-arg spark_checksum=4c09dac70e22bf1d5b7b2cabc1dd92aba13237f52a5b682c67982266fc7a0f5e0f964edff9bc76adbd8cb444eb1a00fdc59516147f99e4e2ce068420ff4881f0 \
  --build-arg openjdk_version=17 \
  --build-arg scala_version="2.13"

docker build --rm --force-rm -t pyspark-notebook:spark-3.3.0-def . \
  --build-arg spark_version=3.3.0 \
  --build-arg hadoop_version=3 \
  --build-arg spark_checksum=1e8234d0c1d2ab4462d6b0dfe5b54f2851dcd883378e0ed756140e10adfb5be4123961b521140f580e364c239872ea5a9f813a20b73c69cb6d4e95da2575c29c \
  --build-arg openjdk_version=17

(Two review threads on pyspark-notebook/Dockerfile, outdated; resolved.)
@bjornjorgensen (Contributor, Author)
Testing with

RUN if [ -z "${scala_version}" ] ; then \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" && ;\
    #echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
    #tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    #rm "spark.tgz" ;\
  else \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" && ;\
    #echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
    #tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    #rm "spark.tgz" ;\
  fi \
    echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
    tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    rm "spark.tgz"

Gives this error:

Step 15/24 : RUN if [ -z "${scala_version}" ] ; then     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" && ;  else     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" && ;  fi     echo "${spark_checksum} *spark.tgz" | sha512sum -c - &&     tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark.tgz"
 ---> Running in 64a679d8a650
/bin/bash: -c: line 0: syntax error near unexpected token `;'
/bin/bash: -c: line 0: `if [ -z "${scala_version}" ] ; then     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" && ;  else     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" && ;  fi     echo "${spark_checksum} *spark.tgz" | sha512sum -c - &&     tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark.tgz"'

@bjornjorgensen bjornjorgensen changed the title [WIP] add Scala as choice Add Scala as choice Jul 27, 2022
@bjornjorgensen bjornjorgensen changed the title Add Scala as choice Add Scala as --build-arg Jul 27, 2022
@Bidek56 (Contributor) commented Jul 27, 2022

You can make it conditional.

RUN  [ -z "${scala_version}" ] && \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
RUN  [ -n "${scala_version}" ] && \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz"

echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
rm "spark.tgz"

@bjornjorgensen (Contributor, Author)
@Bidek56 Yes, but that uses two RUN instructions, and @mathbunnyru and I don't want that.

@Bidek56 (Contributor) commented Jul 27, 2022
How about this with a single RUN?

RUN  [ -n "${scala_version}" ] && \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" || \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"

echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
rm "spark.tgz"
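(As an editorial aside: a `cond && A || B` chain is not a true if/else. If `A` itself fails, e.g. the Scala-specific download 404s, the fallback `B` runs even though the condition was true. A minimal sketch, with `false` standing in for a failed wget:)

```shell
#!/bin/sh
# `cond && A || B` runs B whenever cond fails OR A fails -- not a true if/else.
scala_version="2.13"
# Condition is true, but the middle command (a stand-in for a failed wget)
# fails, so the fallback branch still executes.
result=$( [ -n "${scala_version}" ] && false || echo "fallback ran" )
echo "${result}"
```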

@bjornjorgensen (Contributor, Author)
Sending build context to Docker daemon 10.24kB
Error response from daemon: dockerfile parse error line 40: unknown instruction: ECHO

@bjornjorgensen (Contributor, Author)
@Bidek56 Try with

RUN  [ -n "${scala_version}" ] && \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" || \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \

    echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
    tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    rm "spark.tgz"

But then I get

Step 15/24 : RUN  [ -n "${scala_version}" ] &&     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ||     wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"     echo "${spark_checksum} *spark.tgz" | sha512sum -c - &&     tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark.tgz"
 ---> Running in 18ccdf491990
sha512sum: 'standard input': no properly formatted SHA512 checksum lines found

@Bidek56 (Contributor) commented Jul 27, 2022
I think it needs to be:

RUN ( [ -n "${scala_version}" ] && \ 
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" || \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" ) && \
    echo "${spark_checksum} *spark.tgz" | sha512sum -c - &&     tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner &&     rm "spark.tgz"

but I am still testing it.

@mathbunnyru (Member)
@bjornjorgensen this example should help you to reduce code duplication:

FROM ubuntu

ARG my_var

RUN if [ "${my_var}" = "" ]; then \
        echo "My var was not provided"; \
        touch another_command.txt; \
    else \
        echo "My var was provided and it's value is ${my_var}"; \
        touch another_command.txt; \
    fi && \
    echo "This is common part" && \
    touch doing_common_stuff.txt

@Bidek56 (Contributor) commented Jul 27, 2022
docker build does not seem to like this command:

RUN if [ "${scala_version}" = "" ] ; then \
        wget --quiet -O "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"; \
    else \
        wget --quiet -O "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ; \
    fi

I get:

executor failed running [/bin/sh -c if [ "${scala_version}" = "" ] ; then wget --quiet -O "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"; else wget --quiet -O "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ; fi]: exit code: 5

@bjornjorgensen (Contributor, Author) commented Jul 27, 2022
This seems to work:

RUN if [ -z "${scala_version}" ] ; then \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" ;\
  else \
    wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ;\
  fi && \
  echo "${spark_checksum} *spark.tgz" | sha512sum -c - && \
  tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && \
  rm "spark.tgz"

and this one:

RUN if [ -z "${scala_version}" ] ; then \
    ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" ${SPARK_HOME} ;\
  else \
    ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}" ${SPARK_HOME} ;\
  fi && \
  # Add a link in the before_notebook hook in order to source automatically PYTHONPATH && \
  mkdir -p /usr/local/bin/before-notebook.d && \
  ln -s "${SPARK_HOME}/sbin/spark-config.sh" /usr/local/bin/before-notebook.d/spark-config.sh
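(Editorial sketch: the if/else branch selection in the snippets above can be checked with plain sh outside Docker; echoing the archive name stands in for the real wget/tar/ln calls.)

```shell
#!/bin/sh
# Verify which Spark archive name the if/else picks, without Docker or wget.
APACHE_SPARK_VERSION="3.3.0"
HADOOP_VERSION="3"
scala_version="2.13"   # leave empty to exercise the default branch

if [ -z "${scala_version}" ] ; then
  archive="spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
else
  archive="spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz"
fi
echo "${archive}"
```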

@Bidek56 (Contributor) commented Jul 27, 2022
It's failing for me locally but if it works in GA then great!

executor failed running [/bin/sh -c if [ -z "${scala_version}" ] ; then wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" ; else wget -qO "spark.tgz" "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}-scala${scala_version}.tgz" ; fi && echo "${spark_checksum} *spark.tgz" | sha512sum -c - && tar xzf "spark.tgz" -C /usr/local --owner root --group root --no-same-owner && rm "spark.tgz"]: exit code: 5

@bjornjorgensen (Contributor, Author)
@Bidek56 exit code 5 from wget points to a network/SSL problem.

This works for me. I tested it on Manjaro with Docker 20.10.17:

docker build --rm --force-rm -t pyspark-notebook:spark-3.3.0 . \
  --build-arg spark_version=3.3.0 \
  --build-arg hadoop_version=3 \
  --build-arg spark_checksum=4c09dac70e22bf1d5b7b2cabc1dd92aba13237f52a5b682c67982266fc7a0f5e0f964edff9bc76adbd8cb444eb1a00fdc59516147f99e4e2ce068420ff4881f0 \
  --build-arg openjdk_version=17 \
  --build-arg scala_version="2.13"

docker build --rm --force-rm -t pyspark-notebook:spark-3.3.0-def . \
  --build-arg spark_version=3.3.0 \
  --build-arg hadoop_version=3 \
  --build-arg spark_checksum=1e8234d0c1d2ab4462d6b0dfe5b54f2851dcd883378e0ed756140e10adfb5be4123961b521140f580e364c239872ea5a9f813a20b73c69cb6d4e95da2575c29c \
  --build-arg openjdk_version=17

@mathbunnyru (Member)
Squash-merged this to main.

Closes: [BUG] - pyspark-notebook no longer builds for Spark 3.1.3, 3.1.2, 2.4.8