Enable providing own hadoop for pyspark notebook image #220

t92549 · 2022-04-20T11:30:16Z

In the hdfs and Accumulo Dockerfiles, users can provide their own builds of Accumulo, ZooKeeper and Hadoop to be used instead of building them inside the image:

gaffer-docker/docker/accumulo/Dockerfile

Lines 50 to 54 in e26dbe7

    
    # Allow users to provide their own builds of Accumulo, ZooKeeper and Hadoop
    COPY ./files/ .
    # Otherwise, download official distributions
    RUN if [ ! -f "./accumulo-${ACCUMULO_VERSION}-bin.tar.gz" ]; then \
        (wget -nv -O ./accumulo-${ACCUMULO_VERSION}-bin.tar.gz ${ACCUMULO_DOWNLOAD_URL} || wget -nv -O ./accumulo-${ACCUMULO_VERSION}-bin.tar.gz ${ACCUMULO_BACKUP_DOWNLOAD_URL}); \

Because the archives are not downloaded again on every build, this can save a lot of time with repeated builds.
The pyspark notebook Dockerfile has no equivalent, however: it always downloads Hadoop, with no way to supply a local build:

gaffer-docker/docker/gaffer-pyspark-notebook/Dockerfile

Lines 34 to 39 in e26dbe7

    
    ARG HADOOP_VERSION=3.2.2
    ARG HADOOP_DOWNLOAD_URL="https://www.apache.org/dyn/closer.cgi?action=download&filename=hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
    ARG HADOOP_BACKUP_DOWNLOAD_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
    RUN cd /opt && \
        (wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_DOWNLOAD_URL} || wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_BACKUP_DOWNLOAD_URL}) && \

It would be great if the same option were added to that Dockerfile: copy in a user-provided Hadoop archive if one exists, and only download it otherwise.
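A rough sketch of how this might look, mirroring the pattern from the Accumulo Dockerfile. It is only an illustration, not a tested change. It assumes a `./files/` directory exists in the pyspark notebook build context (none is shown above), that the build runs as a user who can write to `/opt`, and that the later steps of the existing `RUN` chain follow unchanged:

```dockerfile
ARG HADOOP_VERSION=3.2.2
ARG HADOOP_DOWNLOAD_URL="https://www.apache.org/dyn/closer.cgi?action=download&filename=hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
ARG HADOOP_BACKUP_DOWNLOAD_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

# Allow users to provide their own build of Hadoop.
# Assumption: ./files/ exists in the build context (it may hold only a placeholder file).
COPY ./files/ /opt/

# Otherwise, download an official distribution, falling back to the archive mirror.
RUN cd /opt && \
    if [ ! -f "./hadoop-${HADOOP_VERSION}.tar.gz" ]; then \
        (wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_DOWNLOAD_URL} || wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_BACKUP_DOWNLOAD_URL}); \
    fi
# The remaining steps of the existing RUN chain would continue from here.
```

With something like this in place, putting `hadoop-3.2.2.tar.gz` (or whichever version matches `HADOOP_VERSION`) into `files/` before building would skip the download.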

t92549 added the beginner label Apr 20, 2022

t92549 added this to the v2_backlog milestone Apr 20, 2022

GCHQDeveloper314 added the good first issue (Small, lower complexity and doesn't require pre-existing Gaffer knowledge) and Docker (Issue related to the Docker side of the project) labels and removed the beginner label Jul 12, 2023
