Spark: Update Hadoop/AWS Sdk version + set all to Spark3 (#334)
akhurana001 authored Jan 21, 2021
1 parent ab0e02a commit 0b13ee6
Showing 4 changed files with 16 additions and 18 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -43,6 +43,8 @@ pip install "flytekit[spark]" for Spark 2.4.x
pip install "flytekit[spark3]" for Spark 3.x
```

+Please note that Spark 2.4 support is deprecated and will be removed in a future release.
+
#### Schema

If `Types.Schema()` is to be used for computations involving large dataframes, one should install the `schema` extension.
@@ -80,11 +82,11 @@ pip install flytekit[tensorflow]
To install all or multiple available plugins, one can specify them individually:

```bash
-pip install "flytekit[sidecar,spark,schema]"
+pip install "flytekit[sidecar,spark3,schema]"
```

Or install them with the `all` or `all-spark2.4` or `all-spark3` directives which will install all the plugins and a specific Spark version.
-Please note that `all` currently defaults to Spark 2.4.x. In a future release (starting 0.15.x), `all` will be switched to use Spark 3.x.
+Please note that `all` defaults to Spark 3.0 and Spark 2.4 support will be fully removed in a future release.


```bash
5 changes: 3 additions & 2 deletions flytekit/tools/lazy_loader.py
@@ -36,8 +36,9 @@ def get_extras_require(cls):

d["all-spark2.4"] = all_plugins_spark2
d["all-spark3"] = all_plugins_spark3
-# all points to Spark 2.4
-d["all"] = all_plugins_spark2
+# all points to Spark 3.x.
+# Spark 2.4 to be fully removed in a future release.
+d["all"] = all_plugins_spark3
return d
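
A minimal sketch of the aliasing this hunk changes, under simplifying assumptions: `build_extras` and the requirement strings below are hypothetical names chosen for illustration, not flytekit's actual `get_extras_require` implementation. It shows `all` resolving to the same requirement list as `all-spark3`, while `all-spark2.4` remains available as an explicit opt-in.

```python
def build_extras(plugins):
    """Return a setuptools-style extras_require mapping.

    `plugins` maps an extra name to its list of requirement strings.
    (Hypothetical helper, for illustration only.)
    """
    d = {name: list(reqs) for name, reqs in plugins.items()}

    # Requirements shared by every "all-*" bundle: everything except the two Spark variants.
    common = sorted(
        {
            req
            for name, reqs in plugins.items()
            if name not in ("spark", "spark3")
            for req in reqs
        }
    )

    d["all-spark2.4"] = common + list(plugins.get("spark", []))
    d["all-spark3"] = common + list(plugins.get("spark3", []))
    # After this commit, `all` points to the Spark 3.x bundle.
    d["all"] = d["all-spark3"]
    return d


# Placeholder requirement strings; the real plugin requirements are not shown in the diff.
extras = build_extras(
    {
        "sidecar": ["sidecar-dep"],
        "schema": ["schema-dep"],
        "spark": ["pyspark>=2.4,<3"],
        "spark3": ["pyspark>=3"],
    }
)
print(extras["all"])
```

With this layout, `pip install "flytekit[all]"` and `pip install "flytekit[all-spark3]"` would pull the same set of packages, which matches the README wording in this commit.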


4 changes: 3 additions & 1 deletion scripts/flytekit_install_spark.sh
@@ -1,7 +1,9 @@
#!/bin/bash

+# DEPRECATED
+# Please note that Spark 2.4 support is deprecated and will be fully removed in a Future Release.
+#
# Fetches and install Spark and its dependencies. To be invoked by the Dockerfile

# echo commands to the terminal output
set -ex

19 changes: 6 additions & 13 deletions scripts/flytekit_install_spark3.sh
@@ -22,8 +22,8 @@ mkdir -p /opt/spark/work-dir
touch /opt/spark/RELEASE

# Fetch Spark Distribution
-wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz -O spark-dist.tgz
-echo '98f6b92e5c476d7abb93cc179c2616aa5dc897da25753bd197e20ef54a28d945 spark-dist.tgz' | sha256sum --check
+wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz -O spark-dist.tgz
+echo 'e2d05efa1c657dd5180628a83ea36c97c00f972b4aee935b7affa2e1058b0279 spark-dist.tgz' | sha256sum --check
mkdir -p spark-dist
tar -xvf spark-dist.tgz -C spark-dist --strip-components 1

@@ -41,14 +41,7 @@ chmod +x /opt/entrypoint.sh
rm -rf spark-dist.tgz
rm -rf spark-dist

-# Fetch Hadoop Distribution with AWS Support
-wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz -O hadoop-dist.tgz
-echo 'd129d08a2c9dafec32855a376cbd2ab90c6a42790898cabbac6be4d29f9c2026 hadoop-dist.tgz' | sha256sum --check
-mkdir -p hadoop-dist
-tar -xvf hadoop-dist.tgz -C hadoop-dist --strip-components 1
-
-cp -rf hadoop-dist/share/hadoop/tools/lib/hadoop-aws-2.7.7.jar /opt/spark/jars
-cp -rf hadoop-dist/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar /opt/spark/jars
-
-rm -rf hadoop-dist.tgz
-rm -rf hadoop-dist
+# Hadoop dist (via Apache) has older AWS SDK version. Fetch required AWS jars from maven directly (not-ideal) to support IAM role
+# https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar -P /opt/spark/jars
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.740/aws-java-sdk-bundle-1.11.740.jar -P /opt/spark/jars
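
The Spark tarball in this script is checked with `sha256sum --check`, while the two jars fetched from Maven are downloaded without a checksum step. As a hedged illustration of what such a check does, here is a small Python sketch; `sha256_of` and `verify_sha256` are hypothetical helper names that are not part of this commit, and the expected digests for the jars are not given here and would need to come from a trusted source.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Raise ValueError if the file's digest differs from `expected_hex`."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
    return True
```

Called as `verify_sha256("spark-dist.tgz", "e2d05efa1c657dd5180628a83ea36c97c00f972b4aee935b7affa2e1058b0279")`, this would mirror the check the script already performs on the Spark distribution; the same pattern could be applied to the jars once their digests are known.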
