Add a new option for an alternate mirror for spark binaries #104
@@ -1,6 +1,8 @@
services:
  spark:
    version: 1.6.0
    # distribution: # optional; default to '2.6'
    # download-source: # optional; default to 'https://s3.amazonaws.com/spark-related-packages/spark-${version}-bin-hadoop${distribution}.tgz'

Review comment: I prefer the variable substitution to be done in Python and not Bash, so the template variable should be […]. Same style nitpicks as above.

    # git-commit: latest # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
    # git-repository: # optional; defaults to https://github.com/apache/spark
  hdfs:
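The reviewer's preference for doing the substitution in Python rather than Bash fits `string.Template` from the standard library, whose `${version}`-style placeholders match the default template shown in the config above. A rough sketch of that approach (`build_download_url` is a hypothetical helper, not code from this PR):

```python
from string import Template

def build_download_url(template, version, distribution):
    # Expand a download-source template of the kind the config documents.
    # Template.substitute raises KeyError if a placeholder is missing,
    # which surfaces config typos early instead of producing a bad URL.
    return Template(template).substitute(
        version=version, distribution=distribution)

url = build_download_url(
    "https://s3.amazonaws.com/spark-related-packages/"
    "spark-${version}-bin-hadoop${distribution}.tgz",
    version="1.6.0",
    distribution="2.6")
# url == "https://s3.amazonaws.com/spark-related-packages/spark-1.6.0-bin-hadoop2.6.tgz"
```

Unlike the Bash `eval` seen later in this diff, this never hands user-supplied text to a shell, so a malicious template cannot execute commands.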
@@ -182,10 +182,17 @@ def cli(cli_context, config, provider):
@click.option('--install-spark/--no-install-spark', default=True)
@click.option('--spark-version',
              help="Spark release version to install.")
@click.option('--spark-distribution',
              help="Hadoop distribution for Spark release to install.", default='2.6')
@click.option('--spark-git-commit',
              help="Git commit to build Spark from. "
                   "Set to 'latest' to build Spark from the latest commit on the "
                   "repository's default branch.")
@click.option('--spark-download-source',
              help="HTTP source to download the Spark binaries from. "
                   "Available variables: file, spark_version, distribution",
              default="https://s3.amazonaws.com/spark-related-packages/spark-${version}-bin-hadoop${distribution}.tgz",

Review comment: Same comment about […].

              show_default=True)
@click.option('--spark-git-repository',
              help="Git repository to clone Spark from.",
              default='https://github.com/apache/spark',
@@ -220,8 +227,10 @@ def launch(
        hdfs_version,
        install_spark,
        spark_version,
        spark_distribution,
        spark_git_commit,
        spark_git_repository,
        spark_download_source,
        assume_yes,
        ec2_key_name,
        ec2_identity_file,
@@ -286,7 +295,7 @@ def launch(
        services += [hdfs]
    if install_spark:
        if spark_version:
            spark = Spark(version=spark_version)
            spark = Spark(version=spark_version, distribution=spark_distribution, download_source=spark_download_source)
        elif spark_git_commit:
            print(
                "Warning: Building Spark takes a long time. "
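For context, the new options have to travel from `launch()` into the `Spark` service object and on to the install script. A minimal stand-in for that plumbing (the real Flintrock `Spark` class does much more; this constructor shape is only an assumption based on the call site above):

```python
class Spark:
    # Assumed default, mirroring the --spark-download-source default above.
    DEFAULT_SOURCE = (
        "https://s3.amazonaws.com/spark-related-packages/"
        "spark-${version}-bin-hadoop${distribution}.tgz")

    def __init__(self, version, distribution='2.6', download_source=None):
        self.version = version
        self.distribution = distribution
        self.download_source = download_source or self.DEFAULT_SOURCE

    def install_script_args(self):
        # Positional arguments handed to the install script as "$1" "$2" "$3".
        return [self.version, self.distribution, self.download_source]

spark = Spark(version="1.6.0", distribution="2.6")
```

Keeping the template unexpanded until install time lets the same service object serve every node, but expanding it in Python first (as the reviewer suggests) would let this method return a final URL instead.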
@@ -2,22 +2,26 @@
set -e

spark_version="$1"
version="$1"
distribution="$2"
download_source="$3"

url=$(eval "echo \"$download_source\"")

Review comment: I think doing the variable substitution in Python should eliminate code smells like this one.

file="${url##*/}"

echo "Installing Spark..."
echo "  version: ${spark_version}"
echo "  distribution: ${distribution}"
file="spark-${spark_version}-bin-${distribution}.tgz"
echo "  download source: ${download_source}"
echo "Final Spark URL: ${url}"

# S3 is generally reliable, but sometimes when launching really large
# clusters it can hiccup on us, in which case we'll need to retry the
# download.
set +e
tries=1
while true; do
  curl --remote-name "https://s3.amazonaws.com/spark-related-packages/${file}"
  curl --remote-name "${url}"
  curl_ret=$?

  if ((curl_ret == 0)); then
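The two Bash idioms above have direct, `eval`-free equivalents on the Python side: `file="${url##*/}"` is a basename operation, and the curl loop maps onto a bounded retry. A sketch along those lines (helper names are invented for illustration):

```python
import posixpath
import time
import urllib.request
from urllib.parse import urlparse

def filename_from_url(url):
    # Python equivalent of the script's file="${url##*/}" expansion.
    return posixpath.basename(urlparse(url).path)

def download_with_retries(url, dest, max_tries=3, delay=1):
    # Mirrors the script's retry loop: S3 occasionally hiccups when
    # launching large clusters, so transient failures are retried.
    for attempt in range(1, max_tries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except OSError:
            if attempt == max_tries:
                raise
            time.sleep(delay)

name = filename_from_url(
    "https://s3.amazonaws.com/spark-related-packages/"
    "spark-1.6.0-bin-hadoop2.6.tgz")
# name == "spark-1.6.0-bin-hadoop2.6.tgz"
```

`urllib.error.URLError` subclasses `OSError`, so the single `except OSError` covers both network failures and local file errors.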
Review comment: Style nitpick: two spaces before the #; "defaults" and not "default".

Review comment: Hmm, can we leave out the ability to specify distribution for now? I'm not sure about how best to name this option (e.g. there are non-Hadoop distributions like CDH, but we are assuming Hadoop) and, more importantly, I haven't fully considered the implications of supporting user-specified distributions.