Add a new option for an alternate mirror for spark binaries #104

Closed · wants to merge 3 commits
2 changes: 2 additions & 0 deletions flintrock/config.yaml.template
@@ -1,6 +1,8 @@
services:
spark:
version: 1.6.0
# distribution: # optional; default to '2.6'
Owner:
Style nitpick: Two spaces before the #; "defaults" and not "default"

Owner:
Hmm, can we leave out the ability to specify distribution for now? I'm not sure about how best to name this option (e.g. there are non-Hadoop distributions like CDH, but we are assuming Hadoop) and, more importantly, I haven't fully considered the implications of supporting user-specified distributions.

# download-source: # optional; default to 'https://s3.amazonaws.com/spark-related-packages/spark-${version}-bin-hadoop${distribution}.tgz'
Owner:
I prefer the variable substitution to be done in Python and not Bash, so the template variable should be {version} and not ${version}. This is consistent with how Flintrock substitutes variables in the config templates, for example.

Same style nitpicks as above.
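
To illustrate the suggestion, here is a minimal sketch of what Python-side substitution with str.format could look like. The helper name resolve_download_source and the constant are hypothetical, not code from this PR:

DEFAULT_DOWNLOAD_SOURCE = (
    'https://s3.amazonaws.com/spark-related-packages/'
    'spark-{version}-bin-hadoop{distribution}.tgz')

def resolve_download_source(template: str, version: str, distribution: str) -> str:
    # Expand {version} and {distribution} in Python, before the URL
    # ever reaches the shell, so install-spark.sh needs no eval.
    return template.format(version=version, distribution=distribution)

# Example:
# resolve_download_source(DEFAULT_DOWNLOAD_SOURCE, '1.6.0', '2.6')
# -> 'https://s3.amazonaws.com/spark-related-packages/spark-1.6.0-bin-hadoop2.6.tgz'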

# git-commit: latest # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
# git-repository: # optional; defaults to https://github.com/apache/spark
hdfs:
11 changes: 10 additions & 1 deletion flintrock/flintrock.py
@@ -182,10 +182,17 @@ def cli(cli_context, config, provider):
@click.option('--install-spark/--no-install-spark', default=True)
@click.option('--spark-version',
help="Spark release version to install.")
@click.option('--spark-distribution',
help="Hadoop distribution for Spark release to install.", default='2.6')
@click.option('--spark-git-commit',
help="Git commit to build Spark from. "
"Set to 'latest' to build Spark from the latest commit on the "
"repository's default branch.")
@click.option('--spark-download-source',
help="HTTP source to download the spark binaries. "
"Available variable : file, spark_version, distribution",
default="https://s3.amazonaws.com/spark-related-packages/spark-${version}-bin-hadoop${distribution}.tgz",
Owner:
Same comment about {version} vs. ${version}.

show_default=True)
@click.option('--spark-git-repository',
help="Git repository to clone Spark from.",
default='https://github.com/apache/spark',
@@ -220,8 +227,10 @@ def launch(
hdfs_version,
install_spark,
spark_version,
spark_distribution,
spark_git_commit,
spark_git_repository,
spark_download_source,
assume_yes,
ec2_key_name,
ec2_identity_file,
@@ -286,7 +295,7 @@ def launch(
services += [hdfs]
if install_spark:
if spark_version:
spark = Spark(version=spark_version)
spark = Spark(version=spark_version, distribution=spark_distribution, download_source=spark_download_source)
elif spark_git_commit:
print(
"Warning: Building Spark takes a long time. "
12 changes: 8 additions & 4 deletions flintrock/scripts/install-spark.sh
@@ -2,22 +2,26 @@

set -e

spark_version="$1"
version="$1"
distribution="$2"
download_source="$3"

url=$(eval "echo \"$download_source\"")
Owner:
I think doing the variable substitution in Python should eliminate code smells like this one.
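
For illustration, one way the call site in services.py could pass a fully expanded URL so the script needs no eval. This is a sketch under the assumption that download_source uses {version}-style placeholders, not code from this PR:

url = self.download_source.format(
    version=self.version,
    distribution=self.distribution)
ssh_check_output(
    client=ssh_client,
    command="""
        set -e
        /tmp/install-spark.sh {version} {distribution} {url}
        rm -f /tmp/install-spark.sh
    """.format(
        version=shlex.quote(self.version),
        distribution=shlex.quote(self.distribution),
        url=shlex.quote(url)))

The script would then read url="$3" directly and derive file="${url##*/}" with no eval, as in the line below.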

file="${url##*/}"

echo "Installing Spark..."
echo " version: ${spark_version}"
echo " distribution: ${distribution}"

file="spark-${spark_version}-bin-${distribution}.tgz"
echo " download source: ${download_source}"
echo "Final Spark URL: ${url}"

# S3 is generally reliable, but sometimes when launching really large
# clusters it can hiccup on us, in which case we'll need to retry the
# download.
set +e
tries=1
while true; do
curl --remote-name "https://s3.amazonaws.com/spark-related-packages/${file}"
curl --remote-name "${url}"
curl_ret=$?

if ((curl_ret == 0)); then
24 changes: 14 additions & 10 deletions flintrock/services.py
@@ -197,28 +197,30 @@ def health_check(self, master_host: str):


class Spark(FlintrockService):
def __init__(self, version: str=None, git_commit: str=None, git_repository: str=None):
def __init__(self, version: str=None, distribution: str=None, git_commit: str=None, git_repository: str=None, download_source: str="https://s3.amazonaws.com/spark-related-packages/spark-${version}-bin-hadoop${distribution}.tgz"):
# TODO: Convert these checks into something that throws a proper exception.
# Perhaps reuse logic from CLI.
assert bool(version) ^ bool(git_commit)
if git_commit:
assert git_repository

self.version = version
self.distribution = distribution
self.git_commit = git_commit
self.git_repository = git_repository
self.download_source = download_source

self.manifest = {
'version': version,
'distribution': distribution,
'git_commit': git_commit,
'git_repository': git_repository}
'git_repository': git_repository,
'download_source': download_source}

def install(
self,
ssh_client: paramiko.client.SSHClient,
cluster: FlintrockCluster):
# TODO: Allow users to specify the Spark "distribution". (?)
distribution = 'hadoop2.6'

print("[{h}] Installing Spark...".format(
h=ssh_client.get_transport().getpeername()[0]))
@@ -234,11 +236,12 @@ def install(
client=ssh_client,
command="""
set -e
/tmp/install-spark.sh {spark_version} {distribution}
/tmp/install-spark.sh {version} {distribution} {download_source}
rm -f /tmp/install-spark.sh
""".format(
spark_version=shlex.quote(self.version),
distribution=shlex.quote(distribution)))
version=shlex.quote(self.version),
distribution=shlex.quote(self.distribution),
download_source=shlex.quote(self.download_source)))
else:
ssh_check_output(
client=ssh_client,
@@ -255,13 +258,14 @@ def install(
cd spark
git reset --hard {commit}
if [ -e "make-distribution.sh" ]; then
./make-distribution.sh -Phadoop-2.6
./make-distribution.sh -Phadoop-{distribution}
else
./dev/make-distribution.sh -Phadoop-2.6
./dev/make-distribution.sh -Phadoop-{distribution}
fi
""".format(
repo=shlex.quote(self.git_repository),
commit=shlex.quote(self.git_commit)))
commit=shlex.quote(self.git_commit),
distribution=shlex.quote(self.distribution)))
except Exception as e:
# TODO: This should be a more specific exception.
print("Error: Failed to install Spark.", file=sys.stderr)