Add option to download Spark from a custom URL #125

Merged
merged 5 commits into from
Jun 29, 2016
Merged
8 changes: 7 additions & 1 deletion flintrock/config.yaml.template
@@ -3,10 +3,16 @@ services:
     version: 1.6.1
     # git-commit: latest  # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
     # git-repository:  # optional; defaults to https://github.com/apache/spark
+    # optional; defaults to download from the official Spark S3 bucket
+    #   - must contain a {v} template corresponding to the version
+    #   - Spark must be pre-built
+    #   - must be a .tar.gz file
+    # download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
Contributor Author:
Maybe we should also mention that the download_source has to point to a pre-built Spark?

Owner:
Perhaps something like this?

# optional; defaults to download from the official Spark S3 bucket
#   - must contain a {v} template corresponding to the version
#   - Spark must be pre-built
#   - must be a .tar.gz file
# download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"

And then we should update the matching comment for the Hadoop download source to follow similar formatting.

Contributor Author:
Yup, this seems clearer

   hdfs:
     version: 2.7.2
     # optional; defaults to download from a dynamically selected Apache mirror
-    # must contain a {v} template corresponding to the version; must be a .tar.gz file
+    #   - must contain a {v} template corresponding to the version
+    #   - must be a .tar.gz file
     # download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
 
 provider: ec2
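
A minimal sketch (plain Python, not Flintrock code) of what the {v} requirement above means in practice: the configured version is substituted into the URL with str.format, so a download-source without the placeholder can never resolve to a versioned file. The URL below is the example placeholder from the template, not a real mirror.

# Hypothetical illustration of the {v} template contract described above.
download_source = "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
version = "1.6.1"

# A source without the {v} placeholder cannot be resolved to a versioned file.
if "{v}" not in download_source:
    raise ValueError("download-source must contain a {v} template")

url = download_source.format(v=version)
print(url)  # https://www.example.com/files/spark/1.6.1/spark-1.6.1.tar.gz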
7 changes: 6 additions & 1 deletion flintrock/flintrock.py
@@ -186,6 +186,10 @@ def cli(cli_context, config, provider):
 @click.option('--install-spark/--no-install-spark', default=True)
 @click.option('--spark-version',
               help="Spark release version to install.")
+@click.option('--spark-download-source',
+              help="URL to download a release of Spark from.",
+              default='https://s3.amazonaws.com/spark-related-packages/spark-{v}-bin-hadoop2.6.tgz',
+              show_default=True)
 @click.option('--spark-git-commit',
               help="Git commit to build Spark from. "
                    "Set to 'latest' to build Spark from the latest commit on the "
@@ -227,6 +231,7 @@ def launch(
         spark_version,
         spark_git_commit,
         spark_git_repository,
+        spark_download_source,
         assume_yes,
         ec2_key_name,
         ec2_identity_file,
@@ -289,7 +294,7 @@ def launch(
             services += [hdfs]
     if install_spark:
         if spark_version:
-            spark = Spark(version=spark_version)
+            spark = Spark(version=spark_version, download_source=spark_download_source)
         elif spark_git_commit:
             print(
                 "Warning: Building Spark takes a long time. "
10 changes: 4 additions & 6 deletions flintrock/scripts/install-spark.sh
@@ -2,22 +2,20 @@

Contributor Author:
Out of curiosity, why is the download-hadoop script written in Python, whereas this one is in Bash?

Owner:
This is an unfortunate inconsistency, and eventually I think both scripts should be in Python. download-hadoop was written in Python because of the Apache mirror selection logic, which seemed like a bit much to do purely in Bash.

 set -e
 
-spark_version="$1"
-distribution="$2"
+url="$1"
 
 echo "Installing Spark..."
-echo "  version: ${spark_version}"
-echo "  distribution: ${distribution}"
+echo "  from: ${url}"
 
-file="spark-${spark_version}-bin-${distribution}.tgz"
+file="$(basename ${url})"
 
 # S3 is generally reliable, but sometimes when launching really large
 # clusters it can hiccup on us, in which case we'll need to retry the
 # download.
 set +e
 tries=1
 while true; do
-    curl --remote-name "https://s3.amazonaws.com/spark-related-packages/${file}"
+    curl --remote-name "${url}"
     curl_ret=$?
 
     if ((curl_ret == 0)); then
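
A rough Python analogue of the revised script (illustration only; the real install-spark.sh stays in Bash, as discussed above): the local filename is derived from the URL, mirroring $(basename ${url}), and the download is retried because very large cluster launches occasionally hit transient failures.

import os
import time
import urllib.request

def download_with_retries(url: str, max_tries: int = 3) -> str:
    """Fetch url into the current directory, retrying on transient errors."""
    file = os.path.basename(url)  # same idea as $(basename ${url})
    for attempt in range(1, max_tries + 1):
        try:
            urllib.request.urlretrieve(url, file)
            return file
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == max_tries:
                raise
            time.sleep(2 ** attempt)  # back off a little between tries

# Example (hypothetical URL):
# download_with_retries("https://www.example.com/files/spark/1.6.1/spark-1.6.1.tar.gz")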
14 changes: 7 additions & 7 deletions flintrock/services.py
@@ -202,28 +202,29 @@ def health_check(self, master_host: str):


 class Spark(FlintrockService):
-    def __init__(self, version: str=None, git_commit: str=None, git_repository: str=None):
+    def __init__(self, version: str=None, download_source: str=None,
+                 git_commit: str=None, git_repository: str=None):
         # TODO: Convert these checks into something that throws a proper exception.
         #       Perhaps reuse logic from CLI.
         assert bool(version) ^ bool(git_commit)
         if git_commit:
             assert git_repository
 
         self.version = version
+        self.download_source = download_source
         self.git_commit = git_commit
         self.git_repository = git_repository
 
         self.manifest = {
             'version': version,
+            'download_source': download_source,
             'git_commit': git_commit,
             'git_repository': git_repository}
 
     def install(
             self,
             ssh_client: paramiko.client.SSHClient,
Contributor Author:
As a follow-up, we could support a {d} template in download_source, as is done for the version with {v} (a sketch of this idea appears after the diff below).

Owner:
Agreed, and that would address #88, though it seems like with this PR you can already choose your distribution at will, right?

Contributor Author:
Yup, you can choose your distribution if you specify your own download source.

However, we might want to support the use case of someone only specifying the Spark version and the distribution. What do you think?

Owner:
Hmm, for now let's leave it like this. I have some vague concerns about "officially" supporting other distributions, in case they have annoying problems that we would have to work around. With the download source option, people who really want a different distribution can get it, and we have a bit more of an excuse to deflect support if there are serious issues.

It's definitely something I am open to revisiting in the future, though.

             cluster: FlintrockCluster):
-        # TODO: Allow users to specify the Spark "distribution". (?)
-        distribution = 'hadoop2.6'
 
         print("[{h}] Installing Spark...".format(
             h=ssh_client.get_transport().getpeername()[0]))
@@ -235,15 +236,14 @@ def install(
             localpath=os.path.join(SCRIPTS_DIR, 'install-spark.sh'),
             remotepath='/tmp/install-spark.sh')
         sftp.chmod(path='/tmp/install-spark.sh', mode=0o755)
+        url = self.download_source.format(v=self.version)
         ssh_check_output(
             client=ssh_client,
             command="""
                 set -e
-                /tmp/install-spark.sh {spark_version} {distribution}
+                /tmp/install-spark.sh {url}
                 rm -f /tmp/install-spark.sh
-            """.format(
-                spark_version=shlex.quote(self.version),
-                distribution=shlex.quote(distribution)))
+            """.format(url=shlex.quote(url)))
         else:
             ssh_check_output(
                 client=ssh_client,
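
A purely hypothetical sketch of the {d} idea from the thread above; it was not adopted in this PR. With both placeholders, the default S3 URL could be templated on the distribution as well as the version:

# Hypothetical {d} (distribution) template; only {v} is supported by this PR.
template = "https://s3.amazonaws.com/spark-related-packages/spark-{v}-bin-{d}.tgz"
url = template.format(v="1.6.1", d="hadoop2.6")
print(url)  # https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz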