Add option to download Spark from a custom URL #125

Merged: 5 commits merged into nchammas:master on Jun 29, 2016
Conversation

@BenFradet (Contributor) commented Jun 22, 2016

This PR adds a download_source option to the Spark service, as was done in #118.

I created clusters both with and without a download_source to test the new feature.

Fixes #101, fixes #88.
Closes #104.
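
For illustration, here is a minimal sketch of how a {v} version template in the download source could be resolved before fetching (assumed names; a sketch of the idea, not the PR's exact code):

    # Hypothetical sketch: expand the {v} version template in the
    # configured download source to get the final URL.
    download_source = (
        'https://archive.apache.org/dist/spark/'
        'spark-{v}/spark-{v}-bin-hadoop2.6.tgz')
    version = '1.6.2'

    url = download_source.format(v=version)
    # -> https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz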

            'git_commit': git_commit,
            'git_repository': git_repository}

    def install(
            self,
            ssh_client: paramiko.client.SSHClient,
            cluster: FlintrockCluster):
        # TODO: Allow users to specify the Spark "distribution". (?)
        distribution = 'hadoop2.6'
@BenFradet (Contributor, Author):

As a follow-up, we could support a {d} template in download_source, as is done for the version with {v}.
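
A minimal sketch of that follow-up (hypothetical; note that str.format() ignores unused keyword arguments, so {v}-only URLs would keep working unchanged):

    # Hypothetical follow-up: expand a {d} distribution template
    # alongside the existing {v} version template.
    url = download_source.format(v=version, d=distribution)
    # 'https://.../spark-{v}-bin-{d}.tgz' with v='1.6.2', d='hadoop2.4'
    # -> 'https://.../spark-1.6.2-bin-hadoop2.4.tgz'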

@nchammas (Owner):

Agreed, and that would address #88, though it seems like with this PR you can already choose your distribution at will, right?

@BenFradet (Contributor, Author):

Yup, you can choose your distribution if you specify your own download source.

However, we might want to support the use case of someone specifying only the Spark version and distribution. What do you think?

@nchammas (Owner):

Hmm, for now let's leave it like this. I have some vague concerns about "officially" supporting other distributions, in case they have annoying problems that we would have to work around. With the download source option, people who really want a different distribution can get it, and we have a bit more of an excuse to deflect support if there are serious issues.

It's definitely something I am open to revisiting in the future, though.

@nchammas (Owner) commented:

Thank you for this PR @BenFradet. Looks good to me! I left some minor comments.

@BenFradet (Contributor, Author) commented:

Great, thanks for your review, will update accordingly.

@ereed-tesla commented Jun 28, 2016

This is slick -- S3 support via the Spark hadoop-2.4 binary is pretty convenient. Is there anything remaining to get this merged in?

I confirmed this PR works by merging it into master and doing the following:

download-source: "http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.4.tgz"
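
For reference, that option lives under the Spark service in the cluster config file, next to the version; a sketch of the layout (keys as used above):

    services:
      spark:
        version: 1.6.2
        download-source: "http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.4.tgz"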

$ flintrock launch erik-flintrock --ec2-instance-type m3.medium
Launching 2 instances...
[52.40.196.100] SSH online.
[52.40.196.100] Configuring ephemeral storage...
[52.40.196.100] Installing HDFS...
[52.39.213.68] SSH online.
[52.39.213.68] Configuring ephemeral storage...
[52.39.213.68] Installing HDFS...
[52.39.213.68] Installing Spark...
[52.40.196.100] Installing Spark...
[52.39.213.68] Configuring HDFS master...
[52.39.213.68] Configuring Spark master...
HDFS online.
Spark Health Report:
  * Master: ALIVE
  * Workers: 1
  * Cores: 1
  * Memory: 2.7 GB            
launch finished in 0:04:00.

Then I successfully fetched some data from S3:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Python version 3.5.1 (default, Dec  7 2015 11:16:01)
SparkContext available as sc, HiveContext available as sqlContext.   

In [1]: df = sqlContext.parquetFile('s3n://XYZ.gz.parquet')

In [2]: df.count()
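
For anyone reproducing this: reading from s3n:// requires AWS credentials to be visible to Hadoop. One way to supply them inside the pyspark shell shown above (where sc and sqlContext are predefined) is via the Hadoop configuration; the keys below are the standard s3n ones, and the values are placeholders:

    # Placeholders, not real keys; credentials can also come from
    # environment variables or an IAM instance role.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set('fs.s3n.awsAccessKeyId', '<access-key>')
    hadoop_conf.set('fs.s3n.awsSecretAccessKey', '<secret-key>')

    df = sqlContext.parquetFile('s3n://XYZ.gz.parquet')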

@nchammas (Owner) commented:

Took a second look at this. Looks good to me.

And thanks @ereed-tesla for testing it out. That speeds things up for me, since I can skip testing it myself if I am already comfortable with the PR.

Merging this in.

@nchammas nchammas merged commit b7380ae into nchammas:master Jun 29, 2016
exLittlePond pushed a commit to devsisters/flintrock that referenced this pull request Jul 22, 2016
* master:
  0.6.0 dev begins
  add some minor steps
  update standalone version in example
  this is 0.5.0
  upgrade dependencies (nchammas#128)
  use latest Amazon Linux AMI
  rephrase note about future Windows support
  remove note about squashing PR commits
  up default Spark version to 1.6.2
  add CHANGES for spark download source and additional security groups
  rename some internals related to security groups
  Resolve nchammas#72 add --ec2-security-group flag support (nchammas#112)
  added HADOOP_LIBEXEC_DIR env var (nchammas#127)
  Add option to download Spark from a custom URL (nchammas#125)
  add custom Hadoop URL change; reformat Markdown links
Successfully merging this pull request may close these issues:

  * Add option to download Spark from a custom URL
  * Cannot install package pre-built for Hadoop 2.4