Add option to download Spark from a custom URL #125

Merged: 5 commits merged into nchammas:master on Jun 29, 2016
Conversation

@BenFradet (Contributor) commented Jun 22, 2016

This PR adds a download_source option to the Spark service, as was done in #118.

I created clusters both with and without a download_source to test the new feature.

Fixes #101, fixes #88.
Closes #104.
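
For illustration, here is a minimal sketch of how a {v} version template in the download source could be resolved before fetching (assumed names; a sketch of the idea, not the PR's exact code):

    # Hypothetical sketch: expand the {v} version template in the
    # configured download source to get the final URL.
    download_source = (
        'https://archive.apache.org/dist/spark/'
        'spark-{v}/spark-{v}-bin-hadoop2.6.tgz')
    version = '1.6.2'

    url = download_source.format(v=version)
    # -> https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz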

            'git_commit': git_commit,
            'git_repository': git_repository}

    def install(
            self,
            ssh_client: paramiko.client.SSHClient,
            cluster: FlintrockCluster):
        # TODO: Allow users to specify the Spark "distribution". (?)
        distribution = 'hadoop2.6'
@BenFradet (Contributor, Author):

As a follow-up, we could support a {d} template in download_source, as is done for the version with {v}.
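
A minimal sketch of that follow-up (hypothetical; note that str.format() ignores unused keyword arguments, so {v}-only URLs would keep working unchanged):

    # Hypothetical follow-up: expand a {d} distribution template
    # alongside the existing {v} version template.
    url = download_source.format(v=version, d=distribution)
    # 'https://.../spark-{v}-bin-{d}.tgz' with v='1.6.2', d='hadoop2.4'
    # -> 'https://.../spark-1.6.2-bin-hadoop2.4.tgz'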

@nchammas (Owner):

Agreed, and that would address #88, though it seems like with this PR you can already choose your distribution at will, right?

@BenFradet (Contributor, Author):

Yup, you can choose your distribution if you specify your own download source.

However, we might want to support the use case of someone specifying only the Spark version and distribution. What do you think?

@nchammas (Owner):

Hmm, for now let's leave it like this. I have some vague concerns about "officially" supporting other distributions, in case they have annoying problems that we would have to work around. With the download source option, people who really want a different distribution can get it, and we have a bit more of an excuse to deflect support if there are serious issues.

It's definitely something I am open to revisiting in the future, though.

@nchammas (Owner) commented:

Thank you for this PR @BenFradet. Looks good to me! I left some minor comments.

@BenFradet (Contributor, Author) commented:

Great, thanks for your review, will update accordingly.

@ereed-tesla commented Jun 28, 2016

This is slick -- S3 support via the Spark hadoop-2.4 binary is pretty convenient. Is there anything remaining to get this merged in?

I confirmed this PR works by merging it into master and doing the following:

download-source: "http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.4.tgz"
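
For reference, that option lives under the Spark service in the cluster config file, next to the version; a sketch of the layout (keys as used above):

    services:
      spark:
        version: 1.6.2
        download-source: "http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.4.tgz"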

$ flintrock launch erik-flintrock --ec2-instance-type m3.medium
Launching 2 instances...
[52.40.196.100] SSH online.
[52.40.196.100] Configuring ephemeral storage...
[52.40.196.100] Installing HDFS...
[52.39.213.68] SSH online.
[52.39.213.68] Configuring ephemeral storage...
[52.39.213.68] Installing HDFS...
[52.39.213.68] Installing Spark...
[52.40.196.100] Installing Spark...
[52.39.213.68] Configuring HDFS master...
[52.39.213.68] Configuring Spark master...
HDFS online.
Spark Health Report:
  * Master: ALIVE
  * Workers: 1
  * Cores: 1
  * Memory: 2.7 GB            
launch finished in 0:04:00.

Then I successfully fetched some data from S3:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Python version 3.5.1 (default, Dec  7 2015 11:16:01)
SparkContext available as sc, HiveContext available as sqlContext.   

In [1]: df = sqlContext.parquetFile('s3n://XYZ.gz.parquet')

In [2]: df.count()
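
For anyone reproducing this: reading from s3n:// requires AWS credentials to be visible to Hadoop. One way to supply them inside the pyspark shell shown above (where sc and sqlContext are predefined) is via the Hadoop configuration; the keys below are the standard s3n ones, and the values are placeholders:

    # Placeholders, not real keys; credentials can also come from
    # environment variables or an IAM instance role.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set('fs.s3n.awsAccessKeyId', '<access-key>')
    hadoop_conf.set('fs.s3n.awsSecretAccessKey', '<secret-key>')

    df = sqlContext.parquetFile('s3n://XYZ.gz.parquet')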

@nchammas (Owner) commented:

Took a second look at this. Looks good to me.

And thanks @ereed-tesla for testing it out. That speeds things up for me, since I can skip testing it myself if I am already comfortable with the PR.

Merging this in.

@nchammas nchammas merged commit b7380ae into nchammas:master Jun 29, 2016
exLittlePond pushed a commit to devsisters/flintrock that referenced this pull request Jul 22, 2016
* master:
  0.6.0 dev begins
  add some minor steps
  update standalone version in example
  this is 0.5.0
  upgrade dependencies (nchammas#128)
  use latest Amazon Linux AMI
  rephrase note about future Windows support
  remove note about squashing PR commits
  up default Spark version to 1.6.2
  add CHANGES for spark download source and additional security groups
  rename some internals related to security groups
  Resolve nchammas#72 add --ec2-security-group flag support (nchammas#112)
  added HADOOP_LIBEXEC_DIR env var (nchammas#127)
  Add option to download Spark from a custom URL (nchammas#125)
  add custom Hadoop URL change; reformat Markdown links
Successfully merging this pull request may close these issues:

  * Add option to download Spark from a custom URL
  * Cannot install package pre-built for Hadoop 2.4