Merge branch 'master' into add-slave
* master:
  0.6.0 dev begins
  add some minor steps
  update standalone version in example
  this is 0.5.0
  upgrade dependencies (nchammas#128)
  use latest Amazon Linux AMI
  rephrase note about future Windows support
  remove note about squashing PR commits
  up default Spark version to 1.6.2
  add CHANGES for spark download source and additional security groups
  rename some internals related to security groups
  Resolve nchammas#72 add --ec2-security-group flag support (nchammas#112)
  added HADOOP_LIBEXEC_DIR env var (nchammas#127)
  Add option to download Spark from a custom URL (nchammas#125)
  add custom Hadoop URL change; reformat Markdown links
Soyeon Baek committed Jul 22, 2016
2 parents 648e2cf + 1e15d1a commit a311203
Showing 15 changed files with 205 additions and 65 deletions.
125 changes: 102 additions & 23 deletions CHANGES.md
@@ -1,52 +1,131 @@
# Change Log

## [Unreleased]

## [Unreleased](https://github.com/nchammas/flintrock/compare/v0.4.0...master)
Nothing notable yet.

[Unreleased]: https://github.com/nchammas/flintrock/compare/v0.5.0...master

## [0.5.0] - 2016-07-20

[0.5.0]: https://github.com/nchammas/flintrock/compare/v0.4.0...v0.5.0

### Added

* [#118]: You can now specify `--hdfs-download-source` (or the
equivalent in your config file) to tell Flintrock to download Hadoop
from a specific URL when launching your cluster.
* [#125]: You can now specify `--spark-download-source` (or the
equivalent in your config file) to tell Flintrock to download Spark
from a specific URL when launching your cluster.
* [#112]: You can now specify `--ec2-security-group` to associate
additional security groups with your cluster on launch.

[#118]: https://github.com/nchammas/flintrock/pull/118
[#125]: https://github.com/nchammas/flintrock/pull/125
[#112]: https://github.com/nchammas/flintrock/pull/112
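
For illustration, a launch that combines the three new options might look
like the sketch below. The download URLs and security group names are
placeholders (each URL must contain a `{v}` version template, as the updated
`config.yaml.template` later in this diff notes), and the remaining flags
follow the README example.

```sh
# Hypothetical values throughout; substitute your own URLs, group names,
# key name, and identity file.
flintrock launch test-cluster \
    --num-slaves 1 \
    --spark-version 1.6.2 \
    --spark-download-source "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz" \
    --hdfs-download-source "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz" \
    --ec2-security-group group-name1 \
    --ec2-security-group group-name2 \
    --ec2-key-name key_name \
    --ec2-identity-file /path/to/key.pem
```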

### Changed

* [#103](https://github.com/nchammas/flintrock/pull/103): Flintrock now opens port 7077 so local
clients like Apache Zeppelin can connect directly to the Spark master on the cluster.
* [#103], [#114]: Flintrock now opens ports 6066 and 7077 so local
clients like Apache Zeppelin can connect directly to the Spark
master on the cluster.
* [#122]: Flintrock now automatically adds executables like
`spark-submit`, `pyspark`, and `hdfs` to the default `PATH`, so
they're available to call right when you log in to the cluster.

[#103]: https://github.com/nchammas/flintrock/pull/103
[#114]: https://github.com/nchammas/flintrock/pull/114
[#122]: https://github.com/nchammas/flintrock/pull/122
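
To see what these changes mean in practice, here is a hedged sketch. The
master hostname is a placeholder for your cluster's real public DNS name,
and `flintrock login` is mentioned only as one way to get a shell on the
master.

```sh
# From your local machine: talk to the Spark standalone master directly
# over the newly opened port 7077 (6066 is the REST submission port).
spark-shell --master spark://ec2-203-0-113-10.compute-1.amazonaws.com:7077

# On the cluster itself (for example after `flintrock login test-cluster`),
# the executables added to the default PATH are immediately available:
spark-submit --version
hdfs version
```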

## [0.4.0](https://github.com/nchammas/flintrock/compare/v0.3.0...v0.4.0) - 2016-03-27
## [0.4.0] - 2016-03-27

[0.4.0]: https://github.com/nchammas/flintrock/compare/v0.3.0...v0.4.0

### Added

* [#98](https://github.com/nchammas/flintrock/pull/98), [#99](https://github.com/nchammas/flintrock/pull/99): You can now specify `latest` for `--spark-git-commit` and Flintrock will automatically build Spark on your cluster at the latest commit. This feature is only available for Spark repos hosted on GitHub.
* [#94](https://github.com/nchammas/flintrock/pull/94): Flintrock now supports launching clusters into non-default VPCs.
* [#98], [#99]: You can now specify `latest` for `--spark-git-commit`
and Flintrock will automatically build Spark on your cluster at the
latest commit. This feature is only available for Spark repos
hosted on GitHub.
* [#94]: Flintrock now supports launching clusters into non-default
VPCs.

[#94]: https://github.com/nchammas/flintrock/pull/94
[#98]: https://github.com/nchammas/flintrock/pull/98
[#99]: https://github.com/nchammas/flintrock/pull/99
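
As an example, a launch that exercises both additions might look like this;
the VPC and subnet IDs are placeholders, and `latest` only works for Spark
repositories hosted on GitHub.

```sh
# Hypothetical IDs; substitute your own VPC, subnet, key name, and key file.
flintrock launch dev-cluster \
    --num-slaves 1 \
    --spark-git-commit latest \
    --ec2-vpc-id vpc-1a2b3c4d \
    --ec2-subnet-id subnet-1a2b3c4d \
    --ec2-key-name key_name \
    --ec2-identity-file /path/to/key.pem
```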

### Changed

* [#86](https://github.com/nchammas/flintrock/pull/86): Flintrock now correctly catches when spot requests fail and bubbles up an appropriate error message.
* [#93](https://github.com/nchammas/flintrock/pull/93), [#97](https://github.com/nchammas/flintrock/pull/97): Fixed the ability to build Spark from git. (It was broken for recent commits.)
* [#96](https://github.com/nchammas/flintrock/pull/96), [#100](https://github.com/nchammas/flintrock/pull/100): Flintrock launches should now work correctly whether the default Python on the cluster is Python 2.7 or Python 3.4+.
* [#86]: Flintrock now correctly catches when spot requests fail and
bubbles up an appropriate error message.
* [#93], [#97]: Fixed the ability to build Spark from git. (It was
broken for recent commits.)
* [#96], [#100]: Flintrock launches should now work correctly whether
the default Python on the cluster is Python 2.7 or Python 3.4+.

[#86]: https://github.com/nchammas/flintrock/pull/86
[#93]: https://github.com/nchammas/flintrock/pull/93
[#96]: https://github.com/nchammas/flintrock/pull/96
[#97]: https://github.com/nchammas/flintrock/pull/97
[#100]: https://github.com/nchammas/flintrock/pull/100

## [0.3.0](https://github.com/nchammas/flintrock/compare/v0.2.0...v0.3.0) - 2016-02-14
## [0.3.0] - 2016-02-14

[0.3.0]: https://github.com/nchammas/flintrock/compare/v0.2.0...v0.3.0

### Changed

* [`eca59fc`](https://github.com/nchammas/flintrock/commit/eca59fc0052874d9aa48b7d4d7d79192b5e609d1), [`3cf6ee6`](https://github.com/nchammas/flintrock/commit/3cf6ee64162ceaac6429d79c3bc6ef25988eaa8e): Tweaked a few things so that Flintrock can launch 200+ node clusters without hitting certain limits.
* [`eca59fc`], [`3cf6ee6`]: Tweaked a few things so that Flintrock
can launch 200+ node clusters without hitting certain limits.

[`eca59fc`]: https://github.com/nchammas/flintrock/commit/eca59fc0052874d9aa48b7d4d7d79192b5e609d1
[`3cf6ee6`]: https://github.com/nchammas/flintrock/commit/3cf6ee64162ceaac6429d79c3bc6ef25988eaa8e

## [0.2.0](https://github.com/nchammas/flintrock/compare/v0.1.0...v0.2.0) - 2016-02-07
## [0.2.0] - 2016-02-07

### Added
[0.2.0]: https://github.com/nchammas/flintrock/compare/v0.1.0...v0.2.0

* [`b00fd12`](https://github.com/nchammas/flintrock/commit/b00fd128f36e0a05dafca69b26c4d1b190fa42c9): Added `--assume-yes` option to the `launch` command. Use `--assume-yes` to tell Flintrock to automatically destroy the cluster if there are problems during launch.
### Added

### Changed
* [`b00fd12`]: Added `--assume-yes` option to the `launch` command.
Use `--assume-yes` to tell Flintrock to automatically destroy the
cluster if there are problems during launch.

* [#69](https://github.com/nchammas/flintrock/pull/69): Automatically retry Hadoop download from flaky Apache mirrors.
* [`0df7004`](https://github.com/nchammas/flintrock/commit/0df70043f3da215fe699165bc961bd0c4ba4ea88): Delete unneeded security group after a cluster is destroyed.
* [`244f734`](https://github.com/nchammas/flintrock/commit/244f7345696d1b8cec1d1b575a304b9bd9a77840): Default HDFS not to install. Going forward, Spark will be the only service that Flintrock installs by default. Defaults can easily be changed via Flintrock's config file.
* [`de33412`](https://github.com/nchammas/flintrock/commit/de3341221ca8d57f5a465b13f07c8e266ae11a59): Flintrock installs services, not modules. The terminology has been updated accordingly throughout the code and docs. Update your config file to use `services` instead of `modules`. **Warning**: Flintrock will have problems managing existing clusters that were launched with versions of Flintrock from before this change.
* [#73](https://github.com/nchammas/flintrock/pull/73): Major refactoring of Flintrock internals.
* [#74](https://github.com/nchammas/flintrock/pull/74): Flintrock now catches common configuration problems upfront and provides simple error messages, instead of barfing out errors from EC2 or launching broken clusters.
* [`bf766ba`](https://github.com/nchammas/flintrock/commit/bf766ba48f12a8752c2e32f9b3daf29501c30866): Fixed a bug in how Flintrock polls SSH availability from Linux. Cluster launches now work from Linux as intended.
[`b00fd12`]: https://github.com/nchammas/flintrock/commit/b00fd128f36e0a05dafca69b26c4d1b190fa42c9
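
A minimal sketch of the new option in use (all other launch settings are
placeholders taken from the README example):

```sh
# With --assume-yes, Flintrock destroys the cluster automatically if the
# launch hits a problem, instead of prompting you about it.
flintrock launch test-cluster \
    --assume-yes \
    --num-slaves 1 \
    --ec2-key-name key_name \
    --ec2-identity-file /path/to/key.pem
```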

### Changed

## [0.1.0](https://github.com/nchammas/flintrock/releases/tag/v0.1.0) - 2015-12-11
* [#69]: Automatically retry Hadoop download from flaky Apache
mirrors.
* [`0df7004`]: Delete unneeded security group after a cluster is
destroyed.
* [`244f734`]: HDFS is no longer installed by default. Going forward,
Spark will be the only service that Flintrock installs by default.
Defaults can easily be changed via Flintrock's config file.
* [`de33412`]: Flintrock installs services, not modules. The
terminology has been updated accordingly throughout the code and
docs. Update your config file to use `services` instead of
`modules`. **Warning**: Flintrock will have problems managing
existing clusters that were launched with versions of Flintrock from
before this change.
* [#73]: Major refactoring of Flintrock internals.
* [#74]: Flintrock now catches common configuration problems upfront
and provides simple error messages, instead of barfing out errors
from EC2 or launching broken clusters.
* [`bf766ba`]: Fixed a bug in how Flintrock polls SSH availability
from Linux. Cluster launches now work from Linux as intended.

[#69]: https://github.com/nchammas/flintrock/pull/69
[`0df7004`]: https://github.com/nchammas/flintrock/commit/0df70043f3da215fe699165bc961bd0c4ba4ea88
[`244f734`]: https://github.com/nchammas/flintrock/commit/244f7345696d1b8cec1d1b575a304b9bd9a77840
[`de33412`]: https://github.com/nchammas/flintrock/commit/de3341221ca8d57f5a465b13f07c8e266ae11a59
[#73]: https://github.com/nchammas/flintrock/pull/73
[#74]: https://github.com/nchammas/flintrock/pull/74
[`bf766ba`]: https://github.com/nchammas/flintrock/commit/bf766ba48f12a8752c2e32f9b3daf29501c30866
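
Since HDFS no longer installs by default, you now have to ask for it
explicitly. The sketch below assumes the `--install-hdfs` and
`--hdfs-version` flags (mirroring the `--install-spark` option seen later
in this diff); the config file is the other place to change the default.

```sh
# Flag names assumed here; adjust to the options your Flintrock version accepts.
flintrock launch test-cluster \
    --install-hdfs \
    --hdfs-version 2.7.2 \
    --num-slaves 1 \
    --ec2-key-name key_name \
    --ec2-identity-file /path/to/key.pem
```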

## [0.1.0] - 2015-12-11

[0.1.0]: https://github.com/nchammas/flintrock/releases/tag/v0.1.0

* Initial release.
2 changes: 0 additions & 2 deletions CONTRIBUTING.md
@@ -98,8 +98,6 @@ When building something new, don't just consider the value it will provide. Cons

Make sure each pull request you submit captures a single coherent idea. This limits the scope of any given pull request and makes it much easier for a reviewer to understand what you are doing and give precise feedback. Don't mix logically independent changes in the same request if they can be submitted separately.

After you and the reviewers agree that a pull request is ready to be accepted, you will be asked to squash your commits into one before your change is merged in. This helps us ensure that every commit in Flintrock's history represents a working state, and makes changes easier to browse through and understand.

#### Expect many revisions

If you are adding or touching lots of code, then be prepared to go through many rounds of revisions before your pull request is accepted. This is normal, especially as you are still getting acquainted with the project's standards and style.
18 changes: 13 additions & 5 deletions README.md
@@ -18,7 +18,7 @@ Here's a quick way to launch a cluster on EC2, assuming you already have an [AWS
```sh
flintrock launch test-cluster \
--num-slaves 1 \
--spark-version 1.6.1 \
--spark-version 1.6.2 \
--ec2-key-name key_name \
--ec2-identity-file /path/to/key.pem \
--ec2-ami ami-08111162 \
@@ -58,9 +58,17 @@ That's not all. Flintrock has a few more [features](#features) that you may find

## Installation

Before using Flintrock, take a quick look at the [copyright](https://github.com/nchammas/flintrock/blob/master/COPYRIGHT) notice and [license](https://github.com/nchammas/flintrock/blob/master/LICENSE) and make sure you're OK with their terms.
Before using Flintrock, take a quick look at the
[copyright](https://github.com/nchammas/flintrock/blob/master/COPYRIGHT)
notice and [license](https://github.com/nchammas/flintrock/blob/master/LICENSE)
and make sure you're OK with their terms.

**Flintrock requires Python 3.4 or newer**, unless you are using one of our **standalone packages**. Flintrock has been thoroughly tested only on OS X, but it should run on all POSIX systems. We have plans to [add Windows support](https://github.com/nchammas/flintrock/issues/46) in the future, too.
**Flintrock requires Python 3.4 or newer**, unless you are using one
of our **standalone packages**. Flintrock has been thoroughly tested
only on OS X, but it should run on all POSIX systems.
A motivated contributor should be able to add
[Windows support](https://github.com/nchammas/flintrock/issues/46)
without too much trouble, too.

### Release version

@@ -91,7 +99,7 @@ unzip it to a location of your choice, and run the `flintrock` executable inside
For example:

```sh
flintrock_version="0.4.0"
flintrock_version="0.5.0"

curl --location --remote-name "https://github.com/nchammas/flintrock/releases/download/v$flintrock_version/Flintrock-$flintrock_version-standalone-OSX-x86_64.zip"
unzip -q -d flintrock "Flintrock-$flintrock_version-standalone-OSX-x86_64.zip"
@@ -186,7 +194,7 @@ provider: ec2

services:
spark:
version: 1.6.1
version: 1.6.2

launch:
num-slaves: 1
2 changes: 1 addition & 1 deletion flintrock/__init__.py
@@ -1,2 +1,2 @@
# See: https://packaging.python.org/en/latest/distributing/#standards-compliance-for-interoperability
__version__ = '0.5.0.dev0'
__version__ = '0.6.0.dev0'
15 changes: 12 additions & 3 deletions flintrock/config.yaml.template
@@ -1,12 +1,18 @@
services:
spark:
version: 1.6.1
version: 1.6.2
# git-commit: latest # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
# git-repository: # optional; defaults to https://github.com/apache/spark
# optional; defaults to download from the official Spark S3 bucket
# - must contain a {v} template corresponding to the version
# - Spark must be pre-built
# - must be a tar.gz file
# download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
hdfs:
version: 2.7.2
# optional; defaults to download from a dynamically selected Apache mirror
# must contain a {v} template corresponding to the version; must be a .tar.gz file
# - must contain a {v} template corresponding to the version
# - must be a .tar.gz file
# download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"

provider: ec2
@@ -18,14 +24,17 @@ providers:
instance-type: m3.medium
region: us-east-1
# availability-zone: <name>
ami: ami-08111162 # Amazon Linux, us-east-1
ami: ami-6869aa05 # Amazon Linux, us-east-1
user: ec2-user
# ami: ami-61bbf104 # CentOS 7, us-east-1
# user: centos
# spot-price: <price>
# vpc-id: <id>
# subnet-id: <id>
# placement-group: <name>
# security-groups:
# - group-name1
# - group-name2
tenancy: default # default | dedicated
ebs-optimized: no # yes | no
instance-initiated-shutdown-behavior: terminate # terminate | stop
40 changes: 36 additions & 4 deletions flintrock/ec2.py
@@ -317,7 +317,33 @@ def check_network_config(*, region_name: str, vpc_id: str, subnet_id: str):
)


def get_or_create_ec2_security_groups(
def get_security_groups(
*,
vpc_id,
region,
security_group_names) -> "List[boto3.resource('ec2').SecurityGroup]":
ec2 = boto3.resource(service_name='ec2', region_name=region)

groups = list(
ec2.security_groups.filter(
Filters=[
{'Name': 'group-name', 'Values': security_group_names},
{'Name': 'vpc-id', 'Values': [vpc_id]},
]))

found_group_names = [group.group_name for group in groups]
missing_group_names = set(security_group_names) - set(found_group_names)
if missing_group_names:
raise Error(
"Could not find the following security group{s}: {groups}"
.format(
s='' if len(missing_group_names) == 1 else 's',
groups=', '.join(list(missing_group_names))))

return groups


def get_or_create_flintrock_security_groups(
*,
cluster_name,
vpc_id,
@@ -511,6 +537,7 @@ def launch(
availability_zone,
ami,
user,
security_groups,
spot_price=None,
vpc_id,
subnet_id,
@@ -547,10 +574,15 @@
v=vpc_id))

try:
security_groups = get_or_create_ec2_security_groups(
flintrock_security_groups = get_or_create_flintrock_security_groups(
cluster_name=cluster_name,
vpc_id=vpc_id,
region=region)
user_security_groups = get_security_groups(
vpc_id=vpc_id,
region=region,
security_group_names=security_groups)
security_group_ids = [sg.id for sg in user_security_groups + flintrock_security_groups]
block_device_mappings = get_ec2_block_device_mappings(
ami=ami,
region=region)
@@ -585,7 +617,7 @@
'Placement': {
'AvailabilityZone': availability_zone,
'GroupName': placement_group},
'SecurityGroupIds': [sg.id for sg in security_groups],
'SecurityGroupIds': security_group_ids,
'SubnetId': subnet_id,
'IamInstanceProfile': {
'Name': instance_profile_name},
@@ -634,7 +666,7 @@
'AvailabilityZone': availability_zone,
'Tenancy': tenancy,
'GroupName': placement_group},
SecurityGroupIds=[sg.id for sg in security_groups],
SecurityGroupIds=security_group_ids,
SubnetId=subnet_id,
IamInstanceProfile={
'Name': instance_profile_name},
13 changes: 12 additions & 1 deletion flintrock/flintrock.py
@@ -186,6 +186,10 @@ def cli(cli_context, config, provider):
@click.option('--install-spark/--no-install-spark', default=True)
@click.option('--spark-version',
help="Spark release version to install.")
@click.option('--spark-download-source',
help="URL to download a release of Spark from.",
default='https://s3.amazonaws.com/spark-related-packages/spark-{v}-bin-hadoop2.6.tgz',
show_default=True)
@click.option('--spark-git-commit',
help="Git commit to build Spark from. "
"Set to 'latest' to build Spark from the latest commit on the "
@@ -206,6 +210,10 @@ def cli(cli_context, config, provider):
@click.option('--ec2-availability-zone', default='')
@click.option('--ec2-ami')
@click.option('--ec2-user')
@click.option('--ec2-security-group', 'ec2_security_groups',
multiple=True,
help="Additional security groups names to assign to the instances. "
"You can specify this option multiple times.")
@click.option('--ec2-spot-price', type=float)
@click.option('--ec2-vpc-id', default='', help="Leave empty for default VPC.")
@click.option('--ec2-subnet-id', default='')
@@ -227,6 +235,7 @@ def launch(
spark_version,
spark_git_commit,
spark_git_repository,
spark_download_source,
assume_yes,
ec2_key_name,
ec2_identity_file,
@@ -235,6 +244,7 @@
ec2_availability_zone,
ec2_ami,
ec2_user,
ec2_security_groups,
ec2_spot_price,
ec2_vpc_id,
ec2_subnet_id,
@@ -289,7 +299,7 @@ def launch(
services += [hdfs]
if install_spark:
if spark_version:
spark = Spark(version=spark_version)
spark = Spark(version=spark_version, download_source=spark_download_source)
elif spark_git_commit:
print(
"Warning: Building Spark takes a long time. "
@@ -315,6 +325,7 @@
availability_zone=ec2_availability_zone,
ami=ec2_ami,
user=ec2_user,
security_groups=ec2_security_groups,
spot_price=ec2_spot_price,
vpc_id=ec2_vpc_id,
subnet_id=ec2_subnet_id,