Conversation

@JoshRosen
Contributor

This patch upgrades spark-ec2's Boto version to 2.34.0, since this is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.

Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPI if it's not present in SPARK_EC2_DIR/lib, similar to what we do in the sbt/sbt script. This shouldn't be an issue for users, since they already need an internet connection to launch an EC2 cluster. By performing the download in spark_ec2.py instead of the Bash script, this should also work for Windows users.

I've tested this with Python 2.6, too.
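
For reference, here is a rough Python 2 sketch of the kind of download-and-unpack logic described above. It is not the actual spark_ec2.py code; the PyPI URL and the setup_boto() helper name are assumptions.

# Rough sketch (not the actual spark_ec2.py code) of downloading Boto from
# PyPI into SPARK_EC2_DIR/lib when it isn't already there.
import os
import sys
import tarfile
import urllib2

SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))
LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib")
BOTO_VERSION = "boto-2.34.0"
BOTO_DIR = os.path.join(LIB_DIR, BOTO_VERSION)
# Assumed PyPI source-tarball location; the real script may use another URL.
BOTO_URL = "https://pypi.python.org/packages/source/b/boto/%s.tar.gz" % BOTO_VERSION


def setup_boto():
    """Download and unpack Boto if it isn't already in SPARK_EC2_DIR/lib."""
    if not os.path.isdir(BOTO_DIR):
        if not os.path.isdir(LIB_DIR):
            os.makedirs(LIB_DIR)
        tarball_path = os.path.join(LIB_DIR, BOTO_VERSION + ".tar.gz")
        print "Downloading Boto from PyPI..."
        response = urllib2.urlopen(BOTO_URL)
        with open(tarball_path, "wb") as f:
            f.write(response.read())
        tar = tarfile.open(tarball_path)
        tar.extractall(path=LIB_DIR)
        tar.close()
        os.remove(tarball_path)
    # Make the unpacked sources win over any system-installed boto.
    sys.path.insert(0, BOTO_DIR)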

@JoshRosen
Contributor Author

/cc @shivaram @pwendell @nchammas

@nchammas
Contributor

Patch looks good to me, but I'll try to test it out later this week.

@JoshRosen
Contributor Author

I think that the main risk of this patch is that boto has deprecated / removed / changed functionality that we rely on, but that we won't notice this until users run commands that exercise those branches (yay dynamic languages!).

I was able to launch a basic spot cluster, so I think that we should be in pretty good shape. I could try running a code coverage tool on this while I launch the cluster to see if there are any branches that I've missed, then just look through the Boto docs to see whether that functionality's still present.

OTOH, if Boto is good about maintaining backwards compatibility, then I guess it's probably fine to merge this and wait to see if we hit problems.

@nchammas
Contributor

If you run Python with the -Wdefault flag it should enable the display of deprecation warnings. They're suppressed by default. I remember catching one such warning with the version of boto we're currently on during a regular cluster launch.

Using a code coverage tool to make sure we don't miss any branches sounds like a good idea. What tool would you use to do that for Python?

@JoshRosen
Contributor Author

Ah, enabling warnings is a good idea. I could just add that flag to the spark-ec2 Bash script.

For Python, I usually use coverage: https://pypi.python.org/pypi/coverage/3.7.1

@nchammas
Contributor

That sounds like a good idea. That way if we change something in the future to rely on a deprecated feature, we'll immediately notice during testing.
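
For illustration, the same effect can also be had from inside Python via the standard warnings module; a minimal sketch, not part of this patch:

# Minimal illustration (not part of this patch) of surfacing deprecation
# warnings, which Python suppresses by default.
import warnings

# Roughly what `python -Wdefault` or PYTHONWARNINGS=default does: show each
# distinct warning once per location.
warnings.simplefilter("default")

# During testing we could go further and turn deprecations into hard errors
# so they can't be missed.
warnings.simplefilter("error", DeprecationWarning)
warnings.simplefilter("error", PendingDeprecationWarning)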

@SparkQA

SparkQA commented Dec 18, 2014

Test build #24602 has finished for PR 3737 at commit 587ae89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py:583: PendingDeprecationWarning: The current get_all_instances implementation will be replaced with get_all_reservations.

Quoting from the Boto docs:

    def get_all_instances(self, instance_ids=None, filters=None, dry_run=False,
                          max_results=None):
        """
        Retrieve all the instance reservations associated with your account.

        .. note::
            This method's current behavior is deprecated in favor of
            :meth:`get_all_reservations`.  A future major release will change
            :meth:`get_all_instances` to return a list of
            :class:`boto.ec2.instance.Instance` objects as its name suggests.
            To obtain that behavior today, use :meth:`get_only_instances`.

@JoshRosen
Contributor Author

The deprecation warning idea was good; it turns out that there's a pending deprecation which will change the semantics of one of the methods we use, so I upgraded the code according to the documentation's suggestion.
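
A rough sketch of the shape of that change, assuming conn is a boto EC2 connection; the helper name and filter below are illustrative, not the exact spark_ec2.py diff:

# Illustrative only: switch from boto's pending-deprecated get_all_instances()
# to get_all_reservations(), which returns the same thing today, and flatten
# the reservations into instances explicitly.
def get_instances(conn, group_names):
    reservations = conn.get_all_reservations(
        filters={"instance.group-name": group_names})
    instances = []
    for res in reservations:
        instances.extend(res.instances)
    return instances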

For obtaining code coverage metrics, I applied the following change to the Bash script:

diff --git a/ec2/spark-ec2 b/ec2/spark-ec2
index 3abd3f3..4e3fc69 100755
--- a/ec2/spark-ec2
+++ b/ec2/spark-ec2
@@ -22,4 +22,6 @@
 #+ the underlying Python script.
 SPARK_EC2_DIR="$(dirname $0)"

-python -Wdefault "${SPARK_EC2_DIR}/spark_ec2.py" "$@"
+
+export PYTHONWARNINGS="default"
+coverage run -a "${SPARK_EC2_DIR}/spark_ec2.py" "$@"

The -a option tells coverage to accumulate information across multiple runs. I performed an iterative process where I interactively ran spark-ec2, used coverage html to generate a report, then went back and ran more commands to exercise the code areas that I missed.

With the workloads that I ran (launching spot clusters, stopping and starting a cluster, destroying / creating security groups, logging in, canceling spot instance requests), I got to 80% line coverage; most of the lines that I missed were error-handling code.

Here's a link to an ASCII coverage report, produced with coverage annotate spark_ec2.py: https://gist.github.com/JoshRosen/c09a742805bae3503185

According to the docs:

Usage: coverage annotate [options] [modules]

Make annotated copies of the given files, marking statements that are executed
with > and statements that are missed with !.

As you can see, the coverage is pretty good. Therefore, I'd be comfortable merging this PR now.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24630 has finished for PR 3737 at commit f02935d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor

Nice work, Josh.

I took a quick look at the coverage report. It looks like most of it is covered. If we want to be extra thorough, I think there are a few more things relevant to boto that were not hit. (Or perhaps you hit them but they were not captured in this particular report file.)

  • L353 - launching into a VPC
  • L430, L439 - stuff related to EBS volumes and block mapping
  • L507 - regular launch
  • L531 - resume interrupted launch

By the way, I noticed some unused (and unreachable) code at L653. We can probably delete it.

@JoshRosen
Contributor Author

I think a few of the skipped lines themselves are okay, since L354 is similar to some calls right below it that were actually run. On the other hand, the fact that we didn't run the VPC code path means that we didn't end up calling run with the VPC arguments, so it's still possible that those could error out.

This is one of those cases where line coverage is a useful minimum standard but not the be-all and end-all of coverage metrics, since it doesn't capture the space of different configurations / arguments that we call Boto with.

Since I've got to launch a cluster anyways, I'll try spinning up one more in a VPC just to be safe.
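
For reference, a hedged sketch of the kind of VPC-specific arguments that launch path hands to boto 2. The AMI, zone, and subnet come from the log and command below; the security group ID is a placeholder, and this is not the exact spark_ec2.py call.

# Illustrative sketch, not the exact spark_ec2.py code path.
import boto.ec2

conn = boto.ec2.connect_to_region("us-west-2")
reservation = conn.run_instances(
    image_id="ami-ae6e0d9e",          # Spark AMI from the launch log
    min_count=2,
    max_count=2,
    key_name="joshrosen",
    instance_type="m3.xlarge",
    placement="us-west-2a",
    # In a VPC, security groups are passed by ID rather than by name, and the
    # target subnet must be given explicitly.
    subnet_id="subnet-ebcb768e",
    security_group_ids=["sg-00000000"],  # placeholder ID
)
print "Launched %d instances" % len(reservation.instances)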

Good call on the setup_standalone_cluster code, by the way; it looks unused, so I'll remove it.

@JoshRosen
Contributor Author

Hmm, I tried running

./spark-ec2 \
  -t m3.xlarge \
  -s 2 \
  -k joshrosen \
  -i /Users/joshrosen/.ssh/joshrosen.pem \
  --ebs-vol-size 10 \
  --ebs-vol-num 2 \
  -r us-west-2 \
  --zone us-west-2a \
  --spark-version 1.1.0 \
  --swap 2048 \
  --vpc-id vpc-0778a362 \
  --subnet-id subnet-ebcb768e \
  launch josh-benchmarking3

Looks like it hit some sort of race condition:

Setting up security groups...
Creating security group josh-benchmarking3-master
Creating security group josh-benchmarking3-slaves
Searching for existing cluster josh-benchmarking3...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 2 slaves in us-west-2a, regid = r-1f9f8914
Launched master in us-west-2a, regid = r-d49187df
Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-92214398' does not exist</Message></Error></Errors><RequestID>22b657b9-f270-4795-a268-7a6bb3453947</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 1173, in <module>
    main()
  File "./spark_ec2.py", line 1165, in main
    real_main()
  File "./spark_ec2.py", line 1019, in real_main
    cluster_state='ssh-ready'
  File "./spark_ec2.py", line 714, in wait_for_cluster_state
    i.update()
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/instance.py", line 413, in update
    rs = self.connection.get_all_reservations([self.id], dry_run=dry_run)
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 682, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1182, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-92214398' does not exist</Message></Error></Errors><RequestID>22b657b9-f270-4795-a268-7a6bb3453947</RequestID></Response>

I've tried passing the --resume flag and the launch seems to be proceeding, so maybe this was just a transient error that's not related to this patch. I don't know for sure, though. Have you seen this one before?

@nchammas
Contributor

Yes, I've been getting this type of error with 1.1.1. I haven't had time to look into it, but I suspect there is some subtle thing about the EC2 API that we are not honoring which occasionally leads to this problem. It could also just be general AWS flakiness (e.g. due to metadata like available instances needing to be replicated and whatnot) that we need to account for.

I don't think this is related to the changes in this PR.

@nchammas
Contributor

I'm curious: After calling --resume, did the instances in the EC2 web console get tagged with friendly names? That's one thing I noticed the last time I got this error. Things broke before spark-ec2 could tag the instances with names.

@JoshRosen
Contributor Author

Not sure, since I accidentally launched my instances on a private subnet of my VPC, so spark-ec2 was unable to SSH into them and I had to start over. Trying again on a public subnet.

@industrial-sloth
Contributor

Interesting - I'm just now hitting the same error: <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-471a82b9' does not exist</Message></Error></Errors> after destroying one cluster and quickly launching another - although to be fair it is not obvious to me whether the rapid destroy / launch combo is actually related to this issue. I've quickly relaunched clusters a few times previously but this is my first time seeing this particular error.

@nchammas, you are correct that in my case the slaves aren't coming up with tags after --resume - the master however does have a nice name assigned to it.

@JoshRosen
Contributor Author

Hmm, so I guess I don't know how to properly configure AWS VPCs, since I've been having trouble even SSHing manually into EC2 instances launched in my VPC. Maybe I can defer VPC documentation improvements / instructions to a separate PR.

Trying to launch a non-spot cluster now as one final test.

@JoshRosen
Contributor Author

although to be fair it is not obvious to me whether the rapid destroy / launch combo is actually related to this issue.

I don't think it is, since I saw it after creating fresh security groups.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24652 has finished for PR 3737 at commit 0aa43cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Since this doesn't appear to have introduced any new issues, I'm going to merge this into master in order to resolve the known issue with launching spot clusters. Let's open a separate PR to handle testing / documentation of the VPC feature.

@asfgit closed this in c28083f Dec 20, 2014
@nchammas
Contributor

I suspect the "instance ID 'i-471a82b9' does not exist" errors stem from tagging the instances in a separate call from the one that launches them. The time between launch and tagging is so small that we occasionally hit AWS metadata replication slowness, which manifests as "instance does not exist".

The fix would be to tag the instances in the same call that launches them, if possible; failing that, we could retry the tagging after a very short delay (e.g. 0.3s) if the initial attempt fails.
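
A minimal sketch of that retry idea (illustrative only, not committed code):

# Illustrative retry sketch: a freshly launched instance may not yet be
# visible to the tagging API, so retry briefly on InvalidInstanceID.NotFound.
import time
from boto.exception import EC2ResponseError


def tag_with_retry(instance, name, attempts=5, delay=0.3):
    for attempt in range(attempts):
        try:
            instance.add_tag(key="Name", value=name)
            return
        except EC2ResponseError as e:
            if e.error_code == "InvalidInstanceID.NotFound" and attempt < attempts - 1:
                time.sleep(delay)
            else:
                raise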

asfgit pushed a commit that referenced this pull request Dec 23, 2014
PR #3737 changed `spark-ec2` to automatically download boto from PyPI. This PR tells git to ignore those downloaded library files.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3770 from nchammas/ignore-ec2-lib and squashes the following commits:

5c440d3 [Nicholas Chammas] gitignore downloaded EC2 libs
@nchammas
Contributor

@JoshRosen There are a couple of things that we may want to change about this new behavior in the future. The first is that --help now requires a download to work, which may surprise users. The second (an unlikely issue, but worth noting) is that people running spark-ec2 off a read-only mount won't be able to do so anymore.

@JoshRosen
Contributor Author

@nchammas Thanks for raising those concerns. The --help issue might not be too hard to fix (we may be able to do some lazy-loading of boto). For read-only mounts, I don't see a great solution: I don't want to continue bundling a zip file in the Spark source, since the boto download is huge (even after compression). Maybe we could package it when making binary distributions, though.
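
For illustration, one possible shape of the lazy-loading idea; parse_args() and setup_boto() below are hypothetical stand-ins rather than the script's actual structure:

# Hypothetical sketch: handle argument parsing (and therefore --help) before
# boto is downloaded or imported, and only fetch it when a command actually
# needs an EC2 connection.
def real_main():
    opts, action, cluster_name = parse_args()  # --help prints usage and exits here
    setup_boto()                               # download into lib/ only if missing
    import boto.ec2                            # deferred until after parsing
    conn = boto.ec2.connect_to_region(opts.region)
    return conn, action, cluster_name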

@piskvorky

@JoshRosen how exactly does boto get distributed to the cluster EC2 machines?

My app is failing and it seems to be connected to the fact that all EC2 nodes have boto 2.8.0 (as opposed to the driver machine, which has correctly downloaded and used 2.34.0).

@JoshRosen
Contributor Author

@piskvorky, the boto on the EC2 instances themselves should be provided by the AMI, AFAIK.

@JoshRosen deleted the update-boto branch September 17, 2015 21:32
@piskvorky

Got it, thanks @JoshRosen!

By the way, would you be interested in a spark_ec2.py patch that exposes its functionality programmatically as well?

Right now, it's a bit hard to use as part of a larger pipeline, with all the sys.exits and prints and raw_inputs sprinkled inside.
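
For example, a hypothetical sketch of what a more library-friendly entry point could look like; the names are illustrative, not the script's current structure:

# Hypothetical sketch: raise exceptions and return values instead of calling
# sys.exit(), print, and raw_input(), so other tools can drive spark-ec2
# programmatically.
class UsageError(Exception):
    pass


def launch_cluster(conn, opts, cluster_name):
    """Launch a cluster and return (master_nodes, slave_nodes)."""
    if getattr(opts, "key_pair", None) is None:
        raise UsageError("A key pair is required (-k / --key-pair).")
    # ... launch logic elided in this sketch ...
    return [], []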

@piskvorky

(sorry to hijack this issue, will open a new one if you think the patch is worth it)
