Conversation

@JoshRosen
Contributor

This patch upgrades spark-ec2's Boto version to 2.34.0, since this is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.

Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPI if it's not present in SPARK_EC2_DIR/lib, similar to what we do in the sbt/sbt script. This shouldn't be an issue for users, since they already need an internet connection to launch an EC2 cluster. By performing the download in spark_ec2.py instead of the Bash script, this should also work for Windows users.

I've tested this with Python 2.6, too.
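
For reference, here is a rough Python 2 sketch of the kind of download-and-unpack logic described above. It is not the actual spark_ec2.py code; the PyPI URL and the setup_boto() helper name are assumptions.

# Rough sketch (not the actual spark_ec2.py code) of downloading Boto from
# PyPI into SPARK_EC2_DIR/lib when it isn't already there.
import os
import sys
import tarfile
import urllib2

SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))
LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib")
BOTO_VERSION = "boto-2.34.0"
BOTO_DIR = os.path.join(LIB_DIR, BOTO_VERSION)
# Assumed PyPI source-tarball location; the real script may use another URL.
BOTO_URL = "https://pypi.python.org/packages/source/b/boto/%s.tar.gz" % BOTO_VERSION


def setup_boto():
    """Download and unpack Boto if it isn't already in SPARK_EC2_DIR/lib."""
    if not os.path.isdir(BOTO_DIR):
        if not os.path.isdir(LIB_DIR):
            os.makedirs(LIB_DIR)
        tarball_path = os.path.join(LIB_DIR, BOTO_VERSION + ".tar.gz")
        print "Downloading Boto from PyPI..."
        response = urllib2.urlopen(BOTO_URL)
        with open(tarball_path, "wb") as f:
            f.write(response.read())
        tar = tarfile.open(tarball_path)
        tar.extractall(path=LIB_DIR)
        tar.close()
        os.remove(tarball_path)
    # Make the unpacked sources win over any system-installed boto.
    sys.path.insert(0, BOTO_DIR)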

@JoshRosen
Contributor Author

/cc @shivaram @pwendell @nchammas

@nchammas
Contributor

Patch looks good to me, but I'll try to test it out later this week.

@JoshRosen
Contributor Author

I think that the main risk of this patch is that boto has deprecated / removed / changed functionality that we rely on, but that we won't notice this until users run commands that exercise those branches (yay dynamic languages!).

I was able to launch a basic spot cluster, so I think that we should be in pretty good shape. I could try running a code coverage tool on this while I launch the cluster to see if there are any branches that I've missed, then just look through the Boto docs to see whether that functionality's still present.

OTOH, if Boto is good about maintaining backwards compatibility, then I guess it's probably fine to merge this and wait to see if we hit problems.

@nchammas
Contributor

If you run Python with the -Wdefault flag it should enable the display of deprecation warnings. They're suppressed by default. I remember catching one such warning with the version of boto we're currently on during a regular cluster launch.

Using a code coverage tool to make sure we don't miss any branches sounds like a good idea. What tool would you use to do that for Python?

@JoshRosen
Contributor Author

Ah, enabling warnings is a good idea. I could just add that flag to the spark-ec2 Bash script.

For Python, I usually use coverage: https://pypi.python.org/pypi/coverage/3.7.1

@nchammas
Contributor

That sounds like a good idea. That way if we change something in the future to rely on a deprecated feature, we'll immediately notice during testing.
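
For illustration, the same effect can also be had from inside Python via the standard warnings module; a minimal sketch, not part of this patch:

# Minimal illustration (not part of this patch) of surfacing deprecation
# warnings, which Python suppresses by default.
import warnings

# Roughly what `python -Wdefault` or PYTHONWARNINGS=default does: show each
# distinct warning once per location.
warnings.simplefilter("default")

# During testing we could go further and turn deprecations into hard errors
# so they can't be missed.
warnings.simplefilter("error", DeprecationWarning)
warnings.simplefilter("error", PendingDeprecationWarning)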

@SparkQA

SparkQA commented Dec 18, 2014

Test build #24602 has finished for PR 3737 at commit 587ae89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py:583: PendingDeprecationWarning: The current get_all_instances implementation will be replaced with get_all_reservations.

Quoting from the Boto docs:

    def get_all_instances(self, instance_ids=None, filters=None, dry_run=False,
                          max_results=None):
        """
        Retrieve all the instance reservations associated with your account.

        .. note::
            This method's current behavior is deprecated in favor of
            :meth:`get_all_reservations`.  A future major release will change
            :meth:`get_all_instances` to return a list of
            :class:`boto.ec2.instance.Instance` objects as its name suggests.
            To obtain that behavior today, use :meth:`get_only_instances`.

@JoshRosen
Contributor Author

The deprecation warning idea was good; it turns out that there's a pending deprecation which will change the semantics of one of the methods we use, so I upgraded the code according to the documentation's suggestion.
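
A rough sketch of the shape of that change, assuming conn is a boto EC2 connection; the helper name and filter below are illustrative, not the exact spark_ec2.py diff:

# Illustrative only: switch from boto's pending-deprecated get_all_instances()
# to get_all_reservations(), which returns the same thing today, and flatten
# the reservations into instances explicitly.
def get_instances(conn, group_names):
    reservations = conn.get_all_reservations(
        filters={"instance.group-name": group_names})
    instances = []
    for res in reservations:
        instances.extend(res.instances)
    return instances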

For obtaining code coverage metrics, I applied the following change to the Bash script:

diff --git a/ec2/spark-ec2 b/ec2/spark-ec2
index 3abd3f3..4e3fc69 100755
--- a/ec2/spark-ec2
+++ b/ec2/spark-ec2
@@ -22,4 +22,6 @@
 #+ the underlying Python script.
 SPARK_EC2_DIR="$(dirname $0)"

-python -Wdefault "${SPARK_EC2_DIR}/spark_ec2.py" "$@"
+
+export PYTHONWARNINGS="default"
+coverage run -a "${SPARK_EC2_DIR}/spark_ec2.py" "$@"

The -a option tells coverage to accumulate information across multiple runs. I performed an iterative process where I interactively ran spark-ec2, used coverage html to generate a report, then went back and ran more commands to exercise the code areas that I missed.

With the workloads that I ran (launching spot clusters, stopping and starting a cluster, destroying / creating security groups, logging in, canceling spot instance requests), I got to 80% line coverage; most of the lines that I missed were error-handling code.

Here's a link to an ASCII coverage report, produced with coverage annotate spark_ec2.py: https://gist.github.com/JoshRosen/c09a742805bae3503185

According to the docs:

Usage: coverage annotate [options] [modules]

Make annotated copies of the given files, marking statements that are executed
with > and statements that are missed with !.

As you can see, the coverage is pretty good. Therefore, I'd be comfortable merging this PR now.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24630 has finished for PR 3737 at commit f02935d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor

Nice work, Josh.

I took a quick look at the coverage report. It looks like most of it is covered. If we want to be extra thorough, I think there are a few more things relevant to boto that were not hit. (Or perhaps you hit them but they were not captured in this particular report file.)

  • L353 - launching into a VPC
  • L430, L439 - stuff related to EBS volumes and block mapping
  • L507 - regular launch
  • L531 - resume interrupted launch

By the way, I noticed some unused (and unreachable) code at L653. We can probably delete it.

@JoshRosen
Contributor Author

I think a few of the skipped lines themselves are okay, since L354 is similar to some calls right below it that were actually run. On the other hand, the fact that we didn't run the VPC code path means that we didn't end up calling run with the VPC arguments, so it's still possible that those could error out.

This is one of those cases where line coverage is a useful minimum standard but not the be-all and end-all of coverage metrics, since it doesn't capture the space of different configurations / arguments that we call Boto with.

Since I've got to launch a cluster anyways, I'll try spinning up one more in a VPC just to be safe.
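
For reference, a hedged sketch of the kind of VPC-specific arguments that launch path hands to boto 2. The AMI, zone, and subnet come from the log and command below; the security group ID is a placeholder, and this is not the exact spark_ec2.py call.

# Illustrative sketch, not the exact spark_ec2.py code path.
import boto.ec2

conn = boto.ec2.connect_to_region("us-west-2")
reservation = conn.run_instances(
    image_id="ami-ae6e0d9e",          # Spark AMI from the launch log
    min_count=2,
    max_count=2,
    key_name="joshrosen",
    instance_type="m3.xlarge",
    placement="us-west-2a",
    # In a VPC, security groups are passed by ID rather than by name, and the
    # target subnet must be given explicitly.
    subnet_id="subnet-ebcb768e",
    security_group_ids=["sg-00000000"],  # placeholder ID
)
print "Launched %d instances" % len(reservation.instances)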

Good call on the setup_standalone_cluster code, by the way; it looks unused, so I'll remove it.

@JoshRosen
Contributor Author

Hmm, I tried running

./spark-ec2 \
  -t m3.xlarge \
  -s 2 \
  -k joshrosen \
  -i /Users/joshrosen/.ssh/joshrosen.pem \
  --ebs-vol-size 10 \
  --ebs-vol-num 2 \
  -r us-west-2 \
  --zone us-west-2a \
  --spark-version 1.1.0 \
  --swap 2048 \
  --vpc-id vpc-0778a362 \
  --subnet-id subnet-ebcb768e \
  launch josh-benchmarking3

Looks like it hit some sort of race condition:

Setting up security groups...
Creating security group josh-benchmarking3-master
Creating security group josh-benchmarking3-slaves
Searching for existing cluster josh-benchmarking3...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 2 slaves in us-west-2a, regid = r-1f9f8914
Launched master in us-west-2a, regid = r-d49187df
Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-92214398' does not exist</Message></Error></Errors><RequestID>22b657b9-f270-4795-a268-7a6bb3453947</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 1173, in <module>
    main()
  File "./spark_ec2.py", line 1165, in main
    real_main()
  File "./spark_ec2.py", line 1019, in real_main
    cluster_state='ssh-ready'
  File "./spark_ec2.py", line 714, in wait_for_cluster_state
    i.update()
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/instance.py", line 413, in update
    rs = self.connection.get_all_reservations([self.id], dry_run=dry_run)
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 682, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1182, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-92214398' does not exist</Message></Error></Errors><RequestID>22b657b9-f270-4795-a268-7a6bb3453947</RequestID></Response>

I've tried passing the --resume flag and the launch seems to be proceeding, so maybe this was just a transient error that's not related to this patch. I don't know for sure, though. Have you seen this one before?

@nchammas
Contributor

Yes, I've been getting this type of error with 1.1.1. I haven't had time to look into it, but I suspect there is some subtle thing about the EC2 API that we are not honoring which occasionally leads to this problem. It could also just be general AWS flakiness (e.g. due to metadata like available instances needing to be replicated and whatnot) that we need to account for.

I don't think this is related to the changes in this PR.

@nchammas
Contributor

I'm curious: After calling --resume, did the instances in the EC2 web console get tagged with friendly names? That's one thing I noticed the last time I got this error. Things broke before spark-ec2 could tag the instances with names.

@JoshRosen
Contributor Author

Not sure, since I accidentally launched my instances on a private subnet of my VPC, so spark-ec2 was unable to SSH into them and I had to start over. Trying again on a public subnet.

@industrial-sloth
Contributor

Interesting - I'm just now hitting the same error: <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-471a82b9' does not exist</Message></Error></Errors> after destroying one cluster and quickly launching another - although to be fair it is not obvious to me whether the rapid destroy / launch combo is actually related to this issue. I've quickly relaunched clusters a few times previously but this is my first time seeing this particular error.

@nchammas, you are correct that in my case the slaves aren't coming up with tags after --resume - the master however does have a nice name assigned to it.

@JoshRosen
Contributor Author

Hmm, so I guess I don't know how to properly configure AWS VPCs, since I've been having trouble even SSHing manually into EC2 instances launched in my VPC. Maybe I can defer VPC documentation improvements / instructions to a separate PR.

Trying to launch a non-spot cluster now as one final test.

@JoshRosen
Contributor Author

although to be fair it is not obvious to me whether the rapid destroy / launch combo is actually related to this issue.

I don't think it is, since I saw it after creating fresh security groups.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24652 has finished for PR 3737 at commit 0aa43cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Since this doesn't appear to have introduced any new issues, I'm going to merge this into master in order to resolve the known issue with launching spot clusters. Let's open a separate PR to handle testing / documentation of the VPC feature.

@asfgit closed this in c28083f Dec 20, 2014
@nchammas
Contributor

I suspect the "instance ID 'i-471a82b9' does not exist" errors stem from tagging the instances in a separate call from the one that launches them. The time between launch and tagging is so small that we occasionally hit AWS metadata replication slowness, which manifests as "instance does not exist".

The fix would be to tag the instances in the same call that launches them, if possible; failing that, we could retry the tagging after a very short delay (e.g. 0.3s) if the initial attempt fails.
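
A minimal sketch of that retry idea (illustrative only, not committed code):

# Illustrative retry sketch: a freshly launched instance may not yet be
# visible to the tagging API, so retry briefly on InvalidInstanceID.NotFound.
import time
from boto.exception import EC2ResponseError


def tag_with_retry(instance, name, attempts=5, delay=0.3):
    for attempt in range(attempts):
        try:
            instance.add_tag(key="Name", value=name)
            return
        except EC2ResponseError as e:
            if e.error_code == "InvalidInstanceID.NotFound" and attempt < attempts - 1:
                time.sleep(delay)
            else:
                raise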

asfgit pushed a commit that referenced this pull request Dec 23, 2014
PR #3737 changed `spark-ec2` to automatically download boto from PyPI. This PR tells git to ignore those downloaded library files.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3770 from nchammas/ignore-ec2-lib and squashes the following commits:

5c440d3 [Nicholas Chammas] gitignore downloaded EC2 libs
@nchammas
Contributor

@JoshRosen There are a couple of things that we may want to change about this new behavior in the future. The first is that --help now requires a download to work, which may surprise users. The second (an unlikely issue, but worth noting) is that people running spark-ec2 off a read-only mount won't be able to do so anymore.

@JoshRosen
Contributor Author

@nchammas Thanks for raising those concerns. The --help issue might not be too hard to fix (we may be able to do some lazy-loading of boto). For read-only mounts, I don't see a great solution: I don't want to continue bundling a zip file in the Spark source, since the boto download is huge (even after compression). Maybe we could package it when making binary distributions, though.
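
For illustration, one possible shape of the lazy-loading idea; parse_args() and setup_boto() below are hypothetical stand-ins rather than the script's actual structure:

# Hypothetical sketch: handle argument parsing (and therefore --help) before
# boto is downloaded or imported, and only fetch it when a command actually
# needs an EC2 connection.
def real_main():
    opts, action, cluster_name = parse_args()  # --help prints usage and exits here
    setup_boto()                               # download into lib/ only if missing
    import boto.ec2                            # deferred until after parsing
    conn = boto.ec2.connect_to_region(opts.region)
    return conn, action, cluster_name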

@piskvorky

@JoshRosen how exactly does boto get distributed to the cluster EC2 machines?

My app is failing and it seems to be connected to the fact that all EC2 nodes have boto 2.8.0 (as opposed to the driver machine, which has correctly downloaded and used 2.34.0).

@JoshRosen
Contributor Author

@piskvorky, the boto on the EC2 instances themselves should be provided by the AMI, AFAIK.

@JoshRosen deleted the update-boto branch September 17, 2015 21:32
@piskvorky

Got it, thanks @JoshRosen!

By the way, would you be interested in a spark_ec2.py patch that exposes its functionality programmatically as well?

Right now, it's a bit hard to use as part of a larger pipeline, with all the sys.exits and prints and raw_inputs sprinkled inside.
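
For example, a hypothetical sketch of what a more library-friendly entry point could look like; the names are illustrative, not the script's current structure:

# Hypothetical sketch: raise exceptions and return values instead of calling
# sys.exit(), print, and raw_input(), so other tools can drive spark-ec2
# programmatically.
class UsageError(Exception):
    pass


def launch_cluster(conn, opts, cluster_name):
    """Launch a cluster and return (master_nodes, slave_nodes)."""
    if getattr(opts, "key_pair", None) is None:
        raise UsageError("A key pair is required (-k / --key-pair).")
    # ... launch logic elided in this sketch ...
    return [], []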

@piskvorky

(sorry to hijack this issue, will open a new one if you think the patch is worth it)
