[SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it #3737
Conversation
|
Patch looks good to me, but I'll try to test it out later this week. |
|
I think that the main risk of this patch is that boto has deprecated / removed / changed functionality that we rely on, but that we won't notice this until users run commands that exercise those branches (yay dynamic languages!). I was able to launch a basic spot cluster, so I think that we should be in pretty good shape. I could try running a code coverage tool on this while I launch the cluster to see if there are any branches that I've missed, then just look through the Boto docs to see whether that functionality's still present. OTOH, if Boto is good about maintaining backwards compatibility, then I guess it's probably fine to merge this and wait to see if we hit problems. |
|
If you run Python with deprecation warnings enabled (e.g. the `-Wdefault` flag), Boto should warn us about any deprecated functionality we're still calling. Using a code coverage tool to make sure we don't miss any branches sounds like a good idea. What tool would you use to do that for Python? |
|
Ah, enabling warnings is a good idea. I could just add that flag to the `spark-ec2` launcher script. For Python, I usually use `coverage.py`. |
|
That sounds like a good idea. That way if we change something in the future to rely on a deprecated feature, we'll immediately notice during testing. |
|
Test build #24602 has finished for PR 3737 at commit
|
/Users/joshrosen/Documents/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py:583: PendingDeprecationWarning: The current get_all_instances implementation will be replaced with get_all_reservations.
Quoting from the Boto docs:
def get_all_instances(self, instance_ids=None, filters=None, dry_run=False,
max_results=None):
"""
Retrieve all the instance reservations associated with your account.
.. note::
This method's current behavior is deprecated in favor of
:meth:`get_all_reservations`. A future major release will change
:meth:`get_all_instances` to return a list of
:class:`boto.ec2.instance.Instance` objects as its name suggests.
To obtain that behavior today, use :meth:`get_only_instances`.
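The semantic difference matters: `get_all_instances` returns a list of reservations, not instances, and each reservation holds one or more instances. A minimal sketch of the flattening step, using hypothetical stand-in classes rather than real Boto objects:

```python
# Stand-ins for boto.ec2 Reservation / Instance objects (hypothetical,
# for illustration only -- real code would call conn.get_all_reservations()).
class Instance(object):
    def __init__(self, instance_id):
        self.id = instance_id

class Reservation(object):
    def __init__(self, instances):
        self.instances = instances

def flatten_reservations(reservations):
    """Turn a list of reservations into a flat list of instances,
    mirroring what get_only_instances() does for you."""
    return [inst for res in reservations for inst in res.instances]

reservations = [
    Reservation([Instance("i-aaa"), Instance("i-bbb")]),
    Reservation([Instance("i-ccc")]),
]
print([i.id for i in flatten_reservations(reservations)])
```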
|
The deprecation warning idea was good; it turns out that there's a pending deprecation which will change the semantics of one of the methods we use, so I upgraded the code according to the documentation's suggestion. For obtaining code coverage metrics, I applied the following change to the Bash script:

diff --git a/ec2/spark-ec2 b/ec2/spark-ec2
index 3abd3f3..4e3fc69 100755
--- a/ec2/spark-ec2
+++ b/ec2/spark-ec2
@@ -22,4 +22,6 @@
 #+ the underlying Python script.
 SPARK_EC2_DIR="$(dirname $0)"
-python -Wdefault "${SPARK_EC2_DIR}/spark_ec2.py" "$@"
+
+export PYTHONWARNINGS="default"
+coverage run -a "${SPARK_EC2_DIR}/spark_ec2.py" "$@"

With the workloads that I ran (launching spot clusters, stopping and starting a cluster, destroying / creating security groups, logging in, canceling spot instance requests), I got to 80% line coverage; most of the lines that I missed were error-handling code. Here's a link to an ASCII coverage report.
As you can see, the coverage is pretty good. Therefore, I'd be comfortable merging this PR now. |
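The effect of that `PYTHONWARNINGS="default"` export can be checked in isolation. A small sketch, independent of Boto, that runs a child interpreter with and without the variable set and inspects its stderr:

```python
import os
import subprocess
import sys

# A one-liner that raises a PendingDeprecationWarning, as Boto's
# get_all_instances() does.
SNIPPET = "import warnings; warnings.warn('old API', PendingDeprecationWarning)"

def stderr_with(pythonwarnings):
    """Run SNIPPET in a child interpreter, optionally with PYTHONWARNINGS set."""
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONWARNINGS", "PYTHONDEVMODE")}
    if pythonwarnings is not None:
        env["PYTHONWARNINGS"] = pythonwarnings
    proc = subprocess.run([sys.executable, "-c", SNIPPET],
                          env=env, capture_output=True, text=True)
    return proc.stderr

# PendingDeprecationWarning is ignored by Python's default warning filters;
# exporting PYTHONWARNINGS=default makes it visible.
print("silent run warned:", "PendingDeprecationWarning" in stderr_with(None))
print("default run warned:", "PendingDeprecationWarning" in stderr_with("default"))
```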
|
Test build #24630 has finished for PR 3737 at commit
|
|
Nice work, Josh. I took a quick look at the coverage report. It looks like most of it is covered. If we want to be extra thorough, I think there are a few more things relevant to boto that were not hit. (Or perhaps you hit them but they were not captured in this particular report file.)
By the way, I noticed some unused (and unreachable) code at L653. We can probably delete it. |
|
I think a few of the skipped lines themselves are okay, since L354 is similar to some calls right below it that were actually run. On the other hand, the fact that we didn't run the VPC code path means that some Boto calls were never exercised. This is one of those cases where line coverage is kind of a minimum standard but not the end-all of coverage metrics, since it doesn't capture the space of different configurations / arguments that we call Boto with. Since I've got to launch a cluster anyways, I'll try spinning up one more in a VPC just to be safe. Good call on the unused code at L653. |
|
Hmm, I tried running:

./spark-ec2 \
-t m3.xlarge \
-s 2 \
-k joshrosen \
-i /Users/joshrosen/.ssh/joshrosen.pem \
--ebs-vol-size 10 \
--ebs-vol-num 2 \
-r us-west-2 \
--zone us-west-2a \
--spark-version 1.1.0 \
--swap 2048 \
--vpc-id vpc-0778a362 \
--subnet-id subnet-ebcb768e \
launch josh-benchmarking3

Looks like it hit some sort of race condition. |
|
Yes, I've been getting this type of error with 1.1.1. I haven't had time to look into it, but I suspect there is some subtle thing about the EC2 API that we are not honoring which occasionally leads to this problem. It could also just be general AWS flakiness (e.g. due to metadata like available instances needing to be replicated and whatnot) that we need to account for. I don't think this is related to the changes in this PR. |
|
I'm curious: after calling `launch`, are the slaves coming up with tags? |
|
Not sure, since I ended up accidentally launching my instances on a private subnet of my VPC, so I couldn't get in to check. |
|
Interesting - I'm just now hitting the same error. @nchammas, you are correct that in my case the slaves aren't coming up with tags after `launch`. |
|
Hmm, so I guess I don't know how to properly configure AWS VPCs, since I've been having trouble even SSHing manually into EC2 instances launched in my VPC. Maybe I can defer VPC documentation improvements / instructions to a separate PR. Trying to launch a non-spot cluster now as one final test. |
|
I don't think it is, since I saw it after creating fresh security groups. |
|
Test build #24652 has finished for PR 3737 at commit
|
|
Since this doesn't appear to have introduced any new issues, I'm going to merge this in. |
|
I suspect the tagging failure is a race between the instances being created and our attempt to tag them. The fix would be either to tag the instances in the same call that launches them, if possible, or failing that to retry tagging them after a very short delay (e.g. 0.3s) if the initial attempt fails. |
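The retry-after-a-short-delay idea can be sketched generically. This is not the actual spark_ec2.py fix, just an illustration of the pattern; the `tag_instance` call below is a simulated stand-in for the real EC2 request:

```python
import time

def retry(operation, attempts=3, delay=0.3):
    """Call operation(); on failure, wait briefly and try again.

    EC2's API is eventually consistent, so a tag request issued
    immediately after launching instances can fail because the new
    instance IDs haven't propagated yet; a short delay usually helps.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}
def tag_instance():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("InvalidInstanceID.NotFound (simulated)")
    return "tagged"

print(retry(tag_instance))  # succeeds on the third attempt
```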
PR #3737 changed `spark-ec2` to automatically download boto from PyPI. This PR tells git to ignore those downloaded library files.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3770 from nchammas/ignore-ec2-lib and squashes the following commits:

5c440d3 [Nicholas Chammas] gitignore downloaded EC2 libs
|
@JoshRosen There are a couple of things that we may want to change about this new behavior in the future. |
|
@nchammas Thanks for raising those concerns. |
|
@JoshRosen how exactly does this Boto download work? My app is failing and it seems to be connected to the fact that all EC2 nodes have boto 2.8.0 (as opposed to the driver machine, which has correctly downloaded and used 2.34.0). |
|
@piskvorky, the Boto copy downloaded by this patch is only used by the `spark-ec2` launcher script on the machine you run it from; it doesn't change the Boto version installed on the cluster nodes themselves. |
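One common way a script prefers a locally downloaded library over a system-wide install is to put the download directory at the front of `sys.path` before importing it. A sketch of that pattern (the module here is fabricated for the demo, not taken from the actual script):

```python
import os
import sys
import tempfile

# Fabricate a tiny "vendored" module on disk to show the effect.
lib_dir = tempfile.mkdtemp()
with open(os.path.join(lib_dir, "vendored_example.py"), "w") as f:
    f.write("VERSION = '2.34.0'\n")

# The downloaded copy wins over any system-wide install because its
# directory is searched first.
sys.path.insert(0, lib_dir)
import vendored_example

print(vendored_example.VERSION)
```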
|
Got it, thanks @JoshRosen! By the way, would you be interested in a spark_ec2.py patch that exposes its functionality programmatically as well? Right now, it's a bit hard to use as part of a larger pipeline. |
|
(sorry to hijack this issue, will open a new one if you think the patch is worth it) |
This patch upgrades spark-ec2's Boto version to 2.34.0, since the old version is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile, since they try to read a JSON file from a path relative to the Boto library sources. Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPI if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users, since they already need an internet connection to launch an EC2 cluster. By performing the download in spark_ec2.py instead of the Bash script, this should also work for Windows users. I've tested this with Python 2.6, too.
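The download-if-missing logic described above can be sketched as follows. The function and file names are illustrative, not the actual spark_ec2.py code, and the network fetch is injected as a callable so the skeleton can be exercised without touching PyPI:

```python
import os
import tempfile

def ensure_lib(lib_dir, name, fetch):
    """Download `name` into lib_dir via fetch(name, dest) unless it is
    already present -- the same idea spark-ec2 uses for Boto."""
    dest = os.path.join(lib_dir, name)
    if not os.path.exists(dest):
        if not os.path.exists(lib_dir):
            os.makedirs(lib_dir)
        fetch(name, dest)
    return dest

# Stub fetcher standing in for a real urllib download from PyPI.
fetched = []
def fake_fetch(name, dest):
    fetched.append(name)
    open(dest, "w").close()

lib_dir = os.path.join(tempfile.mkdtemp(), "lib")
ensure_lib(lib_dir, "boto-2.34.0", fake_fetch)
ensure_lib(lib_dir, "boto-2.34.0", fake_fetch)  # second call is a no-op
print(fetched)  # downloaded exactly once
```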