
Add support for launching in private vpc #123

Closed
wants to merge 8 commits into from

Conversation

jperezdiaz

This PR makes the following changes:

Fixes #14.

@BenFradet
Contributor

BenFradet commented May 26, 2016

You can run py.test tests/test_static.py to check style issues as detailed in the test guide.

command="""
set -e

fullname=`hostname`.ec2.internal
Owner

core.py should not have any logic that's specific to a provider like EC2. If someone added GCE support to Flintrock tomorrow, we'd want to be able to reuse all the logic in core.py mostly as-is.

@nchammas
Owner

Thanks for taking this on @jorgito1167!

This looks like a good start. I think this PR can be made simpler and better by making vpc_is_private a @property of the EC2Cluster class. That will eliminate the need for many of the little changes that have been made to helper methods in ec2.py.
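For illustration, a minimal sketch of such a property, assuming the cluster keeps a boto3 `Instance` handle for its master (the constructor and attribute names here are hypothetical, not Flintrock's actual code):

```python
import boto3


class EC2Cluster:
    def __init__(self, region: str, master_instance_id: str):
        ec2 = boto3.resource('ec2', region_name=region)
        self.master_instance = ec2.Instance(master_instance_id)

    @property
    def vpc_is_private(self) -> bool:
        # A master with no public IP is used as a simple proxy for
        # "launched into a private subnet"; callers just read the property
        # instead of threading a use_private_vpc flag through every helper.
        return self.master_instance.public_ip_address is None
```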

flintrock_client_ip = (
    urllib.request.urlopen('http://checkip.amazonaws.com/')
    .read().decode('utf-8').strip())
if use_private_vpc:
Author

If the host running Flintrock does not have a public IP, the previous method for getting the IP address won't return the private IP. The current solution assumes that the user will always launch a cluster into a private subnet from a host with no public IP. Is there a better way of handling this?

Owner

Good question. I think whether the Flintrock client has a private IP is distinct from whether the cluster is in a private VPC (though often the two will go together). So we need a way of determining what IP to use for the Flintrock client that doesn't depend on the cluster.

Perhaps we can simply query checkip first, and if that fails, fall back to gethostbyname()?

Author

The problem is that it does not fail. It simply returns the public IP as seen by the AWS checkip server.

Owner

Ah, right, and that won't be the IP that the cluster sees when Flintrock tries to connect, if both the client and the cluster are together on a private subnet.

Perhaps then we should always authorize both addresses, public and private?
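For illustration, a minimal sketch of authorizing both addresses with boto3 (the function name, group ID handling, and wide-open port range are hypothetical; Flintrock's real rules are narrower):

```python
import boto3


def authorize_client_ips(group_id: str, public_ip: str, private_ip: str) -> None:
    """Open the cluster's security group to both client addresses."""
    ec2 = boto3.client('ec2')
    for ip in {public_ip, private_ip}:  # the set dedupes if they are equal
        ec2.authorize_security_group_ingress(
            GroupId=group_id,
            IpProtocol='tcp',
            FromPort=0,
            ToPort=65535,
            CidrIp=ip + '/32',  # /32 limits each rule to one address
        )
```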

Author

That might be dangerous, since the IP returned by the checkip server is shared by all of the computers behind the NAT instance. I'm not sure whether authorizing it would allow access from those other computers.

Owner

That seems fairly innocuous to me, since those other machines would be under your account. Or are you saying machines from different AWS accounts may share the same public IP address?

Author

jperezdiaz commented May 27, 2016

I'm not sure, to be honest. I would like to think that it doesn't happen, but it would be great to come up with a solution that doesn't authorize both IPs. For the purposes of this PR we can just go with authorizing both. Later we can include an option to authorize only a specific security group or a list of IP addresses that the user provides.

Owner

I've reviewed the relevant docs.

It sounds like your VPC has an internet gateway attached, otherwise the call to checkip would fail, or perhaps would return a private address.

Even with the gateway attached, from my reading of the docs, it sounds like only instances from the same VPC can ever get the same public IP address. So I don't think it's an issue to authorize both addresses.

Author

Sounds good. I'll add a try/except to handle the case where it fails.
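For illustration, a minimal sketch of that try/except, assuming a helper name; only the checkip URL comes from the diff above:

```python
import socket
import urllib.request


def get_flintrock_client_ip() -> str:
    """Return the public IP as seen by AWS, falling back to the host's own
    address when checkip is unreachable (e.g. no route to the internet)."""
    try:
        return (
            urllib.request.urlopen('http://checkip.amazonaws.com/', timeout=5)
            .read().decode('utf-8').strip())
    except OSError:
        # urllib.error.URLError subclasses OSError, so this also catches
        # DNS failures and connection timeouts on gateway-less subnets.
        return socket.gethostbyname(socket.gethostname())
```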

@nchammas
Owner

Hey @jperezdiaz (did you recently change your username?), I took a quick look through your latest changes and things generally look good.

There are still some open items, though:

  • core.py still includes some logic that is specific to EC2. We need to factor that out, or possibly do away with it entirely. Do you have any suggestions in that regard? Is instance.private_dns_name not giving us something usable "out-of-the-box"?
  • I just tried launching a cluster off of this PR using my current config with a public VPC and it failed. Well, the launch technically succeeded but Spark couldn't come up. It looks like the master had trouble binding to an address. Does this work for you?

@jperezdiaz
Author

jperezdiaz commented May 30, 2016

Yes, I changed my username recently. Sorry for the confusion.

  • It seems like the public VPC did not have problems with the hostname. We can condition the /etc/hosts script on the subnet being private. I also like the idea of getting the instance's private DNS to replace the hardcoded .ec2.internal (sketched below).
  • How can I test if the master successfully bound to an address? I'm pretty sure I'm having a similar problem even with the private VPC.
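For illustration, how the private DNS name can be read off a boto3 `Instance` rather than hardcoded (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')
instance = ec2.Instance('i-0123456789abcdef0')  # placeholder ID

# Avoids hardcoding the us-east-1-only .ec2.internal suffix; other regions
# return names like ip-10-0-0-5.us-west-2.compute.internal.
fullname = instance.private_dns_name
```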

@nchammas
Owner

How can I test if the master successfully bound to an address? I'm pretty sure I'm having a similar problem even with the private VPC.

When you launch a cluster and the master fails to start properly, the Spark health check should show 0 workers. Then, if you log in to the cluster and start a shell (either spark-shell or pyspark), you'll also get an error. It should be pretty obvious.

@jperezdiaz
Author

It is non-trivial for me to test in a public VPC. However, it seems to launch fine in the private VPC. The health check reports 1 worker and I'm able to start pyspark. Could the problem be due to the change in the /etc/hosts file? In that case, we can just condition the change on the subnet being private.

On an unrelated note, the problem I see is that when I start pyspark I get the following message for the Spark UI:

16/05/31 17:55:17 INFO SparkUI: Started SparkUI at http://<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>:4040

@nchammas
Owner

nchammas commented Jun 1, 2016

Could the problem be due to the change in the /etc/hosts file? In that case, we can just condition the change on the subnet being private.

I'm not sure, but I am suspicious of that change.

On an unrelated note, the problem I see is that when I start pyspark I get the following message for the Spark UI:

You're seeing this when you launch in a private VPC?

@jperezdiaz
Author

  • Is there a good way of figuring out the private DNS of the instances from inside the provision_node function? The instances attribute is specific to the EC2 cluster, so it should not be used. I could get the slave_ips list, find the index of the current IP, and then use it to index the private DNS list. Otherwise, I can create a method on the FlintrockCluster object that does this for us.
  • Yes, the Spark UI problem happens when I launch in a private VPC.

@nchammas
Owner

nchammas commented Jun 4, 2016

Is there a good way of figuring out the private DNS of the instances from inside the provision_node function?

I think provision_node() and the rest of the code in core.py should ideally just know about IP addresses, and not distinguish whether they are public or private.
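As a sketch of that boundary (parameter names are illustrative, not Flintrock's actual signature), core.py would accept a bare address and leave the public/private decision to ec2.py:

```python
def provision_node(*, host: str, identity_file: str, user: str) -> None:
    """Provision one node over SSH. `host` is whatever address the provider
    module chose to hand over; core.py never inspects whether it is a
    public or private IP."""
    ...
```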

Yes, the Spark UI problem happens when I launch in a private VPC.

Does Spark work otherwise?

I'm trying to replicate your setup so I can help you test this. Can you lay out the VPC, subnet, routing table, etc. you have and how they are configured so I can set up a parallel environment?

Ideally we should capture this setup as code and use it in an acceptance test (e.g. set up a private VPC, test launch/Spark/HDFS, tear down cluster and VPC), but that's probably a bit much for now. We can add that in later, unless you feel like having a go at it now.
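For illustration, a rough sketch of such a fixture with pytest and boto3 (region and CIDR blocks are arbitrary; a real test would also launch the cluster and exercise Spark/HDFS before teardown):

```python
import boto3
import pytest


@pytest.fixture
def private_subnet():
    ec2 = boto3.client('ec2', region_name='us-east-1')
    vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')['Vpc']
    subnet = ec2.create_subnet(
        VpcId=vpc['VpcId'], CidrBlock='10.0.0.0/24')['Subnet']
    # No internet gateway is attached, so the subnet stays private.
    try:
        yield subnet['SubnetId']
    finally:
        ec2.delete_subnet(SubnetId=subnet['SubnetId'])
        ec2.delete_vpc(VpcId=vpc['VpcId'])
```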

@jperezdiaz
Author

I like the idea of setting up the test. I'm really busy this week but I'll try to implement it once I get some time.

@nchammas
Owner

Hey @jperezdiaz, are you interested in updating this PR?

Looking through the history, it looks like we agreed on the basic approach you took here, but there were 2 unresolved issues:

  1. EC2-specific logic was added to core.py. core.py should be completely provider-agnostic.
  2. You were experiencing some issues with the launch and/or Spark UI.

If you aren't planning to update it anytime soon, we can close the PR for now and you can revisit it when you are ready.

@nchammas
Owner

nchammas commented Sep 2, 2016

Closing this PR. Feel free to open a new one if you are interested in continuing this work!

@nchammas closed this Sep 2, 2016