AWS EC2 QuickStart
The steps below will guide you through setting up and configuring AWS EC2 for use with APT. After you are done with the setup, APT will be able to perform deep learning training and inference in the cloud.
EC2 instances
You can think of an EC2 instance (as configured below) as a computer with a GPU in the cloud that is under your control. It is your "GPU in the cloud". An instance has:
- An "instance ID", which uniquely identifies it
- A (public) IP address, so you can ssh into it and examine its processes and filesystem contents
- A state that is (apart from brief transitional states) either "RUNNING", "STOPPED", or "TERMINATED"
  - If the state is RUNNING, your instance is either actively computing or ready to do so. You are paying about $1/hour.
  - A state of STOPPED is analogous to having your desktop workstation in "hibernation" or "sleep" mode. No computations are being carried out, but the results of previous computations (trained models, etc.) are saved in the remote filesystem. You are paying a very low price based on the amount of disk storage ("EBS" storage in AWS-speak) that has been allocated for your instance. For 50GB of storage (the current default), you are paying about $5/month.
  - A state of TERMINATED means that your instance has ceased to exist and returned to the ephemeral protoplasm of the cloud! Any and all state is destroyed at termination, so before you terminate an instance, make sure you tell APT to download all your trained models. As you might guess, terminated instances do not cost anything.
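To put the numbers above together: the monthly bill is roughly (hours RUNNING × hourly rate) + (GB of storage × storage rate). Below is a minimal shell sketch of that arithmetic. It assumes about $0.90/hour for a p2.xlarge (the rate quoted in the setup steps) and $0.10 per GB-month for EBS storage (which is what $5/month for 50GB works out to); real prices vary by region and change over time, and the function name `apt_monthly_cost` is made up for illustration.

```shell
# Hypothetical helper: rough monthly EC2 cost for one APT instance.
# Assumed rates (check current AWS pricing for your region):
#   HOURLY_RATE - p2.xlarge on-demand price, charged only while RUNNING (USD/hour)
#   EBS_RATE    - EBS storage price, charged while RUNNING or STOPPED (USD/GB-month)
apt_monthly_cost() {
  local hours_running="$1"        # hours the instance spends RUNNING this month
  local storage_gb="${2:-50}"     # allocated EBS storage; 50GB is the default mentioned above
  local hourly_rate="${HOURLY_RATE:-0.90}"
  local ebs_rate="${EBS_RATE:-0.10}"
  # awk does the floating-point math and rounds to cents.
  awk -v h="$hours_running" -v gb="$storage_gb" -v hr="$hourly_rate" -v er="$ebs_rate" \
    'BEGIN { printf "%.2f\n", h * hr + gb * er }'
}
```

For example, 40 hours of RUNNING time plus a month of 50GB storage gives `apt_monthly_cost 40` → 41.00, while an instance that stays STOPPED all month gives `apt_monthly_cost 0` → 5.00.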
Using AWS with APT
Since STOPPED instances are inexpensive, APT is currently designed around the idea that you will create an EC2 instance and "leave it up" for a stretch of time (e.g., weeks, maybe even a month or two) while you do a bunch of work for a project. During this time, you can iteratively label, train, and track within APT over multiple sessions. Between active working sessions, your instance is STOPPED, and all APT state, including trained models and the movies/trxfiles to be tracked, is preserved. Eventually you will reach a stopping point for the project, and you can instruct APT to download your trained models to your local workstation. (Tracking results are currently downloaded immediately after each tracking session.) When the download is complete, you can terminate the EC2 instance.
APT automates starting/stopping your instance (*ultimately; see below), starting/stopping training and tracking processes, and so on. However, APT will inevitably become "disconnected" at times from what is happening in the cloud. At those times you will need to manage the EC2 instance manually, e.g., by ssh-ing into the instance and killing processes, or by stopping the instance via the AWS dashboard.
Be proactive and check on your instance!
(*) TODO: Currently starting/stopping instances is still the user's responsibility, but it appears this could be handled by APT without much trouble.
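Until APT handles this itself, you can check on and start/stop your instance from a terminal once the AWS CLI is configured (steps below). The sketch below wraps real AWS CLI commands (`describe-instances`, `start-instances`, `stop-instances`, and `aws ec2 wait`) in small shell functions; the function names are hypothetical, not part of APT. Note that AWS reports states in lowercase (e.g. `running`, `stopped`) and also has transitional states such as `pending`, `stopping`, and `shutting-down`.

```shell
# Hypothetical helpers wrapping real AWS CLI commands for managing an existing
# APT instance between working sessions. Pass the instance ID (i-...) shown
# in the EC2 console.

# Print the instance state as reported by AWS (lowercase, e.g. "running", "stopped").
apt_instance_state() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].State.Name' --output text
}

# Start a STOPPED instance and block until AWS reports it as running.
apt_instance_start() {
  aws ec2 start-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-running --instance-ids "$1"
}

# Stop a RUNNING instance (its EBS storage, e.g. trained models, is kept) and
# block until AWS reports it as stopped.
apt_instance_stop() {
  aws ec2 stop-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-stopped --instance-ids "$1"
}
```

For example, running `apt_instance_state <instance ID>` at the end of a working session tells you whether you left the instance running.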
- Create an AWS account and set up billing. This is the root account for AWS. You can create an account at https://aws.amazon.com/.
- Create an AWS user (or users) from within the root account (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_console).
First, choose a User name and set the Access type to Programmatic access. Also enable AWS Management Console access, since you will log in as this user in the next step.
To set permissions, select Attach existing policies directly, and add the AdministratorAccess and IAMUserChangePassword policies.
- Log in to the AWS user account using the console.
- Create Access Keys: To create a new Access Key pair for an IAM user, open the IAM console (https://console.aws.amazon.com/iam, or look for IAM under Services - Security, Identity & Compliance on your console home page). Click Users in the navigation pane, click the appropriate IAM user, and then click Create Access Key on the Security Credentials tab.
You can save the access keys anywhere (I save them in my .ssh folder).
Access keys are a sort of login/password pair that the AWS CLI uses to authenticate; they are required to start an instance.
- On Linux, make the file read-only using “chmod 400 path/to/access_key.csv”
- On Windows, ...
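On Linux or macOS, a small helper like the one below (a hypothetical name, not an APT or AWS tool) applies the `chmod 400` step and prints the resulting mode so you can confirm it took effect. It works the same way for the access key CSV and for the ssh key pair created in the next step.

```shell
# Hypothetical helper: make a credentials file readable only by its owner
# (mode 400, as recommended above) and print the resulting mode to confirm.
lock_down() {
  local f="$1"
  [ -f "$f" ] || { echo "no such file: $f" >&2; return 1; }
  chmod 400 "$f" || return 1
  # GNU stat (Linux) uses -c; BSD/macOS stat uses -f.
  stat -c '%a' "$f" 2>/dev/null || stat -f '%Lp' "$f"
}
```

For example, `lock_down path/to/access_key.csv` should print `400`.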
- Create an ssh Key Pair. For this, go to the EC2 console (https://console.aws.amazon.com/ec2, or look for EC2 under Services - Compute on your console home page). On the left, click Key Pairs under Network & Security and then use the Create Key Pair button. Save the key pair in your ~/.ssh folder and make it read-only using “chmod 400 ~/.ssh/key_pair.pem”. This ssh Key Pair is different from the Access Keys above: the ssh Key Pair is used to ssh into a running instance, while Access Keys are required to create an instance. Alternatively, if you already have a key pair, you can import your public key by clicking Import Key Pair.
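If you prefer the command line, the AWS CLI can create the key pair too, once it is installed and configured (steps below). In this sketch the helper name is hypothetical and the key is saved as ~/.ssh/<name>.pem; `aws ec2 create-key-pair` with `--query 'KeyMaterial' --output text` is the standard CLI way to get just the private key. To import an existing public key instead, the CLI equivalent of Import Key Pair is `aws ec2 import-key-pair --key-name <name> --public-key-material fileb://~/.ssh/id_rsa.pub` (adjust the path to your public key).

```shell
# Hypothetical helper: create an EC2 ssh key pair with the AWS CLI and save the
# private key as ~/.ssh/<name>.pem, readable only by you (mode 400).
apt_create_key_pair() {
  local name="$1"
  local pem="$HOME/.ssh/$name.pem"
  # Never clobber an existing private key.
  if [ -e "$pem" ]; then
    echo "refusing to overwrite existing $pem" >&2
    return 1
  fi
  mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
  # umask 077 in a subshell so the file is never world-readable, even briefly.
  if ! (umask 077 && aws ec2 create-key-pair --key-name "$name" \
          --query 'KeyMaterial' --output text > "$pem"); then
    rm -f "$pem"   # don't leave an empty or partial key file behind
    return 1
  fi
  chmod 400 "$pem"
}
```

For example, `apt_create_key_pair key_pair` creates the key pair in EC2 under the name key_pair and writes ~/.ssh/key_pair.pem.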
- Increase the limit on the number of p2.xlarge instances you can create. p2.xlarge is the basic instance type with a GPU suitable for APT backend computations. It costs $0.90/hr. Go to the EC2 console (https://console.aws.amazon.com/ec2), use the Limits option on the left, and request a limit increase for p2.xlarge. My suggestion is to increase the limit to at least 2 instances, so that you can run jobs on one instance and test on the other. This step usually takes a day or so because Amazon performs some verification, but you can continue with the rest of the setup in the meantime.
- Install the AWS CLI (https://docs.aws.amazon.com/cli/latest/userguide/installing.html).
- Configure the AWS CLI (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-quick-configuration). This is where you’ll need to enter the Access Keys. (For JRC users, the default region is us-east-1, Northern Virginia.) Choose json as the default output format.
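The interactive `aws configure` prompt is the easiest way to enter the Access Keys. If you later want to (re)set just the region and output format without the prompt, the standard `aws configure set` subcommand does that. Here it is wrapped in a hypothetical helper, assuming the JRC default region when no region is given.

```shell
# Hypothetical helper: set the default region and output format non-interactively.
# Enter the Access Keys themselves with the interactive "aws configure" prompt,
# which keeps the secret key out of your shell history.
apt_configure_defaults() {
  local region="${1:-us-east-1}"   # us-east-1 (Northern Virginia) is the JRC default
  aws configure set region "$region" &&
    aws configure set output json
}
```

For example, `apt_configure_defaults` sets us-east-1 and json, while `apt_configure_defaults eu-west-2` picks a different region.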
- Check your CLI setup using “aws ec2 describe-regions --output table”. The output should look similar to the following (the full list of regions is longer):
----------------------------------------------------------
| DescribeRegions |
+--------------------------------------------------------+
|| Regions ||
|+-----------------------------------+------------------+|
|| Endpoint | RegionName ||
|+-----------------------------------+------------------+|
|| ec2.ap-south-1.amazonaws.com | ap-south-1 ||
|| ec2.eu-west-3.amazonaws.com | eu-west-3 ||
|| ec2.eu-west-2.amazonaws.com | eu-west-2 ||
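A quick complementary check, sketched below with a hypothetical function name, is to ask AWS which account your Access Keys belong to (`aws sts get-caller-identity`) and to confirm that a default region is set (`aws configure get region`). Both are standard AWS CLI commands.

```shell
# Hypothetical helper: confirm the CLI has working credentials and a default region.
apt_check_cli() {
  local account region
  # Fails if the Access Keys are missing or invalid.
  account="$(aws sts get-caller-identity --query Account --output text)" || return 1
  region="$(aws configure get region)"
  if [ -z "$region" ]; then
    echo "no default region configured; rerun aws configure" >&2
    return 1
  fi
  echo "account=$account region=$region"
}
```

Running `apt_check_cli` prints your account ID and default region, or fails with a message if either is missing.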
- Create a security group: The security group defines the basic firewall for the instance that will be launched using the CLI. Below we create a security group “apt_dl” which allows instances to accept ssh connections from any IP address. Important: Currently you must name your security group "apt_dl" as APT will expect this naming!
$ aws ec2 create-security-group --group-name apt_dl --description "Basic security group for APT deep learning"
{
"GroupId": "sg-b018ced5"
}
$ aws ec2 authorize-security-group-ingress --group-name apt_dl --protocol tcp --port 22 --cidr 0.0.0.0/0
You do not need to do this if another user on your account has already created this security group.
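Because the group only needs to exist once per account, a script can check for it before creating it. The sketch below (hypothetical function name) uses `aws ec2 describe-security-groups --group-names apt_dl`, which fails if the group does not exist, and otherwise runs the same two commands as above. It assumes you are using the account's default VPC, where security groups can be referred to by name.

```shell
# Hypothetical helper: create the "apt_dl" security group only if it does not
# already exist (e.g. another user on the account created it), then open
# port 22 for ssh from any address, mirroring the two commands above.
apt_ensure_security_group() {
  if aws ec2 describe-security-groups --group-names apt_dl >/dev/null 2>&1; then
    echo "apt_dl already exists"
    return 0
  fi
  aws ec2 create-security-group --group-name apt_dl \
    --description "Basic security group for APT deep learning" &&
    aws ec2 authorize-security-group-ingress --group-name apt_dl \
      --protocol tcp --port 22 --cidr 0.0.0.0/0
}
```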
- Once your instance limit has been increased, check that you can launch and ssh into an instance using the EC2 console.
Use the Launch Instance button on your EC2 console page (https://console.aws.amazon.com/ec2).
Select the AMI AMI-APT (ami-0168f57fb900185e1) that we created. You can find this by searching for 0168f57fb900185e1 within Community AMIs.
Next, select p2.xlarge as the instance type.
Then click Review and launch. It will then prompt you for a key pair. Select the ssh key pair you created or uploaded previously.
After launching, go to the EC2 console and use the Instances option to see information about your instances. If you select the recently launched instance in the table, you’ll see all the information about this instance, in particular its IP address (IPv4 Public IP). Use the IP address to ssh into the machine using “ssh -i ~/.ssh/<key_pair.pem> ubuntu@<IPv4 Public IP>”. Replace <key_pair.pem> with the ssh key pair you created and downloaded previously. If instead you imported an existing key into Amazon, replace it with your private key file.
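The console launch above can also be scripted with the AWS CLI. This is a sketch under a few assumptions: the helper name is hypothetical, the instance goes into the default VPC (so `--security-groups apt_dl` can refer to the group by name), and the AMI ID and instance type are the ones given above. `run-instances`, `aws ec2 wait instance-running`, and the `describe-instances` query for `PublicIpAddress` are standard CLI usage.

```shell
# Hypothetical helper: launch an AMI-APT instance from the command line, wait
# until it is running, and print "<instance ID> <public IP>".
# $1 is the name of the key pair you created or imported in EC2.
apt_launch_instance() {
  local key_name="$1" id ip
  id="$(aws ec2 run-instances --image-id ami-0168f57fb900185e1 \
        --instance-type p2.xlarge --key-name "$key_name" \
        --security-groups apt_dl \
        --query 'Instances[0].InstanceId' --output text)" || return 1
  aws ec2 wait instance-running --instance-ids "$id" || return 1
  ip="$(aws ec2 describe-instances --instance-ids "$id" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)" || return 1
  echo "$id $ip"
}
```

The function prints the instance ID and public IP separated by a space; ssh in with “ssh -i ~/.ssh/<key_pair.pem> ubuntu@<IP>” as above.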
- Once you are done with ssh and ready to kill the instance, use the Actions button at the top of the Instances page and choose Terminate from the Instance State menu.
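The same termination can be done from a terminal with `aws ec2 terminate-instances`. In the sketch below (hypothetical helper name), the confirmation prompt is an added safeguard, not an AWS feature. As noted above, termination destroys all state on the instance, so have APT download your trained models first.

```shell
# Hypothetical helper: terminate an instance after an explicit confirmation,
# then wait until AWS reports it as terminated.
apt_terminate_instance() {
  local id="$1" answer
  printf 'Terminate %s? All data on it will be lost. [y/N] ' "$id"
  read -r answer
  case "$answer" in
    y|Y) ;;
    *) echo "aborted"; return 1 ;;
  esac
  aws ec2 terminate-instances --instance-ids "$id" >/dev/null &&
    aws ec2 wait instance-terminated --instance-ids "$id"
}
```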
Depending on your platform, APT also requires some utilities in order to communicate with EC2.
- Windows
  - ssh and scp. A version of these utilities is distributed with a typical Windows Git installation at c:\Program Files\Git\usr\bin\ssh.exe and ...\scp.exe. Currently these paths are hardcoded, so these utilities must live precisely in those locations!
    - (future) For ssh and scp, just go with what's on the PATH.
    - (future) maybe support rsync
  - The utility certUtil.exe is used to compute hashes over files. This should be present in standard Windows installs. To confirm, open a cmd window (terminal) in Windows and type “where certUtil”.
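If you use Git Bash on Windows, a check like the following (a hypothetical helper, not part of APT) confirms that ssh.exe and scp.exe are at the hardcoded location, assuming scp.exe sits next to ssh.exe. Git Bash mounts C:\ as /c, so the default directory below corresponds to c:\Program Files\Git\usr\bin. For certUtil, the “where certUtil” check in a cmd window is still the way to go.

```shell
# Hypothetical check for Git Bash on Windows: are ssh.exe and scp.exe where
# APT expects them? Returns non-zero if either is missing.
apt_check_windows_utils() {
  local dir="${1:-/c/Program Files/Git/usr/bin}"
  local missing=0 exe
  for exe in ssh.exe scp.exe; do
    if [ -f "$dir/$exe" ]; then
      echo "found $dir/$exe"
    else
      echo "MISSING $dir/$exe" >&2
      missing=1
    fi
  done
  return "$missing"
}
```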