2 changes: 1 addition & 1 deletion docs/ec2-scripts.md
@@ -156,6 +156,6 @@ If you have a patch or suggestion for one of these limitations, feel free to

# Accessing Data in S3

-Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. You will also need to set your Amazon security credentials, either by setting the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before your program or through `SparkContext.hadoopConfiguration`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
+Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. To provide AWS credentials for S3 access, launch the Spark cluster with the option `--copy-aws-credentials`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).

In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
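
For reference, a minimal PySpark sketch of the access pattern the revised paragraph describes; the bucket and path names below are hypothetical:

```python
from pyspark import SparkContext

# On a cluster launched with --copy-aws-credentials, the Hadoop
# configuration already carries the AWS keys, so an s3n:// path can be
# read directly. "my-bucket" and "logs/" are made-up names.
sc = SparkContext(appName="S3ReadExample")
records = sc.textFile("s3n://my-bucket/logs/")
print(records.count())
```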
2 changes: 2 additions & 0 deletions ec2/deploy.generic/root/spark-ec2/ec2-variables.sh
@@ -30,3 +30,5 @@ export HADOOP_MAJOR_VERSION="{{hadoop_major_version}}"
export SWAP_MB="{{swap}}"
export SPARK_WORKER_INSTANCES="{{spark_worker_instances}}"
export SPARK_MASTER_OPTS="{{spark_master_opts}}"
+export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"
+export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"
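
These `{{...}}` placeholders are filled in by `deploy_files` in `spark_ec2.py`. A standalone sketch, assuming simple string substitution of the template variables; the key values are fake illustrative placeholders, not real credentials:

```python
# Sketch of the {{var}} substitution applied to templates such as
# ec2-variables.sh. The values below are fake, for illustration only.
template = (
    'export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"\n'
    'export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"\n'
)

template_vars = {
    "aws_access_key_id": "AKIAEXAMPLE",        # fake value
    "aws_secret_access_key": "exampleSecret",  # fake value
}

for key, value in template_vars.items():
    template = template.replace("{{" + key + "}}", value)

print(template)
```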
10 changes: 10 additions & 0 deletions ec2/spark_ec2.py
@@ -158,6 +158,9 @@ def parse_args():
    parser.add_option(
        "--additional-security-group", type="string", default="",
        help="Additional security group to place the machines in")
+    parser.add_option(
+        "--copy-aws-credentials", action="store_true", default=False,
+        help="Add AWS credentials to hadoop configuration to allow Spark to access S3")

    (opts, args) = parser.parse_args()
    if len(args) != 2:
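
A small self-contained sketch of how the new flag behaves through `optparse` (the option definition is copied from the diff above; the parse calls are hypothetical invocations):

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to hadoop configuration to allow Spark to access S3")

# store_true means the flag takes no argument: present -> True, absent -> False.
(opts, _) = parser.parse_args(["--copy-aws-credentials"])
assert opts.copy_aws_credentials is True

(opts, _) = parser.parse_args([])
assert opts.copy_aws_credentials is False
```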
@@ -714,6 +717,13 @@ def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):
"spark_master_opts": opts.master_opts
}

+    if opts.copy_aws_credentials:
+        template_vars["aws_access_key_id"] = conn.aws_access_key_id
+        template_vars["aws_secret_access_key"] = conn.aws_secret_access_key
+    else:
+        template_vars["aws_access_key_id"] = ""
+        template_vars["aws_secret_access_key"] = ""
+
    # Create a temp directory in which we will place all the files to be
    # deployed after we substitute template parameters in them
    tmp_dir = tempfile.mkdtemp()
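
For context, `conn` here is the boto EC2 connection that `spark_ec2.py` uses to launch the cluster, so the new code simply reuses the credentials the connection was created with. A minimal sketch, assuming boto's connection API (the region name is an example):

```python
import boto.ec2

# The boto connection keeps the credentials it was created with, which is
# what the new code copies into template_vars. Region is an example value.
conn = boto.ec2.connect_to_region("us-east-1")
print(conn.aws_access_key_id)
print(conn.aws_secret_access_key)
```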