Working with Myria and S3
Myria supports data ingest through a variety of mechanisms, but the most flexible and scalable is to first ingest data into Amazon's S3 cloud storage service, which offers the scalability, availability, and broad accessibility needed to host datasets of practically any size. Read on to see how to use Myria and S3 together.
If you're already familiar with S3 and Myria, here's a short example of how to load an S3 dataset into the Myria system for further processing:
A = load(
"http://seaflow.s3.amazonaws.com/opp_vct_all.csv",
csv(schema(
cruise:string, file_time:string, particle:int, time:int,
pulse_width:int, D1:int, D2:int, fsc_small:int, fsc_perp:int,
fsc_big:int, pe:int, chl_small:int, chl_big:int, pop:string))
);
store(A, armbrustlab:seaflow:opp_vct_all, [cruise, file_time, particle]);
S3 (Simple Storage Service) is Amazon's Internet-based key-value store. You can think of a key-value store as a filesystem, where the key is the filename, and the value is the contents of the file. One difference is that S3 doesn't really have directories (although people often put slashes in their keys to simulate directories). Another is that you have to access S3 via its HTTP API, rather than normal filesystem commands (there are hacks to make it look like a local filesystem, but they're not recommended).
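For example, a publicly readable object can be fetched with any ordinary HTTP client. A quick sketch using curl and the object used later on this page (this assumes the object's permissions allow anonymous reads):
curl -O http://seaflow.s3.amazonaws.com/opp_vct_all.csv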
S3 is organized into buckets, which are namespaces with globally unique names, each owned by a particular Amazon account. Buckets can be mapped to Internet subdomains, allowing you to address an S3 object with an ordinary HTTP URL, like http://seaflow.s3.amazonaws.com/opp_vct_all.csv. A more common URL syntax, though it only works with S3-specific tools, is s3://<bucket-name>/<object-key>, such as s3://seaflow/opp_vct_all.csv. Although buckets must have globally unique names, each bucket is associated with a particular geographic region where its data is stored, which must be specified at bucket creation. In the UW's case, the best region to specify is us-west-2, which corresponds to Amazon's data centers in the Portland area.
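If you're not sure which region an existing bucket was created in, you can check with the AWS CLI (introduced below); a quick sketch, assuming the seaflow bucket:
aws s3api get-bucket-location --bucket seaflow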
Buckets can have various properties associated with them, such as object versioning (good insurance against accidental deletion or data corruption), automatic archival to Amazon's "cold storage" service Glacier, or access controls. You'll want to be sure to set a new bucket's access controls appropriately when it's created, unless you don't mind anyone being able to read and modify its contents. (Sophisticated access control requirements might require creating a bucket policy, which is a JSON document written in Amazon's proprietary access control language.)
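A couple of these settings can also be changed from the AWS CLI. This is a minimal sketch, assuming the seaflow bucket and credentials that are allowed to modify bucket settings; it turns on versioning and sets the bucket's canned ACL to private:
aws s3api put-bucket-versioning --bucket seaflow --versioning-configuration Status=Enabled
aws s3api put-bucket-acl --bucket seaflow --acl private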
The easiest way to create and view buckets is via the AWS Console: https://uwescience.signin.aws.amazon.com/console (this requires you to have user credentials under the uwescience AWS account). Once you've created your new bucket and set its access controls, you can start filling it with objects!
You can interact with your objects stored in S3 in several different ways: via the AWS Console, via a web browser extension (like S3Fox) or a standalone GUI (like Cyberduck or S3 Browser), via the AWS CLI, or via the AWS API libraries for various languages (such as boto for Python). In the examples below, I'll use the AWS CLI, since it's easy to use from the command line or in shell scripts.
Installation instructions are available from Amazon, but if you have a Mac (with Homebrew installed), it's as simple as brew install awscli. You'll need to run aws configure and enter your user credentials before you can proceed.
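To verify that your credentials work, you can list the buckets you have access to; you can also create a bucket from the CLI instead of the Console. A quick sketch (my-new-bucket is a placeholder; remember that bucket names are globally unique):
aws s3 ls
aws s3 mb s3://my-new-bucket --region us-west-2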
Say you have a file opp_vct_all.csv in your home directory, and you'd like to upload it to S3 (maybe for backup, or to share it with others, or to ingest it into Myria). Here's how you can do that using the AWS CLI (I'll assume throughout that you're using the seaflow bucket, owned by the uwescience account):
aws s3 cp ~/opp_vct_all.csv s3://seaflow/opp_vct_all.csv
That's it!
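If you have a whole directory of files to upload, the CLI can copy it recursively, or synchronize it so only new or changed files are transferred. A sketch, assuming a hypothetical local directory ~/seaflow-data:
aws s3 cp ~/seaflow-data s3://seaflow/seaflow-data/ --recursive
aws s3 sync ~/seaflow-data s3://seaflow/seaflow-data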
Downloading a file to your local filesystem is just as simple:
aws s3 cp s3://seaflow/opp_vct_all.csv ~/opp_vct_all.csv
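If you don't remember an object's exact key, you can list the bucket's contents first (again assuming the seaflow bucket):
aws s3 ls s3://seaflow/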
Since S3 supports addressing objects as ordinary HTTP URLs, and since Myria supports loading files from HTTP, ingesting a file in S3 into Myria is simple:
A = load(
"http://seaflow.s3.amazonaws.com/opp_vct_all.csv",
csv(schema(
cruise:string, file_time:string, particle:int, time:int,
pulse_width:int, D1:int, D2:int, fsc_small:int, fsc_perp:int,
fsc_big:int, pe:int, chl_small:int, chl_big:int, pop:string))
);
store(A, armbrustlab:seaflow:opp_vct_all, [cruise, file_time, particle]);
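Note that for Myria to fetch the object over a plain HTTP URL like this, the object generally needs to be publicly readable. One way to arrange that from the AWS CLI is a canned public-read ACL; a sketch, assuming the bucket's settings allow public ACLs:
aws s3api put-object-acl --bucket seaflow --key opp_vct_all.csv --acl public-read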
If you also want to work with the file locally (outside Myria), and you're working on a cluster with the Hadoop Distributed File System (HDFS) installed, I would recommend importing it into HDFS first:
hadoop fs -put ~/opp_vct_all.csv /datasets/seaflow/
This spreads the file across the cluster, so you don't exhaust your disk quota on any one machine, and HDFS replicates it, so you can recover the file if a machine fails. It also makes ingesting large files into Myria more reliable, since the file is downloaded over the local network rather than the Internet. The query just needs to be modified to use an HDFS URL instead of an HTTP URL:
A = load(
"hdfs://vega.cs.washington.edu:8020/datasets/seaflow/opp_vct_all.csv",
csv(schema(
cruise:string, file_time:string, particle:int, time:int,
pulse_width:int, D1:int, D2:int, fsc_small:int, fsc_perp:int,
fsc_big:int, pe:int, chl_small:int, chl_big:int, pop:string))
);
store(A, armbrustlab:seaflow:opp_vct_all, [cruise, file_time, particle]);
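If the HDFS-based ingest fails, a quick sanity check is to confirm the file is actually visible at the expected HDFS path (the same path as above):
hadoop fs -ls /datasets/seaflow/opp_vct_all.csv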