
# Lab 5 - Converting to Columnar Formats

During this lab you will learn how to convert row-based file formats like CSV and JSON into columnar formats such as ORC and Parquet in real time, as the data is fed through a Kinesis stream.
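To illustrate the core idea before streaming is involved, here is a minimal batch sketch in PySpark (paths are placeholders, not part of the lab files):

```python
from pyspark.sql import SparkSession

# Minimal batch sketch: read a row-based file (CSV) and rewrite it
# in a columnar format (Parquet). In this lab the same idea is applied
# to records arriving through a Kinesis stream instead of a static file.
spark = SparkSession.builder.appName("row-to-columnar").getOrCreate()

df = spark.read.option("header", "true").csv("s3://<your-bucket>/input/data.csv")
df.write.mode("overwrite").parquet("s3://<your-bucket>/output/")
```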

**Please make sure your AWS Management Console is set to the US East (N. Virginia) region.**


## Create a Kinesis Stream

- Open the Kinesis console at this link.
- For Stream Name, use your username.
- For Number of shards, enter 1.
- Click Create Stream.
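If you prefer to script this step instead of using the console, an equivalent call through boto3 looks roughly like this (a sketch; it assumes your credentials are configured, and the stream name is a placeholder):

```python
import boto3

# Create the same stream programmatically and wait until it is ACTIVE.
kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.create_stream(StreamName="<your-username>", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="<your-username>")
```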

## Connect to EMR

- You can reuse the EMR cluster created in Lab 4.
- Download the file `job.py` to your local machine.
- Download the file `generate.py` to your local machine.

- Edit `job.py`: set `targetPath` (line 12) to the path of your S3 bucket.
- Edit `job.py`: set `kinesisStreamName` (line 13) to the name of the Kinesis stream you created.
- Edit `generate.py`: set `kinesisStreamName` (line 14) to the name of the Kinesis stream you created.
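The lab files are not reproduced here, but the edits above are plain variable assignments, so they presumably look something like the following (a hypothetical sketch; both values are placeholders you must replace):

```python
# Hypothetical sketch of the edits above -- the actual files may differ.
# job.py, lines 12-13
targetPath = "s3://<your-bucket>/lab5/output/"   # your S3 bucket path
kinesisStreamName = "<your-username>"            # the stream you created

# generate.py, line 14
kinesisStreamName = "<your-username>"            # the same stream name
```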

- In a bash terminal, run the following commands. Replace `<PEM FILE>` and `<EMR MASTER>` with the same values you used when you connected via SSH:

```bash
scp -i <PEM FILE> job.py hadoop@<EMR MASTER>:~/job.py
scp -i <PEM FILE> generate.py hadoop@<EMR MASTER>:~/generate.py
```
- Connect to the EMR master over SSH as you did in Lab 4.

- Set AWS credentials for the Kinesis API. Replace `<your-access-key>` and `<your-secret-key>` with your own values:

```bash
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_KEY=<your-secret-key>
```
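Before launching the Spark jobs, you can optionally sanity-check the stream and your credentials from a Python shell on the master, for example with boto3 (a sketch; it assumes boto3 is available and picks up your credentials, and the stream name is a placeholder):

```python
import boto3

# Optional check: if the credentials are valid and the stream exists,
# this prints the stream status (it should be ACTIVE).
kinesis = boto3.client("kinesis", region_name="us-east-1")
description = kinesis.describe_stream(StreamName="<your-username>")
print(description["StreamDescription"]["StreamStatus"])
```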
- Run the job that pushes data into the Kinesis stream:

```bash
spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 generate.py
```
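`generate.py` itself is not reproduced here, but a Kinesis producer along these lines gives an idea of what such a job does (a hypothetical sketch using boto3; the record shape and stream name are assumptions):

```python
import json
import time

import boto3

# Hypothetical producer: push small JSON records into the stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(1000):
    record = {"id": i, "ts": time.time()}
    kinesis.put_record(
        StreamName="<your-username>",   # the stream you created
        Data=json.dumps(record),        # row-based payload (JSON)
        PartitionKey=str(i),            # spreads records across shards
    )
```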
- Run the job that accepts the Kinesis stream messages and saves them as Parquet files in your S3 bucket:

```bash
spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 job.py
```
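For orientation, the consuming side of such a job typically pairs Spark Streaming's Kinesis receiver with a Parquet writer. A minimal sketch, assuming Spark 2.1 with the kinesis-asl package loaded (the actual `job.py` may differ; names and paths are placeholders):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import InitialPositionInStream, KinesisUtils

# Hypothetical consumer: read JSON records from Kinesis and append
# them to S3 as Parquet in 10-second micro-batches.
sc = SparkContext(appName="kinesis-to-parquet")
ssc = StreamingContext(sc, 10)
spark = SparkSession(sc)

stream = KinesisUtils.createStream(
    ssc, "lab5-app", "<your-username>",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, 10)

def save(rdd):
    # Each record arrives as a UTF-8 JSON string.
    if not rdd.isEmpty():
        spark.read.json(rdd).write.mode("append") \
            .parquet("s3://<your-bucket>/lab5/output/")

stream.foreachRDD(save)
ssc.start()
ssc.awaitTermination()
```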
- Wait until the job kicks in, then watch your new Parquet files as they are written to S3.
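Once files start appearing, you can read them back from a pyspark shell to confirm the schema and contents (a sketch; the path is a placeholder matching your `targetPath`):

```python
# In a pyspark shell, `spark` is already defined.
df = spark.read.parquet("s3://<your-bucket>/lab5/output/")
df.printSchema()
df.show(5)
```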