#Lab 2 - Interacting with AWS Athena#

During this lab, you will create a new database using the AWS Management Console, create a new table and run your first query. Then you will connect to your table using MySQLWorkbench as well as visualize the data using AWS QuickSight

^^^Please make sure your AWS Management Console is set on US-East Region (N.Virginia)^^^

Examine S3 bucket content

Open AWS Management Console Console and navigate to S3
Locate the nyc-yellow-trips bucket and navigate to csv folder. Note the amount of files and the format by downloading sample file

Create a Database and the Table

Open AWS Management Console Console
First, you will be creating a database. In the query edit box, type the following query and click Run Query
Remember to replace <USER_NAME> with your AWS username. (e.g. shaharf)

CREATE DATABASE IF NOT EXISTS <USER_NAME>

Now, you will create an external table based on CSV files by copying the following DDL query into the query window and click Run Query

CREATE EXTERNAL TABLE IF NOT EXISTS <USER_NAME>.yellow_trips_csv(
        pickup_timestamp BIGINT,
        dropoff_timestamp BIGINT,
        vendor_id STRING,
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        pickup_longitude FLOAT,
        pickup_latitude FLOAT,
        dropoff_longitude FLOAT,
        dropoff_latitude FLOAT,
        rate_code STRING,
        passenger_count INT,
        trip_distance FLOAT,
        payment_type STRING,
        fare_amount FLOAT,
        extra FLOAT,
        mta_tax FLOAT,
        imp_surcharge FLOAT,
        tip_amount FLOAT,
        tolls_amount FLOAT,
        total_amount FLOAT,
        store_and_fwd_flag STRING)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ',' 
        LOCATION 's3://nyc-yellow-trips/csv/'

Navigate to Athena Catalog Manager and explore the database and table you've created
Finally, you are ready to run your first Athena query! Paste the following query into your query box, click Format Query and then Click Run Query. Note the runtime of the query and amount of data scanned.

SELECT from_unixtime(yellow_trips_csv.pickup_timestamp) as pickup_date, from_unixtime(yellow_trips_csv.dropoff_timestamp) as dropoff_date ,* FROM <USER_NAME>.yellow_trips_csv limit 100

Interact with AWS Athena using JDBC Driver

Make sure you have AWS CLI tool is installed - if not please install using the instructions here
Make sure you have SQL Workbench installed if not please install using the instructions here
Download the JDBC driver into the SQL Workbench directory from this link
Start SQL Workbench
Click Manage Drivers, create a new driver with the following settings
Name: Athena
Library: Select Athena JDBC Driver JAR (downloaded in the previous step)
Classname: com.amazonaws.athena.jdbc.AthenaDriver
Sample URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443
Press OK
Create a new connection with the following settings
Name: Athena
Driver: Athena
URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443
Username: Your Access Key (it is provided in the email we sent you earlier with your IAM user details)
Password: Your Secret Key (it is provided in the email we sent you earlier with your IAM user details)
Extended Properties:
Key: s3_staging_dir
Value: s3://your-full-name-jdbc-staging/
Click OK
Run the following query in the SQL Workbench query window

SELECT from_unixtime(yellow_trips_csv.pickup_timestamp) as pickup_date, 
       from_unixtime(yellow_trips_csv.dropoff_timestamp) as dropoff_date , * 
FROM <USER_NAME>.yellow_trips_csv limit 100

Interact with AWS Athena using AWS QuickSight

Open QuickSight from here
Click Manage Data
Click New Data Set
Select Athena as your data source
Enter the name your-name-yellow-trips-csv and click Create Data Source
Select the database with your username , the table yellow_trips_csv and then click Edit/Preview Data
Open the 'Tables' panel and click 'Switch to custom SQL tool'
Enter any name for the Query and copy the following SQL into the 'Custom SQL' text box

SELECT *, 
    greatest(0,trip_distance*3600.0/(dropoff_timestamp-pickup_timestamp)) as avg_speed,
    hour(from_unixtime(pickup_timestamp)) as hour
FROM srfrnk.yellow_trips_parquet
WHERE pickup_timestamp is not null and pickup_timestamp is not null and pickup_timestamp<>dropoff_timestamp

Click 'Finish'
Click Save & Visualize
Select Vertical Bar Chart from the Visual Types Panel
Drag hour into ‘X axis’ box
Drag avg_speed into ‘Value’ box, then click the down arrow of the box and select Average under the Aggregate

Interact with Athena from a NodeJS application using JDBC

Open AWS console at https://console.aws.amazon.com/athena/
Enter the following DDL query into the query window
Remember to replace <USER_NAME> with your AWS username. (e.g. shaharf)

CREATE EXTERNAL TABLE IF NOT EXISTS <USER_NAME>.yellow_trips_parquet(
         pickup_timestamp BIGINT,
         dropoff_timestamp BIGINT,
         vendor_id STRING,
         pickup_datetime TIMESTAMP,
         dropoff_datetime TIMESTAMP,
         pickup_longitude FLOAT,
         pickup_latitude FLOAT,
         dropoff_longitude FLOAT,
         dropoff_latitude FLOAT,
         rate_code STRING,
         passenger_count INT,
         trip_distance FLOAT,
         payment_type STRING,
         fare_amount FLOAT,
         extra FLOAT,
         mta_tax FLOAT,
         imp_surcharge FLOAT,
         tip_amount FLOAT,
         tolls_amount FLOAT,
         total_amount FLOAT,
         store_and_fwd_flag STRING
) STORED AS PARQUET LOCATION 's3://nyc-yellow-trips/parquet/';

Make sure you have NodeJS and NPM installed on your machine. If not install from here.
Clone the workshop git repository into your machine and install:

git clone https://github.com/doitintl/athena-workshop.git
cd athena-workshop/jdbc
npm i

Run the application to query Athena from NodeJS and get the daily total fare paid:

npm start

After about 5 seconds the application should terminate after having printed output such as this:

Results: [{"vendor":"VTS","total":66.75000190734863},{"vendor":"CMT","total":14.200000286102295},{"vendor":"DDS","total":6.300000190734863}]

You have successfully completed Lab 2 ;-)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lab2.md

lab2.md

Files

lab2.md

Latest commit

History

lab2.md

File metadata and controls