#Lab 2 - Interacting with AWS Athena#
During this lab, you will create a new database using the AWS Management Console, create a new table and run your first query. Then you will connect to your table using MySQLWorkbench as well as visualize the data using AWS QuickSight
^^^Please make sure your AWS Management Console is set on US-East Region (N.Virginia)^^^
Examine S3 bucket content
-
Open AWS Management Console Console and navigate to S3
-
Locate the
nyc-yellow-trips
bucket and navigate tocsv
folder. Note the amount of files and the format by downloading sample file
Create a Database and the Table
-
First, you will be creating a database. In the query edit box, type the following query and click
Run Query
-
Remember to replace
<USER_NAME>
with your AWS username. (e.g. shaharf)
CREATE DATABASE IF NOT EXISTS <USER_NAME>
- Now, you will create an external table based on CSV files by copying the following DDL query into the query window and click
Run Query
CREATE EXTERNAL TABLE IF NOT EXISTS <USER_NAME>.yellow_trips_csv(
pickup_timestamp BIGINT,
dropoff_timestamp BIGINT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
pickup_longitude FLOAT,
pickup_latitude FLOAT,
dropoff_longitude FLOAT,
dropoff_latitude FLOAT,
rate_code STRING,
passenger_count INT,
trip_distance FLOAT,
payment_type STRING,
fare_amount FLOAT,
extra FLOAT,
mta_tax FLOAT,
imp_surcharge FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
total_amount FLOAT,
store_and_fwd_flag STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://nyc-yellow-trips/csv/'
-
Navigate to Athena Catalog Manager and explore the database and table you've created
-
Finally, you are ready to run your first Athena query! Paste the following query into your query box, click
Format Query
and thenClick Run Query
. Note the runtime of the query and amount of data scanned.
SELECT from_unixtime(yellow_trips_csv.pickup_timestamp) as pickup_date, from_unixtime(yellow_trips_csv.dropoff_timestamp) as dropoff_date ,* FROM <USER_NAME>.yellow_trips_csv limit 100
Interact with AWS Athena using JDBC Driver
-
Make sure you have AWS CLI tool is installed - if not please install using the instructions here
-
Make sure you have SQL Workbench installed if not please install using the instructions here
-
Download the JDBC driver into the SQL Workbench directory from this link
-
Start SQL Workbench
-
Click
Manage Drivers
, create a new driver with the following settings -
Name: Athena
-
Library: Select Athena JDBC Driver JAR (downloaded in the previous step)
-
Classname: com.amazonaws.athena.jdbc.AthenaDriver
-
Sample URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443
-
Press OK
-
Create a new connection with the following settings
-
Name: Athena
-
Driver: Athena
-
URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443
-
Username:
Your Access Key
(it is provided in the email we sent you earlier with your IAM user details) -
Password:
Your Secret Key
(it is provided in the email we sent you earlier with your IAM user details) -
Extended Properties:
-
Key: s3_staging_dir
-
Value:
s3://your-full-name-jdbc-staging/
-
Click OK
-
Run the following query in the SQL Workbench query window
SELECT from_unixtime(yellow_trips_csv.pickup_timestamp) as pickup_date,
from_unixtime(yellow_trips_csv.dropoff_timestamp) as dropoff_date , *
FROM <USER_NAME>.yellow_trips_csv limit 100
Interact with AWS Athena using AWS QuickSight
-
Open QuickSight from here
-
Click Manage Data
-
Click New Data Set
-
Select Athena as your data source
-
Enter the name
your-name-yellow-trips-csv
and clickCreate Data Source
-
Select the database with your username , the table
yellow_trips_csv
and then clickEdit/Preview Data
-
Open the 'Tables' panel and click 'Switch to custom SQL tool'
-
Enter any name for the Query and copy the following SQL into the 'Custom SQL' text box
SELECT *,
greatest(0,trip_distance*3600.0/(dropoff_timestamp-pickup_timestamp)) as avg_speed,
hour(from_unixtime(pickup_timestamp)) as hour
FROM srfrnk.yellow_trips_parquet
WHERE pickup_timestamp is not null and pickup_timestamp is not null and pickup_timestamp<>dropoff_timestamp
- Click 'Finish'
- Click
Save & Visualize
- Select
Vertical Bar Chart
from theVisual Types
Panel - Drag
hour
into ‘X axis’ box - Drag
avg_speed
into ‘Value’ box, then click the down arrow of the box and selectAverage
under theAggregate
Interact with Athena from a NodeJS application using JDBC
-
Open AWS console at https://console.aws.amazon.com/athena/
-
Enter the following DDL query into the query window
-
Remember to replace
<USER_NAME>
with your AWS username. (e.g. shaharf)
CREATE EXTERNAL TABLE IF NOT EXISTS <USER_NAME>.yellow_trips_parquet(
pickup_timestamp BIGINT,
dropoff_timestamp BIGINT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
pickup_longitude FLOAT,
pickup_latitude FLOAT,
dropoff_longitude FLOAT,
dropoff_latitude FLOAT,
rate_code STRING,
passenger_count INT,
trip_distance FLOAT,
payment_type STRING,
fare_amount FLOAT,
extra FLOAT,
mta_tax FLOAT,
imp_surcharge FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
total_amount FLOAT,
store_and_fwd_flag STRING
) STORED AS PARQUET LOCATION 's3://nyc-yellow-trips/parquet/';
-
Make sure you have NodeJS and NPM installed on your machine. If not install from here.
-
Clone the workshop git repository into your machine and install:
git clone https://github.com/doitintl/athena-workshop.git
cd athena-workshop/jdbc
npm i
- Run the application to query Athena from NodeJS and get the daily total fare paid:
npm start
- After about 5 seconds the application should terminate after having printed output such as this:
Results: [{"vendor":"VTS","total":66.75000190734863},{"vendor":"CMT","total":14.200000286102295},{"vendor":"DDS","total":6.300000190734863}]
You have successfully completed Lab 2 ;-)