This repository has been archived by the owner on Jun 23, 2021. It is now read-only.

Analytics

Nestor Carvantes edited this page Oct 12, 2019 · 8 revisions

The analytics component forwards the data from the backend DynamoDB stream to S3 so that it can be analyzed with Athena. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Data storage

Data is stored in S3. Object keys look like the following:

Applications/year=2019/month=10/day=11/hour=16/realworld-serverless-application-analytic-Firehose-12J7YC29T8FAY-1-2019-10-11-16-58-58-c0068baf-ab5b-4a61-a9a7-1e100983c696.parquet

Two optimizations in the way the data is stored are worth noting:

  1. Data is partitioned by time. Prefixes in the year=2019 style partition the data by year, month, day, and hour. Partitioning reduces the amount of data that Athena has to scan when a query filters on those values, which reduces the cost.

  2. Data is stored in .parquet files. Parquet is a columnar storage format from the Apache Hadoop ecosystem. It compresses data efficiently and lets a query read only the columns it needs. Parquet files are typically much smaller than the equivalent JSON text files, so using Parquet reduces both storage and query costs.
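To make the partition layout concrete, here is a minimal Python sketch that extracts the partition values from an object key such as the one shown above. The function name parse_partitions is hypothetical and is not part of this project. It assumes only that partitions appear as name=value path segments.

```python
import re
from typing import Dict

# Matches Hive-style "name=value" path segments such as "year=2019".
_PARTITION_SEGMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.+)$")


def parse_partitions(key: str) -> Dict[str, str]:
    """Return the partition names and values found in an S3 object key.

    Hypothetical helper for illustration; values are kept as strings
    exactly as they appear in the key.
    """
    partitions: Dict[str, str] = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        match = _PARTITION_SEGMENT.match(segment)
        if match:
            partitions[match.group(1)] = match.group(2)
    return partitions
```

Each name=value pair found this way corresponds to one partition column that Athena can filter on.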

The Analytics component sets up a Firehose delivery stream configured to output the files in the format described above.

Using Athena to analyze the data

To get started, navigate to the Athena console, and select the realworld_serverless_application_analytics_* database from the list.

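Run the following query first to load new data partitions. Repeat it whenever newer data needs to be queried, because partitions written after the last run are not visible to Athena until they are registered: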

MSCK REPAIR TABLE applications;

Run a sample query:

SELECT detail.eventname,
       detail.dynamodb.keys.applicationid.s AS applicationid,
       detail.dynamodb.keys.userid.s AS userid,
       detail.dynamodb.newimage.author.s AS author,
       detail.dynamodb.newimage.description.s AS description
FROM applications;
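The sample query scans every partition. To benefit from the time partitioning described earlier, a query can filter on the partition columns. The Python sketch below builds such a query for a single day. It rests on assumptions that are not confirmed by this page: that year, month, and day are exposed as string partition columns on the applications table, and that the values passed in are written exactly as they appear in the S3 prefixes. The function name build_daily_query is hypothetical.

```python
def build_daily_query(year: str, month: str, day: str) -> str:
    """Return an Athena query restricted to one day's partitions.

    Hypothetical helper for illustration. Assumes year, month and day
    are string partition columns; pass the values exactly as they
    appear in the S3 prefixes (for example "2019", "10", "11").
    """
    for value in (year, month, day):
        # Accept digits only so that arbitrary text cannot end up in the SQL.
        if not value.isdigit():
            raise ValueError(f"partition value must be numeric: {value!r}")
    return (
        "SELECT detail.eventname,\n"
        "       detail.dynamodb.keys.applicationid.s AS applicationid\n"
        "FROM applications\n"
        f"WHERE year = '{year}'\n"
        f"  AND month = '{month}'\n"
        f"  AND day = '{day}';"
    )
```

Calling build_daily_query("2019", "10", "11") returns a query string that can be pasted into the Athena console. If the partition columns turn out to be numeric rather than strings, the quotes around the values would need to be removed.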