- Lotame Data Stream files: downloaded, uncompressed, and converted into csv format
- Access to the Amazon AWS CLI, Amazon S3, and Amazon Redshift
- A burning desire for deeper analytical understandings of your data
- A to read this data, and do some really basic data analysis.
NOTE: If this is the first recipe you're following make sure you have completed the set up instructions found in the lib directory read me. You need to add your Lotame credentials to the lotame.properties file and put that file in the root of your home directory, ~/. You only need to do this step once in order to run all these recipes.
If you get an error stating: No section: 'default'
then you probably didn't take care of the lotame.properties file first.
First, if you've got uncompressed Lotame Data Stream files in csv format on a local machine or stored somewhere that is not Amazon S3, here is how to rectify that situation.
aws cp --recursive /directory/containing/my/uncompressed/data/stream/csv/files s3://my-bucket-name/and/where/i/want/my/data/stream/files/to/be/stored
There, now your uncompressed Lotame Data Stream files are safe, sound, and secure in an Amazon S3 bucket, and ready for some in depth analysis.
There are a few ways you can import your csv files into a Redshift database, but running this command on your Redshift cluster is the easiest way to go by far.
COPY lotame.data_stream (id varchar(50),type carchar(10),region varchar(2),country varchar(10),client_id smallint,behavior_id smallint,timestamp bigint,action varchar(1))
FROM 's3://my-bucket-name/and/where/i/want/my/data/stream/files/to/be/stored/'
CREDENTIALS 'aws_access_key_id=mYACcessKeY;aws_secret_access_key=MYSEcrEtkeY'
Note: Redshift doesn't like csv headers, so you'll want to remove the first line from those csv files if going this route. If this is a typical route that you tend to go, we recommend modifying the
JsontoCsvConverter.py
script to not write the csv header for each file.
Hint: To remove the first line from a file on a Unix system, try
tail -n +2 filename.csv > filename2.csv