The FDA provides a wealth of large data sets via the OpenFDA website and APIs. We will leverage one of these data sets in particular: the Drug Labels data set.
The FDA provides access to all of the OpenFDA data via a public S3 bucket. We can pull this data in directly using the following command.

Linux/macOS:

```
aws s3 sync s3://download.open.fda.gov/2019-08-10/drug/label s3://$STACK_NAME-raw/fda-source/2019-08-10/drug/label
```

Windows:

```
aws s3 sync s3://download.open.fda.gov/2019-08-10/drug/label s3://%STACK_NAME%-raw/fda-source/2019-08-10/drug/label
```
Note: The above command assumes that you have the `STACK_NAME` variable set in your terminal from our setup steps.
Objectives for this section:
- Take the raw `zip` "parts" as input, decompress the files, and save them as `gzip` compressed `json` files - this puts them in a format we can interact with using Spark
- Read the compressed `json` files as a DataFrame
- Select the property we want to extract from the large object
- `explode` the list of results, creating a new structure where each drug is now a row (see the toy example after this list)
- Land the new, optimized data set as `gzip` compressed `json` files
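If `explode` is unfamiliar, the toy example below shows the transformation the objectives describe: a single row holding a list of results becomes one row per element. The data and column names here are invented purely for illustration.

```
# Toy illustration of explode: one row holding a list becomes one row per element.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# A single row whose "results" column is a list of three made-up drug names.
toy = spark.createDataFrame([(["drug_a", "drug_b", "drug_c"],)], ["results"])

# After explode, drug_a, drug_b, and drug_c each occupy their own row.
toy.select(explode("results").alias("drug")).show()
```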
Steps:
- Connect to your EMR Cluster (as described in 02_EMR_Cluster)
- Run `pyspark --driver-memory 10G --executor-memory 10G --executor-cores 1`
- Open `fda.01.land.py` in an editor (a rough sketch of what this script does follows this list)
- Update the values for `BUCKET_RAW` and `BUCKET_LANDING` with the appropriate values
- Copy the code and paste it into the `pyspark` shell
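For orientation, here is a minimal sketch of the kind of work the landing step performs - it is not the contents of `fda.01.land.py`, and the bucket names are placeholders. It pulls each raw zip "part", inflates it, and re-saves the JSON payload gzip-compressed so Spark can read it natively. The real script may distribute this work with Spark rather than looping on the driver as shown here.

```
# Hypothetical sketch of the landing step (placeholders only, not fda.01.land.py).
import gzip
import io
import zipfile

import boto3

BUCKET_RAW = "my-stack-raw"          # placeholder: your raw bucket
BUCKET_LANDING = "my-stack-landing"  # placeholder: your landing bucket
PREFIX = "fda-source/2019-08-10/drug/label/"

s3 = boto3.client("s3")

# Walk the raw zip "parts" synced from the OpenFDA bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET_RAW, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".zip"):
            continue
        body = s3.get_object(Bucket=BUCKET_RAW, Key=key)["Body"].read()
        # Each part zip typically holds a single large JSON document.
        with zipfile.ZipFile(io.BytesIO(body)) as zf:
            payload = zf.read(zf.namelist()[0])
        # Re-save the payload gzip-compressed, a format Spark reads directly.
        s3.put_object(
            Bucket=BUCKET_LANDING,
            Key=key.replace(".zip", ".gz"),
            Body=gzip.compress(payload),
        )
```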
Prerequisite: The output from the Land Data step should be in your "landing" bucket, ready for curation.

Steps:
- Connect to your EMR Cluster (as described in 02_EMR_Cluster)
- Run `pyspark`
- Open `fda.02.curate.py` in an editor (a rough sketch of this step follows this list)
- Update the values for `BUCKET_LANDING` and `BUCKET_CURATED` with the appropriate values
- Copy the code and paste it into the `pyspark` shell
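Again for orientation only, the sketch below shows the general shape of the curation pass rather than the exact contents of `fda.02.curate.py`; the bucket names and output prefix are placeholders. It reads the landed gzip `json` documents, selects the `results` array, explodes it so each drug label becomes a row, and writes the optimized data set back out gzip-compressed.

```
# Hypothetical sketch of the curation step (placeholders only, not fda.02.curate.py).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

BUCKET_LANDING = "my-stack-landing"  # placeholder: your landing bucket
BUCKET_CURATED = "my-stack-curated"  # placeholder: your curated bucket

# The landed files are whole JSON documents, so read them in multiLine mode.
raw = spark.read.option("multiLine", True).json(
    "s3://{}/fda-source/2019-08-10/drug/label/".format(BUCKET_LANDING)
)

# Each document carries a "results" array; explode it so one row = one drug label,
# then flatten the struct so its fields become top-level columns.
labels = raw.select(explode("results").alias("label")).select("label.*")

# Land the optimized data set as gzip-compressed JSON.
labels.write.mode("overwrite").json(
    "s3://{}/fda/drug/label/".format(BUCKET_CURATED),
    compression="gzip",
)
```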