The FDA provides a wealth of large data sets via the OpenFDA website and APIs. We will leverage one of these data sets in particular: the Drug Labels data set.
The FDA provides access to all of the OpenFDA data via a public S3 bucket. We can pull this data in directly using the following command.

Linux/macOS:

```
aws s3 sync s3://download.open.fda.gov/2019-08-10/drug/label s3://$STACK_NAME-raw/fda-source/2019-08-10/drug/label
```

Windows:

```
aws s3 sync s3://download.open.fda.gov/2019-08-10/drug/label s3://%STACK_NAME%-raw/fda-source/2019-08-10/drug/label
```
Note: The above command assumes that you have the `STACK_NAME` variable set in your terminal from our setup steps.
Objectives for this section:
- Take the raw `zip` "parts" as input, decompress the files, and save them as `gzip` compressed `json` files - this puts them in a format we can interact with using Spark
- Read the compressed `json` files as a DataFrame
- Select the property we want to extract from the large object
- `explode` the list of results, creating a new structure where each drug is now a row (see the toy example after this list)
- Land the new, optimized data set as `gzip` compressed `json` files
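If `explode` is unfamiliar, the toy example below shows the transformation the objectives describe: a single row holding a list of results becomes one row per element. The data and column names here are invented purely for illustration.

```
# Toy illustration of explode: one row holding a list becomes one row per element.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# A single row whose "results" column is a list of three made-up drug names.
toy = spark.createDataFrame([(["drug_a", "drug_b", "drug_c"],)], ["results"])

# After explode, drug_a, drug_b, and drug_c each occupy their own row.
toy.select(explode("results").alias("drug")).show()
```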
Steps:
- Connect to your EMR Cluster (as described in 02_EMR_Cluster)
- Run `pyspark --driver-memory 10G --executor-memory 10G --executor-cores 1`
- Open `fda.01.land.py` in an editor (a rough sketch of what this script does follows this list)
- Update the values for `BUCKET_RAW` and `BUCKET_LANDING` with the appropriate values
- Copy the code and paste it into the `pyspark` shell
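For orientation, here is a minimal sketch of the kind of work the landing step performs - it is not the contents of `fda.01.land.py`, and the bucket names are placeholders. It pulls each raw zip "part", inflates it, and re-saves the JSON payload gzip-compressed so Spark can read it natively. The real script may distribute this work with Spark rather than looping on the driver as shown here.

```
# Hypothetical sketch of the landing step (placeholders only, not fda.01.land.py).
import gzip
import io
import zipfile

import boto3

BUCKET_RAW = "my-stack-raw"          # placeholder: your raw bucket
BUCKET_LANDING = "my-stack-landing"  # placeholder: your landing bucket
PREFIX = "fda-source/2019-08-10/drug/label/"

s3 = boto3.client("s3")

# Walk the raw zip "parts" synced from the OpenFDA bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET_RAW, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".zip"):
            continue
        body = s3.get_object(Bucket=BUCKET_RAW, Key=key)["Body"].read()
        # Each part zip typically holds a single large JSON document.
        with zipfile.ZipFile(io.BytesIO(body)) as zf:
            payload = zf.read(zf.namelist()[0])
        # Re-save the payload gzip-compressed, a format Spark reads directly.
        s3.put_object(
            Bucket=BUCKET_LANDING,
            Key=key.replace(".zip", ".gz"),
            Body=gzip.compress(payload),
        )
```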
Prerequisite: The output from the Land Data step should be in your "landing" bucket, ready for curation.

Steps:
- Connect to your EMR Cluster (as described in 02_EMR_Cluster)
- Run `pyspark`
- Open `fda.02.curate.py` in an editor (a rough sketch of this step follows this list)
- Update the values for `BUCKET_LANDING` and `BUCKET_CURATED` with the appropriate values
- Copy the code and paste it into the `pyspark` shell
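Again for orientation only, the sketch below shows the general shape of the curation pass rather than the exact contents of `fda.02.curate.py`; the bucket names and output prefix are placeholders. It reads the landed gzip `json` documents, selects the `results` array, explodes it so each drug label becomes a row, and writes the optimized data set back out gzip-compressed.

```
# Hypothetical sketch of the curation step (placeholders only, not fda.02.curate.py).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

BUCKET_LANDING = "my-stack-landing"  # placeholder: your landing bucket
BUCKET_CURATED = "my-stack-curated"  # placeholder: your curated bucket

# The landed files are whole JSON documents, so read them in multiLine mode.
raw = spark.read.option("multiLine", True).json(
    "s3://{}/fda-source/2019-08-10/drug/label/".format(BUCKET_LANDING)
)

# Each document carries a "results" array; explode it so one row = one drug label,
# then flatten the struct so its fields become top-level columns.
labels = raw.select(explode("results").alias("label")).select("label.*")

# Land the optimized data set as gzip-compressed JSON.
labels.write.mode("overwrite").json(
    "s3://{}/fda/drug/label/".format(BUCKET_CURATED),
    compression="gzip",
)
```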