Changes in response to comments on ETL-34 #4

Merged
philerooski merged 3 commits into Sage-Bionetworks:main from phil-etl-34-comments on Sep 17, 2021
Conversation

philerooski
Contributor

New pipeline behavior:

  • New archives are broken up into their JSON datasets as soon as they come in.
  • New files in each JSON dataset are processed on a schedule (roughly every 10 minutes, TBD). This is a tradeoff between adding new data to the corresponding Parquet datasets as quickly as possible and using the Spark cluster efficiently -- e.g., not spinning up a Spark cluster for each JSON dataset (see "organizational" below) just to process data from a single archive.
  • Crawlers run once a day and only log schema changes; they do not modify schemas. Their main purpose is to add table metadata to newly created partitions (see the boto3 sketch after this list).
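
For concreteness, here is a minimal boto3 sketch of the scheduling and crawler behavior described above. The trigger, workflow, crawler, role, database, and S3 path names are all hypothetical, and the 10-minute cadence is still TBD:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical scheduled trigger: start the study's json_to_parquet
# workflow roughly every 10 minutes (cadence still TBD). In practice
# the workflow fires one job per JSON dataset; a single action is
# shown here for brevity.
glue.create_trigger(
    Name="example-study-json-to-parquet-schedule",  # hypothetical
    WorkflowName="example-study-json-to-parquet",   # hypothetical
    Type="SCHEDULED",
    Schedule="cron(0/10 * * * ? *)",  # every 10 minutes
    StartOnCreation=True,
    Actions=[{"JobName": "json_to_parquet"}],  # hypothetical job name
)

# Hypothetical daily crawler: the LOG-only schema change policy means
# it records schema changes without modifying the table schema, while
# new partitions inherit their metadata from the existing table.
glue.create_crawler(
    Name="example-study-standard-json-crawler",  # hypothetical
    Role="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    DatabaseName="example_study_db",  # hypothetical per-study database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/example-study/json/"}]},
    Schedule="cron(0 0 * * ? *)",  # once a day
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)
```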

New organizational assumptions (how we organize tables, jobs, workflows):

  • Each study will have its Glue tables in a separate Glue database.
  • Each study will have two workflows: an s3_to_json workflow and a json_to_parquet workflow.
  • The s3_to_json workflow is invoked by the lambda subscribed to the SNS topic and in turn triggers a single job. This job (s3_to_json_s3) is shared among all study-specific workflows, which is possible because the job's input and output are parameterized in each study-specific workflow (see the sketch after this list).
  • The json_to_parquet workflow triggers one job for each JSON dataset. This seems redundant, but it's necessary for using job bookmarks. These jobs are also shared among the study-specific workflows -- the different input and output behavior is parameterized in each study-specific workflow.
  • Each study will have two crawlers: one for "standard" JSON datasets (JSON objects) and another for "array_of_records" JSON datasets (JSON arrays). These crawlers aren't part of any workflow; instead they run once a day and do not modify Glue table schemas.
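
A hedged sketch of how the shared job could be parameterized per study: each study's workflow owns a trigger that passes study-specific arguments to the shared s3_to_json_s3 job, and the SNS-subscribed lambda only has to start the right workflow. All names and argument keys other than s3_to_json_s3 are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Each study-specific workflow fires the same shared job with its own
# input/output arguments (study names and argument keys are hypothetical).
for study in ("study-a", "study-b"):
    glue.create_trigger(
        Name=f"{study}-s3-to-json-start",    # hypothetical
        WorkflowName=f"{study}-s3-to-json",  # hypothetical
        Type="ON_DEMAND",  # fired when the lambda starts the workflow run
        Actions=[
            {
                "JobName": "s3_to_json_s3",  # the single shared job
                "Arguments": {
                    "--input-bucket": f"example-input-{study}",  # hypothetical
                    "--glue-database": study.replace("-", "_"),  # hypothetical
                },
            }
        ],
    )


# Sketch of the SNS-subscribed lambda: it starts the study's workflow,
# and the trigger above supplies the study-specific parameterization.
def handler(event, context):
    study = "study-a"  # in practice, derived from the SNS message
    boto3.client("glue").start_workflow_run(Name=f"{study}-s3-to-json")
```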

@philerooski philerooski requested a review from tthyer September 16, 2021 16:53
@philerooski philerooski merged commit cd9c4f6 into Sage-Bionetworks:main Sep 17, 2021
@philerooski philerooski deleted the phil-etl-34-comments branch September 17, 2021 12:01