Changes in response to comments on ETL-34 #4

Merged
philerooski merged 3 commits into Sage-Bionetworks:main from phil-etl-34-comments on Sep 17, 2021
Conversation

philerooski
Contributor

New pipeline behavior:

  • New archives are broken up into their JSON datasets as soon as they come in.
  • New files in each JSON dataset are processed on a schedule (roughly every 10 minutes, TBD). This is a tradeoff between adding new data to the corresponding Parquet datasets as quickly as possible and using the Spark cluster efficiently -- e.g., not spinning up a Spark cluster for each JSON dataset (see "organizational" below) just to process data from a single archive.
  • Crawlers run once a day and only log schema changes; they do not modify schemas. Their main purpose is to add table metadata to newly created partitions (see the boto3 sketch after this list).
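
For concreteness, here is a minimal boto3 sketch of the scheduling and crawler behavior described above. The trigger, workflow, crawler, role, database, and S3 path names are all hypothetical, and the 10-minute cadence is still TBD:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical scheduled trigger: start the study's json_to_parquet
# workflow roughly every 10 minutes (cadence still TBD). In practice
# the workflow fires one job per JSON dataset; a single action is
# shown here for brevity.
glue.create_trigger(
    Name="example-study-json-to-parquet-schedule",  # hypothetical
    WorkflowName="example-study-json-to-parquet",   # hypothetical
    Type="SCHEDULED",
    Schedule="cron(0/10 * * * ? *)",  # every 10 minutes
    StartOnCreation=True,
    Actions=[{"JobName": "json_to_parquet"}],  # hypothetical job name
)

# Hypothetical daily crawler: the LOG-only schema change policy means
# it records schema changes without modifying the table schema, while
# new partitions inherit their metadata from the existing table.
glue.create_crawler(
    Name="example-study-standard-json-crawler",  # hypothetical
    Role="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    DatabaseName="example_study_db",  # hypothetical per-study database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/example-study/json/"}]},
    Schedule="cron(0 0 * * ? *)",  # once a day
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)
```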

New organizational assumptions (how we organize tables, jobs, workflows):

  • Each study will have its Glue tables in a separate Glue database.
  • Each study will have two workflows: an s3_to_json workflow and a json_to_parquet workflow.
  • The s3_to_json workflow is invoked by the lambda subscribed to the SNS topic and in turn triggers a single job. This job (s3_to_json_s3) is shared among all study-specific workflows, which is possible because the job's input and output are parameterized in each study-specific workflow (see the sketch after this list).
  • The json_to_parquet workflow triggers one job for each JSON dataset. This seems redundant, but it's necessary for using job bookmarks. These jobs are also shared among the study-specific workflows -- the different input and output behavior is parameterized in each study-specific workflow.
  • Each study will have two crawlers: one for "standard" JSON datasets (JSON objects) and another for "array_of_records" JSON datasets (JSON arrays). These crawlers aren't part of any workflow; instead they run once a day and do not modify Glue table schemas.
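
A hedged sketch of how the shared job could be parameterized per study: each study's workflow owns a trigger that passes study-specific arguments to the shared s3_to_json_s3 job, and the SNS-subscribed lambda only has to start the right workflow. All names and argument keys other than s3_to_json_s3 are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Each study-specific workflow fires the same shared job with its own
# input/output arguments (study names and argument keys are hypothetical).
for study in ("study-a", "study-b"):
    glue.create_trigger(
        Name=f"{study}-s3-to-json-start",    # hypothetical
        WorkflowName=f"{study}-s3-to-json",  # hypothetical
        Type="ON_DEMAND",  # fired when the lambda starts the workflow run
        Actions=[
            {
                "JobName": "s3_to_json_s3",  # the single shared job
                "Arguments": {
                    "--input-bucket": f"example-input-{study}",  # hypothetical
                    "--glue-database": study.replace("-", "_"),  # hypothetical
                },
            }
        ],
    )


# Sketch of the SNS-subscribed lambda: it starts the study's workflow,
# and the trigger above supplies the study-specific parameterization.
def handler(event, context):
    study = "study-a"  # in practice, derived from the SNS message
    boto3.client("glue").start_workflow_run(Name=f"{study}-s3-to-json")
```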

@philerooski philerooski requested a review from tthyer September 16, 2021 16:53
@philerooski philerooski merged commit cd9c4f6 into Sage-Bionetworks:main Sep 17, 2021
@philerooski philerooski deleted the phil-etl-34-comments branch September 17, 2021 12:01