[ETL-294] create glue table stack #10

rxu17 · 2023-01-27T00:52:54Z

Purpose: Creates a Glue tables stack, along with the relevant configs and table schemas, that dynamically creates a Glue table for each data type in the recover datasets.

Glue tables stack
Glue table schemas

thomasyu888

🔥 Looks great, @rxu17. I'll let @philerooski take another pass at this.

templates/glue-tables.j2

philerooski · 2023-01-27T20:41:54Z

src/glue/resources/table_columns.yaml

+      - Name: Device
+        Type: struct<Name:string,Model:string,Manufacturer:string,HardwareVersion:string,SoftwareVersion:string,FirmwareVersion:string,LocalIdentifier:string,FDAIdentifier:string>
+      - Name: Metadata
+        Type: struct<HKMetadataKeySyncVersion:string,HKVO2MaxTestType:string,HKMetadataKeySyncIdentifier:string>


For these arbitrary JSON fields, reference the tables in test-database that I created using the crawlers. You'll find a few more fields there. For example, this is the struct for the Metadata field:

{
  "metadata": {
    "HKWasUserEntered": "string",
    "HKAlgorithmVersion": "string",
    "HKMetadataKeyHeartRateMotionContext": "string",
    "HKDateOfEarliestDataUsedForEstimate": "string",
    "HKMetadataKeyAppleDeviceCalibrated": "string",
    "HKTimeZone": "string",
    "HKMetadataKeySyncVersion": "string",
    "HKVO2MaxTestType": "string",
    "HKMetadataKeySyncIdentifier": "string",
    "HKMetadataKeyDevicePlacementSide": "string"
  }
}
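To make the mapping from the crawler output to the `Type:` string in table_columns.yaml concrete, here is a minimal sketch. The helper name `to_glue_struct` is hypothetical and not part of this PR; the field names are copied from the Metadata struct quoted above, and the sketch assumes every member is a flat (non-nested) type.

```python
import json


def to_glue_struct(fields: dict) -> str:
    """Build a Hive-style struct type string from a {field_name: type} mapping."""
    members = ",".join(f"{name}:{dtype}" for name, dtype in fields.items())
    return f"struct<{members}>"


# Field names copied from the crawler-generated Metadata struct in this thread.
crawler_schema = json.loads("""
{ "metadata": {
    "HKWasUserEntered": "string",
    "HKAlgorithmVersion": "string",
    "HKMetadataKeyHeartRateMotionContext": "string",
    "HKDateOfEarliestDataUsedForEstimate": "string",
    "HKMetadataKeyAppleDeviceCalibrated": "string",
    "HKTimeZone": "string",
    "HKMetadataKeySyncVersion": "string",
    "HKVO2MaxTestType": "string",
    "HKMetadataKeySyncIdentifier": "string",
    "HKMetadataKeyDevicePlacementSide": "string"
} }
""")

# The resulting string is what would go in the Type field of the Metadata column.
metadata_type = to_glue_struct(crawler_schema["metadata"])
print(metadata_type)
```

Generating the string from the crawler schema, rather than typing it by hand, is one way to avoid dropping fields such as HKTimeZone.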

philerooski · 2023-01-27T20:49:18Z

src/glue/resources/table_columns.yaml

@@ -0,0 +1,461 @@
+tables:
+  EnrolledParticipants:


Every table needs at least one of export_start_date or export_end_date fields. If only a single date is included in the file name (as it is for the EnrolledParticipants data type), then only export_end_date is needed. Otherwise both fields are needed.

Got it! I went down the route of checking whether the data types themselves have a start date or end date variable in their schemas instead.
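The export date rule from this thread can be sketched as a small check. Everything about the data layout here is an assumption for illustration: the `columns` key, the `single_date_in_filename` flag, and the column lists are hypothetical and do not reflect the actual structure of table_columns.yaml.

```python
def missing_export_fields(column_names, single_date_in_filename):
    """Return the export date columns a table still needs under the rule above."""
    if single_date_in_filename:
        # Only one date in the file name: export_end_date alone is enough.
        required = {"export_end_date"}
    else:
        # Two dates in the file name: both fields are needed.
        required = {"export_start_date", "export_end_date"}
    return sorted(required - set(column_names))


# Hypothetical stand-in for the parsed table definitions (assumed layout).
tables = {
    "EnrolledParticipants": {
        "single_date_in_filename": True,
        "columns": ["export_end_date"],
    },
    # Assumed, for illustration only, to have two dates in its file name.
    "HealthKitV2Heartbeat": {
        "single_date_in_filename": False,
        "columns": ["Device", "Metadata", "export_end_date"],
    },
}

# Collect only the tables that violate the rule.
problems = {}
for name, spec in tables.items():
    missing = missing_export_fields(spec["columns"], spec["single_date_in_filename"])
    if missing:
        problems[name] = missing
print(problems)
```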

philerooski · 2023-01-27T20:53:43Z

templates/glue-tables.j2

+    Properties:
+      CatalogId: !Ref AWS::AccountId
+      DatabaseInput:
+        Description: !Sub 'Recover database'


Include a reference to the namespace in the Description.

philerooski · 2023-01-27T20:54:26Z

templates/glue-tables.j2

+          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+          StoredAsSubDirectories: false
+        TableType: EXTERNAL_TABLE
+    {% endfor %}


I think this is indented once too much.

I copied this from BridgeDownstream, which has the same indentation. I'll make a note to correct it there as well. It doesn't appear to affect how the template works, at least here.

philerooski · 2023-01-31T19:47:28Z

I just noticed that at least one of the table names doesn't match the data types. For example, what table_columns.yaml calls HealthKitV2Electrocardiogram_Samples should be HealthKitV2Electrocardiogram, and HealthKitV2Heartbeats should be HealthKitV2Heartbeat. Can you match the data types as they appear in the recover-dev-intermediate-data bucket?

Re-request my review once you've made all your changes.
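A quick way to catch this kind of mismatch is to compare the table names against the data type names. The sketch below uses hard-coded lists as an assumption; in practice the data type names would come from the bucket contents and the table names from table_columns.yaml. The function name `mismatched_tables` is hypothetical.

```python
def mismatched_tables(table_names, data_types):
    """Return the table names that have no data type with exactly the same name."""
    known = set(data_types)
    return sorted(name for name in table_names if name not in known)


# Table names as they appeared in table_columns.yaml before the fix.
table_names = [
    "EnrolledParticipants",
    "HealthKitV2Electrocardiogram_Samples",
    "HealthKitV2Heartbeats",
]

# Data type names from the review comment above (hard-coded for this sketch).
data_types = [
    "EnrolledParticipants",
    "HealthKitV2Electrocardiogram",
    "HealthKitV2Heartbeat",
]

mismatches = mismatched_tables(table_names, data_types)
print(mismatches)
```

An exact string comparison is deliberate here: near-matches such as a trailing "s" are precisely what the review comment flags.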

Rixing Xu added 3 commits January 26, 2023 15:41

add glue table stacks, columns, configs for develop, update pre-commit

6fc1f41

add placeholder stack for prod glue tables

fcc940f

correct columns, add add. table enrolled participants

8d2120e

rxu17 requested a review from a team as a code owner January 27, 2023 00:52

thomasyu888 approved these changes Jan 27, 2023

remove unused resources

b04acb5

philerooski requested changes Jan 27, 2023

Rixing Xu added 2 commits January 27, 2023 14:12

fix syntax, add more description

a153db3

adjust table schema with current export date variables, correct colum…

465280f

…n definitions

Rixing Xu added 2 commits January 31, 2023 15:47

update to match glue crawler schemas

a084f56

adjust table name

287bdc4

rxu17 requested a review from philerooski January 31, 2023 23:51

philerooski approved these changes Feb 1, 2023

rxu17 merged commit dac3447 into main Feb 1, 2023

rxu17 deleted the etl-294-create-glue-table-stack branch February 1, 2023 00:52
