[ETL-294] create glue table stack #10

Merged 8 commits on Feb 1, 2023

rxu17
Contributor

@rxu17 rxu17 commented Jan 27, 2023

Purpose: Creates a Glue tables stack, along with the relevant configs and table schemas, that dynamically creates a Glue table for each data type in the recover datasets

  • Glue tables stack
  • Glue table schemas

@rxu17 rxu17 requested a review from a team as a code owner January 27, 2023 00:52
Member

@thomasyu888 thomasyu888 left a comment

🔥 Looks great, @rxu17. I'll let @philerooski take another pass at this.

templates/glue-tables.j2 (two resolved review threads)

- Name: Device
  Type: struct<Name:string,Model:string,Manufacturer:string,HardwareVersion:string,SoftwareVersion:string,FirmwareVersion:string,LocalIdentifier:string,FDAIdentifier:string>
- Name: Metadata
  Type: struct<HKMetadataKeySyncVersion:string,HKVO2MaxTestType:string,HKMetadataKeySyncIdentifier:string>
Contributor

For these arbitrary JSON fields, reference the tables in test-database that I created using the crawlers. You'll find a few more fields there. For example, this is the struct for the Metadata field:

{
  "metadata": {
    "HKWasUserEntered": "string",
    "HKAlgorithmVersion": "string",
    "HKMetadataKeyHeartRateMotionContext": "string",
    "HKDateOfEarliestDataUsedForEstimate": "string",
    "HKMetadataKeyAppleDeviceCalibrated": "string",
    "HKTimeZone": "string",
    "HKMetadataKeySyncVersion": "string",
    "HKVO2MaxTestType": "string",
    "HKMetadataKeySyncIdentifier": "string",
    "HKMetadataKeyDevicePlacementSide": "string"
  }
}
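
As a rough illustration of how the crawler output above maps onto the `Type:` string used in the schema file, here is a minimal Python sketch. The helper name `to_glue_struct` is hypothetical and not part of this PR; it only assumes that every nested field is a flat `name: type` pair, as in the Metadata example.

```python
from typing import Mapping


def to_glue_struct(fields: Mapping[str, str]) -> str:
    """Build a Hive-style struct type string from a field-name -> type mapping.

    Hypothetical helper for illustration only; it handles flat fields,
    not nested structs or arrays.
    """
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields.items())
    return f"struct<{inner}>"


# Field names copied from the crawler-derived Metadata struct quoted above.
metadata_fields = {
    "HKWasUserEntered": "string",
    "HKAlgorithmVersion": "string",
    "HKMetadataKeyHeartRateMotionContext": "string",
    "HKDateOfEarliestDataUsedForEstimate": "string",
    "HKMetadataKeyAppleDeviceCalibrated": "string",
    "HKTimeZone": "string",
    "HKMetadataKeySyncVersion": "string",
    "HKVO2MaxTestType": "string",
    "HKMetadataKeySyncIdentifier": "string",
    "HKMetadataKeyDevicePlacementSide": "string",
}

metadata_type = to_glue_struct(metadata_fields)
```

The resulting string could then be pasted in as the `Type:` value for the Metadata column, replacing the three-field struct currently in the diff.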

@@ -0,0 +1,461 @@
tables:
  EnrolledParticipants:
Contributor

Every table needs at least one of export_start_date or export_end_date fields. If only a single date is included in the file name (as it is for the EnrolledParticipants data type), then only export_end_date is needed. Otherwise both fields are needed.
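
The rule above could be checked mechanically. The following is a minimal Python sketch, not code from this PR: the function names are hypothetical, and it assumes the caller already knows how many dates appear in a data type's file name and has the table's field names as a plain list.

```python
from typing import Iterable, List


def required_export_fields(dates_in_file_name: int) -> List[str]:
    """Return the export date fields a table needs.

    One date in the file name -> only export_end_date.
    More than one date -> both export_start_date and export_end_date.
    """
    if dates_in_file_name < 1:
        raise ValueError("expected at least one date in the file name")
    if dates_in_file_name == 1:
        return ["export_end_date"]
    return ["export_start_date", "export_end_date"]


def missing_export_fields(
    field_names: Iterable[str], dates_in_file_name: int
) -> List[str]:
    """List the required export date fields absent from a table's fields."""
    present = set(field_names)
    return [
        field
        for field in required_export_fields(dates_in_file_name)
        if field not in present
    ]
```

For a data type like EnrolledParticipants, with a single date in the file name, `missing_export_fields(["export_end_date"], 1)` returns an empty list.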

Contributor Author

@rxu17 rxu17 Jan 27, 2023

Got it! I think I went down the route of checking whether the data types themselves have a start date or end date variable in their schemas instead.

Properties:
  CatalogId: !Ref AWS::AccountId
  DatabaseInput:
    Description: !Sub 'Recover database'
Contributor

Include a reference to the namespace in the Description

OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
StoredAsSubDirectories: false
TableType: EXTERNAL_TABLE
{% endfor %}
Contributor

I think this is indented once too much.

Contributor Author

I copied this from BridgeDownstream, which has the same indentation. I'll make a note to correct the syntax there as well. It doesn't appear to affect how the code works, at least here.

@philerooski
Contributor

philerooski commented Jan 31, 2023

I just noticed that at least one of the table names doesn't match the data types. For example, what table_columns.yaml calls HealthKitV2Electrocardiogram_Samples should be HealthKitV2Electrocardiogram, and HealthKitV2Heartbeats should be HealthKitV2Heartbeat. Can you match the data types as they appear in the recover-dev-intermediate-data bucket?

Re-request my review once you've made all your changes.
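
A quick way to catch this kind of mismatch is a set comparison between the table names in the config and the data type names in the bucket. The Python sketch below is illustrative only: `mismatched_table_names` is a hypothetical helper, and both lists are hard-coded from the names mentioned in this thread rather than read from table_columns.yaml or listed from S3.

```python
from typing import Iterable, List


def mismatched_table_names(
    table_names: Iterable[str], data_types: Iterable[str]
) -> List[str]:
    """Return table names that do not exactly match any known data type."""
    known = set(data_types)
    return sorted(name for name in table_names if name not in known)


# Assumed inputs: in practice the first list would come from the schema
# config and the second from the data types present in the bucket.
table_names = [
    "EnrolledParticipants",
    "HealthKitV2Electrocardiogram_Samples",
    "HealthKitV2Heartbeats",
]
data_types = [
    "EnrolledParticipants",
    "HealthKitV2Electrocardiogram",
    "HealthKitV2Heartbeat",
]

mismatches = mismatched_table_names(table_names, data_types)
```

With these inputs, `mismatches` contains the two table names called out above.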

@rxu17 rxu17 merged commit dac3447 into main Feb 1, 2023
@rxu17 rxu17 deleted the etl-294-create-glue-table-stack branch February 1, 2023 00:52