[ETL-593] Record sort ordering for duplicate records #101

BryanFauble · 2024-01-09T20:41:36Z

Problem:

The duplicate record drop function keeps the wrong record when all of the following conditions hold:

There are multiple records with matching index fields.
More than one of those records shares the newest InsertedDate.
The record we want to keep comes later in the ordered list of intermediate ndjson files.

As a result, data from the older record was kept instead of the newer one, because sorting on InsertedDate alone cannot break the tie.

Solution:
Update the sort to use two keys, InsertedDate and then export_end_date (both descending), so that export_end_date breaks ties on InsertedDate.
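To illustrate the tie-breaking idea outside of Spark, here is a minimal plain-Python sketch. The helper `drop_duplicates` and the two trimmed-down records are hypothetical and are not the Glue job's code; the sketch also assumes the timestamps are ISO 8601 strings in one consistent format, so they compare chronologically as text.

```python
def drop_duplicates(records, index_fields, sort_keys):
    """Hypothetical helper: keep the first record per index after a newest-first sort."""
    # Python's sort is stable, so records that tie on every sort key keep
    # their input order. That is why a single key is order-dependent here.
    ordered = sorted(
        records,
        key=lambda r: tuple(r[k] for k in sort_keys),
        reverse=True,
    )
    kept = {}
    for record in ordered:
        index = tuple(record[f] for f in index_fields)
        kept.setdefault(index, record)  # the first record seen for an index wins
    return list(kept.values())


# Trimmed-down versions of the two test records (illustrative only).
older = {
    "ParticipantIdentifier": "MDH-6614-8265",
    "InsertedDate": "2024-01-01T01:01:01Z",
    "export_end_date": "2024-01-01T00:00:00",
    "EOPReason": "4",
}
newer = dict(older, export_end_date="2024-01-03T00:00:00", EOPReason="5")

# One key: the tie on InsertedDate is resolved by input order, so the stale record wins.
one_key = drop_duplicates([older, newer], ["ParticipantIdentifier"], ["InsertedDate"])
print(one_key[0]["EOPReason"])  # 4

# Two keys: export_end_date breaks the tie regardless of input order.
two_keys = drop_duplicates(
    [older, newer], ["ParticipantIdentifier"], ["InsertedDate", "export_end_date"]
)
print(two_keys[0]["EOPReason"])  # 5
```

Spark does not promise any particular order among rows that tie on the sort key, so the real job may behave differently from this stable-sort sketch in the single-key case; the two-key version removes the ambiguity either way.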

Testing:
An earlier file (EnrolledParticipants_20230103.part0.ndjson) contains this record:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"DateOfBirth": "2000-01-01",
	"PostalCode": "85741",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates": {
		"AppDownloadDate": "2024-01-01",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"EhrConnectedDate": "2024-01-01",
		"DevicesConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01"
	},
	"CustomFields": {
		"AppDownloadDate": "2024-01-01",
		"AppDownloaded": "1",
		"AppleHealthEnabled": "true",
		"AppleHealthRecordsEnabled": "true",
		"AppleHealthRecordsReceived": "true",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DateOfBirthVerified": "2024-01-01T01:01:01Z",
		"DeviceEligible": "1",
		"DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
		"DeviceOrderConfirmationNumber": "AAAAAAAAA",
		"DeviceOrderDate": "2024-01-01T01:01:01Z",
		"DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"DeviceOrderStatus": "placed",
		"DevicesConnected": "1",
		"DevicesConnectedDate": "2024-01-01",
		"EhrConnected": "1",
		"EhrConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"EOPDate": "2024-01-01T01:01:02Z",
		"EOPReason": "4",
		"EOPRemoveData": "",
		"HasOutstandingSurveys": "False",
		"InfectionsReported": "",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01",
		"ProjectCode": "AAAAAA",
		"Site": "Fake University"
	},
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-BBBB-BBBB-BBBB-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"export_start_date": null,
	"export_end_date": "2024-01-01T00:00:00",
	"cohort": "adults_v1"
}

In a later file (EnrolledParticipants_20230112.part0.ndjson) I modified the record to be:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"DateOfBirth": "2000-01-01",
	"PostalCode": "85741",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates": {
		"AppDownloadDate": "2024-01-01",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"EhrConnectedDate": "2024-01-01",
		"DevicesConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01"
	},
	"CustomFields": {
		"AppDownloadDate": "2024-01-01",
		"AppDownloaded": "1",
		"AppleHealthEnabled": "true",
		"AppleHealthRecordsEnabled": "true",
		"AppleHealthRecordsReceived": "true",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DateOfBirthVerified": "2024-01-01T01:01:01Z",
		"DeviceEligible": "1",
		"DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
		"DeviceOrderConfirmationNumber": "AAAAAAAAA",
		"DeviceOrderDate": "2024-01-01T01:01:01Z",
		"DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
		"DeviceOrderNurtureCampaignStart": "2024-01-02",
		"DeviceOrderStatus": "placed",
		"DevicesConnected": "1",
		"DevicesConnectedDate": "2024-01-01",
		"EhrConnected": "2",
		"EhrConnectedDate": "2024-01-02",
		"EhrNurtureCampaignStart": "2024-01-01",
		"EOPDate": "2024-01-01T01:01:02Z",
		"EOPReason": "5",
		"EOPRemoveData": "",
		"HasOutstandingSurveys": "False",
		"InfectionsReported": "",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01",
		"ProjectCode": "AAAAAA",
		"Site": "Fake University"
	},
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-CCCC-CCCC-CCCC-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"export_start_date": null,
	"export_end_date": "2024-01-03T00:00:00",
	"cohort": "adults_v1"
}

Before my change, the output file was:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"Gender": null,
	"DateOfBirth": "2000-01-01",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates_AppDownloadDate": "2024-01-01",
	"EventDates_ConnectFitbitNurtureStart": "2024-01-01",
	"EventDates_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"EventDates_EhrConnectedDate": "2024-01-01",
	"EventDates_DevicesConnectedDate": "2024-01-01",
	"EventDates_EhrNurtureCampaignStart": "2024-01-01",
	"EventDates_JoinNurtureCampaignStart": "2024-01-01",
	"EventDates_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_Reminder2Enabled": null,
	"CustomFields_ReminderTime1": null,
	"CustomFields_ReminderTime2": null,
	"CustomFields_Treatments": NaN,
	"CustomFields_Symptoms": NaN,
	"CustomFields_Reminder1Enabled": null,
	"CustomFields_AppleHealthEnabled": "true",
	"CustomFields_AppleHealthRecordsEnabled": "true",
	"CustomFields_GoogleFitEnabled": null,
	"CustomFields_SkipConsent": null,
	"CustomFields_AppDownloadDate": "2024-01-01",
	"CustomFields_AppDownloaded": "1",
	"CustomFields_AppleHealthRecordsReceived": "true",
	"CustomFields_ConnectFitbitNurtureStart": "2024-01-01",
	"CustomFields_DateOfBirthVerified": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceEligible": "1",
	"CustomFields_DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
	"CustomFields_DeviceOrderConfirmationNumber": "AAAAAAAAA",
	"CustomFields_DeviceOrderDate": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
	"CustomFields_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"CustomFields_DeviceOrderStatus": "placed",
	"CustomFields_DevicesConnected": "1",
	"CustomFields_DevicesConnectedDate": "2024-01-01",
	"CustomFields_EhrConnected": "1",
	"CustomFields_EhrConnectedDate": "2024-01-01",
	"CustomFields_EhrNurtureCampaignStart": "2024-01-01",
	"CustomFields_EOPDate": "2024-01-01T01:01:02Z",
	"CustomFields_EOPReason": "4",
	"CustomFields_EOPRemoveData": "",
	"CustomFields_HasOutstandingSurveys": "False",
	"CustomFields_InfectionsReported": "",
	"CustomFields_JoinNurtureCampaignStart": "2024-01-01",
	"CustomFields_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_ProjectCode": "AAAAAA",
	"CustomFields_Site": "Fake University",
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"export_end_date": "2024-01-01T00:00:00",
	"MiddleName": null,
	"PostalCode": "85741",
	"ParticipantID": "BBBBBBBB-BBBB-BBBB-BBBB-BBBBBBBBBBBB",
	"PreferredLanguage": "en",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z"
}

After my change, the output file matches the expected new data:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"Gender": null,
	"DateOfBirth": "2000-01-01",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates_AppDownloadDate": "2024-01-01",
	"EventDates_ConnectFitbitNurtureStart": "2024-01-01",
	"EventDates_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"EventDates_EhrConnectedDate": "2024-01-01",
	"EventDates_DevicesConnectedDate": "2024-01-01",
	"EventDates_EhrNurtureCampaignStart": "2024-01-01",
	"EventDates_JoinNurtureCampaignStart": "2024-01-01",
	"EventDates_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_AppleHealthEnabled": "true",
	"CustomFields_AppleHealthRecordsEnabled": "true",
	"CustomFields_GoogleFitEnabled": null,
	"CustomFields_Reminder1Enabled": null,
	"CustomFields_Reminder2Enabled": null,
	"CustomFields_ReminderTime1": null,
	"CustomFields_ReminderTime2": null,
	"CustomFields_SkipConsent": null,
	"CustomFields_Symptoms": NaN,
	"CustomFields_Treatments": NaN,
	"CustomFields_AppDownloadDate": "2024-01-01",
	"CustomFields_AppDownloaded": "1",
	"CustomFields_AppleHealthRecordsReceived": "true",
	"CustomFields_ConnectFitbitNurtureStart": "2024-01-01",
	"CustomFields_DateOfBirthVerified": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceEligible": "1",
	"CustomFields_DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
	"CustomFields_DeviceOrderConfirmationNumber": "AAAAAAAAA",
	"CustomFields_DeviceOrderDate": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
	"CustomFields_DeviceOrderNurtureCampaignStart": "2024-01-02",
	"CustomFields_DeviceOrderStatus": "placed",
	"CustomFields_DevicesConnected": "1",
	"CustomFields_DevicesConnectedDate": "2024-01-01",
	"CustomFields_EhrConnected": "2",
	"CustomFields_EhrConnectedDate": "2024-01-02",
	"CustomFields_EhrNurtureCampaignStart": "2024-01-01",
	"CustomFields_EOPDate": "2024-01-01T01:01:02Z",
	"CustomFields_EOPReason": "5",
	"CustomFields_EOPRemoveData": "",
	"CustomFields_HasOutstandingSurveys": "False",
	"CustomFields_InfectionsReported": "",
	"CustomFields_JoinNurtureCampaignStart": "2024-01-01",
	"CustomFields_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_ProjectCode": "AAAAAA",
	"CustomFields_Site": "Fake University",
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"export_end_date": "2024-01-03T00:00:00",
	"PostalCode": "85741",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-CCCC-CCCC-CCCC-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"MiddleName": null
}

BryanFauble · 2024-01-09T20:54:53Z

src/glue/jobs/json_to_parquet.py

@@ -191,34 +191,34 @@ def drop_table_duplicates(
    table_data_type = table_name_components[1]
    spark_df = table.toDF()
    if "InsertedDate" in spark_df.columns:
-        sorted_spark_df = spark_df.sort(spark_df.InsertedDate.desc())
+        sorted_spark_df = spark_df.sort(
+            [spark_df.InsertedDate.desc(), spark_df.export_end_date.desc()]


This is the fix for this Jira ticket. The remaining changes in this file are automatic formatting changes.

BryanFauble · 2024-01-09T20:55:47Z

tests/test_json_to_parquet.py

-                "2023-05-14T00:00:00",
-                "2023-05-14T00:00:00"
-            ]
+        "name": ["John", "John", "Jane", "Bob", "Bob_2"],


I added data to this test case that covers the problem scenario: records with a matching InsertedDate and a newer export_end_date.
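As a rough sketch of what such a test case can look like, the plain-Python example below checks the tie scenario and compares the surviving rows as a set, so the assertion does not depend on row order. Everything here is an assumption for illustration: the `keep_newest` helper, the `id` index field, and the dates are invented, and only the names echo the test data in the diff. It is not the repository's actual test.

```python
def keep_newest(rows, index_field):
    """Hypothetical stand-in for the dedupe step: newest row per index value."""
    newest = {}
    for row in rows:
        # Rank by InsertedDate first, then export_end_date as the tie-breaker.
        rank = (row["InsertedDate"], row["export_end_date"])
        current = newest.get(row[index_field])
        if current is None or rank > (current["InsertedDate"], current["export_end_date"]):
            newest[row[index_field]] = row
    return list(newest.values())


def test_tie_on_inserted_date_keeps_newer_export():
    # Invented rows: "Bob" and "Bob_2" share an index and an InsertedDate,
    # but "Bob_2" comes from a newer export.
    rows = [
        {"id": 2, "name": "Jane", "InsertedDate": "2023-05-13T00:00:00",
         "export_end_date": "2023-05-14T00:00:00"},
        {"id": 3, "name": "Bob", "InsertedDate": "2023-05-14T00:00:00",
         "export_end_date": "2023-05-14T00:00:00"},
        {"id": 3, "name": "Bob_2", "InsertedDate": "2023-05-14T00:00:00",
         "export_end_date": "2023-05-15T00:00:00"},
    ]
    result = keep_newest(rows, "id")
    # Compare as sets so the check ignores the order of the surviving rows.
    assert {row["name"] for row in result} == {"Jane", "Bob_2"}


test_tie_on_inserted_date_keeps_newer_export()
```

Comparing as a set is a deliberate choice here: after deduplication the row order is not something the test should rely on.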

sonarqubecloud · 2024-01-09T21:50:30Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

59 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

rxu17

LGTM!

BryanFauble added 2 commits January 9, 2024 13:35

Record sort ordering for duplicate records

e6ee610

pre-commit

97c44f5

Unit test for duplicate export_end_date order

368c0fd

Convert to set for compare

3bdf4b2

Unit test correction

d92b241

BryanFauble marked this pull request as ready for review January 9, 2024 22:07

BryanFauble requested a review from a team as a code owner January 9, 2024 22:07

rxu17 approved these changes Jan 9, 2024

BryanFauble merged commit 84bbffb into main Jan 9, 2024
15 checks passed

BryanFauble deleted the etl-593-dupe-drop branch January 9, 2024 23:14
