[ETL-593] Record sort ordering for duplicate records #101

Merged (5 commits) on Jan 9, 2024

@BryanFauble BryanFauble commented Jan 9, 2024

Problem:

  1. The duplicate record drop function picks the wrong record when all of the following conditions hold:
  • Multiple records share the same index fields
  • More than one of those records has the newest InsertedDate
  • In the intermediate ndjson files, the record we want appears later in the list of files

The issue is that data from the older record was being used instead of the newer one, even though both records have the same InsertedDate.

Solution:
Update the sort to use two keys, InsertedDate and then export_end_date (both descending), so that a tie on InsertedDate is broken by the newer export_end_date.
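The effect of the second sort key can be sketched in plain Python. This is only an illustration under stated assumptions, not the job's Spark code: `drop_duplicates` is a hypothetical helper, `ParticipantIdentifier` is assumed to be the index field, and the two records are trimmed versions of the test records in the Testing section below. Python's `sorted` is stable, so a tie on `InsertedDate` falls back to input order; Spark gives no such ordering guarantee for ties, which is why the explicit tie-breaker matters there.

```python
from typing import Dict, List

Record = Dict[str, str]


def drop_duplicates(
    records: List[Record], index_fields: List[str], sort_fields: List[str]
) -> List[Record]:
    """Sort by sort_fields (descending) and keep the first record per index."""
    # ISO 8601 timestamps in one consistent format compare correctly as strings.
    ordered = sorted(
        records,
        key=lambda r: tuple(r[f] for f in sort_fields),
        reverse=True,
    )
    seen = set()
    kept = []
    for record in ordered:
        key = tuple(record[f] for f in index_fields)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Trimmed versions of the two test records: same InsertedDate,
# different export_end_date.
older = {
    "ParticipantIdentifier": "MDH-6614-8265",
    "InsertedDate": "2024-01-01T01:01:01Z",
    "export_end_date": "2024-01-01T00:00:00",
    "EOPReason": "4",
}
newer = {
    "ParticipantIdentifier": "MDH-6614-8265",
    "InsertedDate": "2024-01-01T01:01:01Z",
    "export_end_date": "2024-01-03T00:00:00",
    "EOPReason": "5",
}

index_fields = ["ParticipantIdentifier"]  # assumed index field

# One key: the tie on InsertedDate is left to input order, so the older record wins.
single_key = drop_duplicates([older, newer], index_fields, ["InsertedDate"])

# Two keys: export_end_date breaks the tie in favour of the newer record.
two_keys = drop_duplicates(
    [older, newer], index_fields, ["InsertedDate", "export_end_date"]
)
```

With these inputs, `single_key` holds the record with `EOPReason` "4" and `two_keys` holds the one with `EOPReason` "5".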

Testing:
In an earlier file (EnrolledParticipants_20230103.part0.ndjson) I have this record present:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"DateOfBirth": "2000-01-01",
	"PostalCode": "85741",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates": {
		"AppDownloadDate": "2024-01-01",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"EhrConnectedDate": "2024-01-01",
		"DevicesConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01"
	},
	"CustomFields": {
		"AppDownloadDate": "2024-01-01",
		"AppDownloaded": "1",
		"AppleHealthEnabled": "true",
		"AppleHealthRecordsEnabled": "true",
		"AppleHealthRecordsReceived": "true",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DateOfBirthVerified": "2024-01-01T01:01:01Z",
		"DeviceEligible": "1",
		"DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
		"DeviceOrderConfirmationNumber": "AAAAAAAAA",
		"DeviceOrderDate": "2024-01-01T01:01:01Z",
		"DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"DeviceOrderStatus": "placed",
		"DevicesConnected": "1",
		"DevicesConnectedDate": "2024-01-01",
		"EhrConnected": "1",
		"EhrConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"EOPDate": "2024-01-01T01:01:02Z",
		"EOPReason": "4",
		"EOPRemoveData": "",
		"HasOutstandingSurveys": "False",
		"InfectionsReported": "",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01",
		"ProjectCode": "AAAAAA",
		"Site": "Fake University"
	},
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-BBBB-BBBB-BBBB-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"export_start_date": null,
	"export_end_date": "2024-01-01T00:00:00",
	"cohort": "adults_v1"
}

In a later file (EnrolledParticipants_20230112.part0.ndjson) I modified the record to be:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"DateOfBirth": "2000-01-01",
	"PostalCode": "85741",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates": {
		"AppDownloadDate": "2024-01-01",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DeviceOrderNurtureCampaignStart": "2024-01-01",
		"EhrConnectedDate": "2024-01-01",
		"DevicesConnectedDate": "2024-01-01",
		"EhrNurtureCampaignStart": "2024-01-01",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01"
	},
	"CustomFields": {
		"AppDownloadDate": "2024-01-01",
		"AppDownloaded": "1",
		"AppleHealthEnabled": "true",
		"AppleHealthRecordsEnabled": "true",
		"AppleHealthRecordsReceived": "true",
		"ConnectFitbitNurtureStart": "2024-01-01",
		"DateOfBirthVerified": "2024-01-01T01:01:01Z",
		"DeviceEligible": "1",
		"DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
		"DeviceOrderConfirmationNumber": "AAAAAAAAA",
		"DeviceOrderDate": "2024-01-01T01:01:01Z",
		"DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
		"DeviceOrderNurtureCampaignStart": "2024-01-02",
		"DeviceOrderStatus": "placed",
		"DevicesConnected": "1",
		"DevicesConnectedDate": "2024-01-01",
		"EhrConnected": "2",
		"EhrConnectedDate": "2024-01-02",
		"EhrNurtureCampaignStart": "2024-01-01",
		"EOPDate": "2024-01-01T01:01:02Z",
		"EOPReason": "5",
		"EOPRemoveData": "",
		"HasOutstandingSurveys": "False",
		"InfectionsReported": "",
		"JoinNurtureCampaignStart": "2024-01-01",
		"LastFitbitTrackerStepsDate": "2024-01-01",
		"ProjectCode": "AAAAAA",
		"Site": "Fake University"
	},
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-CCCC-CCCC-CCCC-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"export_start_date": null,
	"export_end_date": "2024-01-03T00:00:00",
	"cohort": "adults_v1"
}
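The two records above differ only in a handful of fields: `ParticipantID` and `export_end_date` at the top level, and `DeviceOrderNurtureCampaignStart`, `EhrConnected`, `EhrConnectedDate` and `EOPReason` inside `CustomFields`. `InsertedDate` is identical in both. A small helper like the one below can make such differences easy to spot; `changed_fields` is a hypothetical name, and the dictionaries are trimmed to a few top-level fields for brevity.

```python
def changed_fields(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field whose value differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


# Trimmed copies of the earlier and later test records.
earlier = {
    "ParticipantIdentifier": "MDH-6614-8265",
    "ParticipantID": "BBBBBBBB-BBBB-BBBB-BBBB-BBBBBBBBBBBB",
    "InsertedDate": "2024-01-01T01:01:01Z",
    "export_end_date": "2024-01-01T00:00:00",
}
later = {
    "ParticipantIdentifier": "MDH-6614-8265",
    "ParticipantID": "BBBBBBBB-CCCC-CCCC-CCCC-BBBBBBBBBBBB",
    "InsertedDate": "2024-01-01T01:01:01Z",
    "export_end_date": "2024-01-03T00:00:00",
}

diff = changed_fields(earlier, later)
for field, (before, after) in diff.items():
    print(f"{field}: {before} -> {after}")
```

Because `InsertedDate` does not show up in the diff, a sort on `InsertedDate` alone cannot tell these two records apart.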

Before the change I introduced, the output file was:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"Gender": null,
	"DateOfBirth": "2000-01-01",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates_AppDownloadDate": "2024-01-01",
	"EventDates_ConnectFitbitNurtureStart": "2024-01-01",
	"EventDates_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"EventDates_EhrConnectedDate": "2024-01-01",
	"EventDates_DevicesConnectedDate": "2024-01-01",
	"EventDates_EhrNurtureCampaignStart": "2024-01-01",
	"EventDates_JoinNurtureCampaignStart": "2024-01-01",
	"EventDates_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_Reminder2Enabled": null,
	"CustomFields_ReminderTime1": null,
	"CustomFields_ReminderTime2": null,
	"CustomFields_Treatments": NaN,
	"CustomFields_Symptoms": NaN,
	"CustomFields_Reminder1Enabled": null,
	"CustomFields_AppleHealthEnabled": "true",
	"CustomFields_AppleHealthRecordsEnabled": "true",
	"CustomFields_GoogleFitEnabled": null,
	"CustomFields_SkipConsent": null,
	"CustomFields_AppDownloadDate": "2024-01-01",
	"CustomFields_AppDownloaded": "1",
	"CustomFields_AppleHealthRecordsReceived": "true",
	"CustomFields_ConnectFitbitNurtureStart": "2024-01-01",
	"CustomFields_DateOfBirthVerified": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceEligible": "1",
	"CustomFields_DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
	"CustomFields_DeviceOrderConfirmationNumber": "AAAAAAAAA",
	"CustomFields_DeviceOrderDate": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
	"CustomFields_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"CustomFields_DeviceOrderStatus": "placed",
	"CustomFields_DevicesConnected": "1",
	"CustomFields_DevicesConnectedDate": "2024-01-01",
	"CustomFields_EhrConnected": "1",
	"CustomFields_EhrConnectedDate": "2024-01-01",
	"CustomFields_EhrNurtureCampaignStart": "2024-01-01",
	"CustomFields_EOPDate": "2024-01-01T01:01:02Z",
	"CustomFields_EOPReason": "4",
	"CustomFields_EOPRemoveData": "",
	"CustomFields_HasOutstandingSurveys": "False",
	"CustomFields_InfectionsReported": "",
	"CustomFields_JoinNurtureCampaignStart": "2024-01-01",
	"CustomFields_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_ProjectCode": "AAAAAA",
	"CustomFields_Site": "Fake University",
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"export_end_date": "2024-01-01T00:00:00",
	"MiddleName": null,
	"PostalCode": "85741",
	"ParticipantID": "BBBBBBBB-BBBB-BBBB-BBBB-BBBBBBBBBBBB",
	"PreferredLanguage": "en",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z"
}

After the change I introduced, the output file matches the new expected data:

{
	"ParticipantIdentifier": "MDH-6614-8265",
	"GlobalKey": "A2D9D8D4-A682-EC11-AAA9-986E217866BF",
	"EmailAddress": "bryan.fauble@sagebase.org",
	"FirstName": "Test_First",
	"LastName": "Test_Last",
	"Gender": null,
	"DateOfBirth": "2000-01-01",
	"EnrollmentDate": "2024-01-01T01:01:01Z",
	"EventDates_AppDownloadDate": "2024-01-01",
	"EventDates_ConnectFitbitNurtureStart": "2024-01-01",
	"EventDates_DeviceOrderNurtureCampaignStart": "2024-01-01",
	"EventDates_EhrConnectedDate": "2024-01-01",
	"EventDates_DevicesConnectedDate": "2024-01-01",
	"EventDates_EhrNurtureCampaignStart": "2024-01-01",
	"EventDates_JoinNurtureCampaignStart": "2024-01-01",
	"EventDates_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_AppleHealthEnabled": "true",
	"CustomFields_AppleHealthRecordsEnabled": "true",
	"CustomFields_GoogleFitEnabled": null,
	"CustomFields_Reminder1Enabled": null,
	"CustomFields_Reminder2Enabled": null,
	"CustomFields_ReminderTime1": null,
	"CustomFields_ReminderTime2": null,
	"CustomFields_SkipConsent": null,
	"CustomFields_Symptoms": NaN,
	"CustomFields_Treatments": NaN,
	"CustomFields_AppDownloadDate": "2024-01-01",
	"CustomFields_AppDownloaded": "1",
	"CustomFields_AppleHealthRecordsReceived": "true",
	"CustomFields_ConnectFitbitNurtureStart": "2024-01-01",
	"CustomFields_DateOfBirthVerified": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceEligible": "1",
	"CustomFields_DeviceOrderCompleteDate": "2024-01-01T01:01:01.1111111+00:00",
	"CustomFields_DeviceOrderConfirmationNumber": "AAAAAAAAA",
	"CustomFields_DeviceOrderDate": "2024-01-01T01:01:01Z",
	"CustomFields_DeviceOrderInfo": "{\"Email\":\"bryan.fauble@sagebase.org\",\"Phone\":\"(123) 456-7890\",\"DeviceSku\":\"AAAAAAA\",\"DeviceName\":\"Fitbit Sense 2(Shadow Grey / Graphite Aluminium)\",\"ShippingAddress\":{\"Address1\":\"123 FAKE STREET\",\"Address2\":null,\"City\":\"TUCSON\",\"State\":\"AZ\",\"PostalCode\":\"85741\"}}",
	"CustomFields_DeviceOrderNurtureCampaignStart": "2024-01-02",
	"CustomFields_DeviceOrderStatus": "placed",
	"CustomFields_DevicesConnected": "1",
	"CustomFields_DevicesConnectedDate": "2024-01-01",
	"CustomFields_EhrConnected": "2",
	"CustomFields_EhrConnectedDate": "2024-01-02",
	"CustomFields_EhrNurtureCampaignStart": "2024-01-01",
	"CustomFields_EOPDate": "2024-01-01T01:01:02Z",
	"CustomFields_EOPReason": "5",
	"CustomFields_EOPRemoveData": "",
	"CustomFields_HasOutstandingSurveys": "False",
	"CustomFields_InfectionsReported": "",
	"CustomFields_JoinNurtureCampaignStart": "2024-01-01",
	"CustomFields_LastFitbitTrackerStepsDate": "2024-01-01",
	"CustomFields_ProjectCode": "AAAAAA",
	"CustomFields_Site": "Fake University",
	"UtcOffset": "-05:00:00",
	"TimeZone": "America/New_York",
	"export_end_date": "2024-01-03T00:00:00",
	"PostalCode": "85741",
	"PreferredLanguage": "en",
	"ParticipantID": "BBBBBBBB-CCCC-CCCC-CCCC-BBBBBBBBBBBB",
	"MobilePhone": "(123) 456-7890",
	"UnsubscribedFromEmails": "true",
	"UnsubscribedFromSMS": "false",
	"InsertedDate": "2024-01-01T01:01:01Z",
	"MiddleName": null
}

@@ -191,34 +191,34 @@ def drop_table_duplicates(
     table_data_type = table_name_components[1]
     spark_df = table.toDF()
     if "InsertedDate" in spark_df.columns:
-        sorted_spark_df = spark_df.sort(spark_df.InsertedDate.desc())
+        sorted_spark_df = spark_df.sort(
+            [spark_df.InsertedDate.desc(), spark_df.export_end_date.desc()]

This is the fix for this Jira ticket. The rest of the changes in this file are auto-formatting changes.

"2023-05-14T00:00:00",
"2023-05-14T00:00:00"
]
"name": ["John", "John", "Jane", "Bob", "Bob_2"],

I added data to this test case for the problem situation: records with a matching InsertedDate where one has a newer export_end_date.

sonarqubecloud bot commented Jan 9, 2024

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

59 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

@rxu17 rxu17 left a comment

LGTM!

@BryanFauble BryanFauble merged commit 84bbffb into main Jan 9, 2024
15 checks passed
@BryanFauble BryanFauble deleted the etl-593-dupe-drop branch January 9, 2024 23:14