[ETL-593] Record sort ordering for duplicate records #101
@@ -191,34 +191,34 @@ def drop_table_duplicates(
     table_data_type = table_name_components[1]
     spark_df = table.toDF()
     if "InsertedDate" in spark_df.columns:
-        sorted_spark_df = spark_df.sort(spark_df.InsertedDate.desc())
+        sorted_spark_df = spark_df.sort(
+            [spark_df.InsertedDate.desc(), spark_df.export_end_date.desc()]
This is the fix for this Jira ticket. The rest of the changes in this file are auto-formatting changes.
"2023-05-14T00:00:00",
"2023-05-14T00:00:00"
]
"name": ["John", "John", "Jane", "Bob", "Bob_2"],
I added the problem situation to this test case's data: records with a matching InsertedDate and a newer export_end_date.
Quality Gate passed
The SonarCloud Quality Gate passed, but some issues were introduced. 59 new issues.
LGTM!
Problem:
When duplicate records share the same InsertedDate, data from the older record was being used instead of the newer one.

Solution:
Update the sort order to take 2 keys: InsertedDate and export_end_date.

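The tie-breaking behavior can be illustrated with a minimal plain-Python sketch. This is not the PySpark code from the diff. The function name drop_duplicates_keep_latest, the record_id key, and the export_end_date values are assumptions made up for illustration. Only the column names InsertedDate and export_end_date, the "Bob" / "Bob_2" names, and the 2023-05-14 timestamp come from this PR.

```python
from datetime import datetime


def drop_duplicates_keep_latest(records, key="record_id"):
    """Keep one record per key (hypothetical helper, not the PR's function).

    Records are ordered newest-first by InsertedDate, with
    export_end_date as the tie-breaker when InsertedDate matches.
    """
    ordered = sorted(
        records,
        key=lambda r: (
            datetime.fromisoformat(r["InsertedDate"]),
            datetime.fromisoformat(r["export_end_date"]),
        ),
        reverse=True,  # descending on both sort keys
    )
    latest = {}
    for record in ordered:
        # The first record seen for each key is the newest one.
        latest.setdefault(record[key], record)
    return list(latest.values())


# Two duplicates with a matching InsertedDate; the second comes from
# a later export (the export_end_date values are illustrative).
records = [
    {
        "record_id": "a",
        "name": "Bob",
        "InsertedDate": "2023-05-14T00:00:00",
        "export_end_date": "2023-05-14T00:00:00",
    },
    {
        "record_id": "a",
        "name": "Bob_2",
        "InsertedDate": "2023-05-14T00:00:00",
        "export_end_date": "2023-05-16T00:00:00",
    },
]

deduped = drop_duplicates_keep_latest(records)
```

With a single sort key on InsertedDate, these two records tie and which one survives depends on their incoming order. The second key makes the record from the later export win.
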
Testing:
In an earlier file (EnrolledParticipants_20230103.part0.ndjson) I have this record present:

In a later file (EnrolledParticipants_20230112.part0.ndjson) I modified the record to be:

Before the change I introduced, the output file was:

After the changes I introduced, the file matches the new expected data: