Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_manifests fix for rows with no guid #196

Merged
merged 14 commits into from
Dec 6, 2023

Conversation

SpencerAxelrod
Copy link
Contributor

@SpencerAxelrod SpencerAxelrod commented Aug 28, 2023

Jira Ticket: PXP-11205

The merge_bucket_manifests() function in gen3.tools.indexing.merge_manifests should be able to merge the following files:

manifest_with_guid.tsv

guid size md5 acl authz urls metadata
dg.4503/1234abcd 1193060 2de45 phsXXXXXX.c1 admin AUTHZ_HERE gs://path s3://path some_data

manifest_WITHOUT_guid.tsv

guid size md5 acl authz urls metadata
1193060 2de45 phsXXXXXX.c1 admin AUTHZ_HERE gs://path s3://path some_data

However, depending on the order of the list of files passed into merge_bucket_manifests() the result will either merge the two into a single row, or keep them separate

This is due to the variable records_with_no_guid in the aforementioned function being initialized within the for loop that is iterating through the list of files to merge. Given a row in file manifest_WITHOUT_guid.tsv with md5: 2de45 , it will not be merged with the row in file manifest_with_guid.tsv with same md5: 2de45 if manifest_WITHOUT_guid.tsv is before in the list of files passed in. Initializing and processing records_with_no_guid outside the loop fixes this behavior.

New Features

Breaking Changes

Bug Fixes

  • Fixes issue with merge_bucket_manifests where the order of input manifests would inconsistent (and incorrect) merge behavior.

Improvements

Dependency updates

Deployment changes

@github-actions
Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

@@ -106,17 +106,23 @@ def merge_bucket_manifests(

headers = set()
all_rows = {}
records_with_no_guid = []
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move records_with_no_guid initialization, as well as the for record in records_with_no_guid loop to outside the for manifest in files: loop. This would process the records with no guid after all manifests rows have been initially processed.

# and NO guid was provided (e.g. we want a single combined row of updated values)
any_guid_provided = [
record.get(GUID_STANDARD_KEY)
for record in records_with_no_guid:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment re: moving this loop outside of for manifest in files loop

Copy link
Contributor

@k-burt-uch k-burt-uch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write a regression test.

Also the bug fix section in your PR template is very thorough but unfortunately our release note automation won't be able to parse it properly due to the multiple line breaks.

I think something more succinct like

"Fixes issue with merge_bucket_manifests where the order of input manifests would inconsistent (and incorrect) merge behavior."

@Avantol13
Copy link
Contributor

Avantol13 commented Dec 6, 2023

Also the bug fix section in your PR template is very thorough but unfortunately our release note automation won't be able to parse it properly due to the multiple line breaks.

I think something more succinct like

"Fixes issue with merge_bucket_manifests where the order of input manifests would inconsistent (and incorrect) merge behavior."

You can leave the detailed description above the headings for posterity if someone looks at this PR again. But that's right, the bullet under the heading should be succinct, like one sentence

Copy link
Contributor

@k-burt-uch k-burt-uch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test provided that verifies the order of manifests does not matter when a guid is not provided in one manifest and that the end result is the same. Approved.

@paulineribeyre paulineribeyre merged commit 34935fd into master Dec 6, 2023
5 checks passed
@paulineribeyre paulineribeyre deleted the fix/merge_bucket_manifest_fix branch December 6, 2023 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants