Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_manifests fix for rows with no guid #196

Merged
merged 14 commits into from
Dec 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/tools/indexing.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/tools/metadata.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/_build/html/tools/indexing.html
Original file line number Diff line number Diff line change
Expand Up @@ -380,7 +380,7 @@ <h1>Indexing Tools<a class="headerlink" href="#indexing-tools" title="Link to th

<dl class="py function">
<dt class="sig sig-object py" id="gen3.tools.indexing.verify_manifest.async_verify_object_manifest">
<em class="property"><span class="k"><span class="pre">async</span></span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">gen3.tools.indexing.verify_manifest.</span></span><span class="sig-name descname"><span class="pre">async_verify_object_manifest</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">commons_url</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_concurrent_requests=24</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_row_parsers={'acl':</span> <span class="pre">&lt;function</span> <span class="pre">_get_acl_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'authz':</span> <span class="pre">&lt;function</span> <span class="pre">_get_authz_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'file_name':</span> <span class="pre">&lt;function</span> <span class="pre">_get_file_name_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'file_size':</span> <span class="pre">&lt;function</span> <span class="pre">_get_file_size_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'guid':</span> <span class="pre">&lt;function</span> <span class="pre">_get_guid_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'md5':</span> <span class="pre">&lt;function</span> <span class="pre">_get_md5_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'urls':</span> <span class="pre">&lt;function</span> <span class="pre">_get_urls_from_row&gt;}</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file_delimiter=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_filename='verify-manifest-errors-1700678999.9725351.log'</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Link to this definition">¶</a></dt>
<em class="property"><span class="k"><span class="pre">async</span></span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">gen3.tools.indexing.verify_manifest.</span></span><span class="sig-name descname"><span class="pre">async_verify_object_manifest</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">commons_url</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_concurrent_requests=24</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_row_parsers={'acl':</span> <span class="pre">&lt;function</span> <span class="pre">_get_acl_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'authz':</span> <span class="pre">&lt;function</span> <span class="pre">_get_authz_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'file_name':</span> <span class="pre">&lt;function</span> <span class="pre">_get_file_name_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'file_size':</span> <span class="pre">&lt;function</span> <span class="pre">_get_file_size_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'guid':</span> <span class="pre">&lt;function</span> <span class="pre">_get_guid_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'md5':</span> <span class="pre">&lt;function</span> <span class="pre">_get_md5_from_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'urls':</span> <span class="pre">&lt;function</span> <span class="pre">_get_urls_from_row&gt;}</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file_delimiter=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_filename='verify-manifest-errors-1701900364.6706986.log'</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Link to this definition">¶</a></dt>
<dd><p>Verify all file object records into a manifest csv</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
Expand Down
2 changes: 1 addition & 1 deletion docs/_build/html/tools/metadata.html
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,7 @@ <h1>Metadata Tools<a class="headerlink" href="#metadata-tools" title="Link to th

<dl class="py function">
<dt class="sig sig-object py" id="gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest">
<em class="property"><span class="k"><span class="pre">async</span></span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">gen3.tools.metadata.ingest_manifest.</span></span><span class="sig-name descname"><span class="pre">async_ingest_metadata_manifest</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">commons_url</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">metadata_source</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">auth=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_concurrent_requests=24</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_row_parsers={'guid_for_row':</span> <span class="pre">&lt;function</span> <span class="pre">_get_guid_for_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'indexed_file_object_guid':</span> <span class="pre">&lt;function</span> <span class="pre">_query_for_associated_indexd_record_guid&gt;}</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file_delimiter=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_filename='ingest-metadata-manifest-errors-1700679000.272037.log'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">get_guid_from_file=True</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">metadata_type=None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Link to this definition">¶</a></dt>
<em class="property"><span class="k"><span class="pre">async</span></span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">gen3.tools.metadata.ingest_manifest.</span></span><span class="sig-name descname"><span class="pre">async_ingest_metadata_manifest</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">commons_url</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">metadata_source</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">auth=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_concurrent_requests=24</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_row_parsers={'guid_for_row':</span> <span class="pre">&lt;function</span> <span class="pre">_get_guid_for_row&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">'indexed_file_object_guid':</span> <span class="pre">&lt;function</span> <span class="pre">_query_for_associated_indexd_record_guid&gt;}</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">manifest_file_delimiter=None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_filename='ingest-metadata-manifest-errors-1701900364.9645383.log'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">get_guid_from_file=True</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">metadata_type=None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Link to this definition">¶</a></dt>
<dd><p>Ingest all metadata records into a manifest csv</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
Expand Down
61 changes: 33 additions & 28 deletions gen3/tools/indexing/merge_manifests.py
Original file line number Diff line number Diff line change
Expand Up @@ -106,17 +106,23 @@ def merge_bucket_manifests(

headers = set()
all_rows = {}
records_with_no_guid = []
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move records_with_no_guid initialization, as well as the for record in records_with_no_guid loop to outside the for manifest in files: loop. This would process the records with no guid after all manifests rows have been initially processed.

for manifest in files:
records_from_file, _ = get_and_verify_fileinfos_from_manifest(
manifest, include_additional_columns=True
)
records_with_no_guid = []
for record in records_from_file:
# simple case where this is the first time we've seen this hash
headers.update(record.keys())
if record[MD5_STANDARD_KEY] not in all_rows:
record_to_write = copy.deepcopy(record)
all_rows[record_to_write[MD5_STANDARD_KEY]] = [record_to_write]

new_guid = record.get(GUID_STANDARD_KEY)
if not new_guid:
# since there's no guid specified to differentiate this from other
# entries, we will add metadata to all records later
records_with_no_guid.append(record)
else:
# if the hash already exists, we need to try and update existing
# entries with any new data (and ensure we don't add duplicates)
Expand All @@ -143,36 +149,35 @@ def merge_bucket_manifests(
)
all_rows[record[MD5_STANDARD_KEY]] = updated_records.values()

# for the entries where there was no GUID specified, we will add that metadata
# to all previous records
for record in records_with_no_guid:
updated_records = _get_updated_records(
record=record,
existing_records=all_rows.get(record[MD5_STANDARD_KEY], []),
continue_after_error=continue_after_error,
allow_mult_guids_per_hash=allow_mult_guids_per_hash,
columns_with_arrays=columns_with_arrays,
)
# it's possible a record without a GUID got added if it was the FIRST
# instance of that md5, so we just need to make sure that it's removed
# if there was another GUID provided later on
#
# this also handles the edge case where there were multiple rows for the md5
# and NO guid was provided (e.g. we want a single combined row of updated values)
any_guid_provided = [
record.get(GUID_STANDARD_KEY)
# for the entries where there was no GUID specified, we will add that metadata
# to all previous records
for record in records_with_no_guid:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment re: moving this loop outside of for manifest in files loop

updated_records = _get_updated_records(
record=record,
existing_records=all_rows.get(record[MD5_STANDARD_KEY], []),
continue_after_error=continue_after_error,
allow_mult_guids_per_hash=allow_mult_guids_per_hash,
columns_with_arrays=columns_with_arrays,
)
# it's possible a record without a GUID got added if it was the FIRST
# instance of that md5, so we just need to make sure that it's removed
# if there was another GUID provided later on
#
# this also handles the edge case where there were multiple rows for the md5
# and NO guid was provided (e.g. we want a single combined row of updated values)
any_guid_provided = [
record.get(GUID_STANDARD_KEY)
for record in updated_records.values()
if record.get(GUID_STANDARD_KEY)
]
if not any_guid_provided:
all_rows[record[MD5_STANDARD_KEY]] = updated_records.values()
else:
all_rows[record[MD5_STANDARD_KEY]] = [
record
for record in updated_records.values()
if record.get(GUID_STANDARD_KEY)
]
if not any_guid_provided:
all_rows[record[MD5_STANDARD_KEY]] = updated_records.values()
else:
all_rows[record[MD5_STANDARD_KEY]] = [
record
for record in updated_records.values()
if record.get(GUID_STANDARD_KEY)
]

_create_output_file(
output_manifest, headers, all_rows, output_manifest_file_delimiter
)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
guid size md5 acl authz urls metadata
dg.4503/1234abcd 1193060 2de45cf5e6b2639c98b56b679cffc119 admin phsXXXXXX.c1 AUTHZ_HERE gs://path s3://path some_data
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
guid size md5 acl authz urls metadata
1193060 2de45cf5e6b2639c98b56b679cffc119 phsXXXXXX.c1 admin AUTHZ_HERE gs://path s3://path some_data
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
guid size md5 acl authz urls metadata
dg.4503/1234abcd 1193060 2de45cf5e6b2639c98b56b679cffc119 phsXXXXXX.c1 admin AUTHZ_HERE gs://path s3://path some_data
31 changes: 31 additions & 0 deletions tests/merge_manifests/test_manifest_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,37 @@ def test_size_mismatch():
)


def test_no_guid_input_order():
"""
Test merging two input manifests with that have the same md5, but one has no guid.
Ensure that the order of the input results in the same correct output.
"""
file_order_1 = [
"tests/merge_manifests/no_guid_same_md5_order/input/manifest_with_guid.tsv",
"tests/merge_manifests/no_guid_same_md5_order/input/manifest_WITHOUT_guid.tsv",
]
file_order_2 = [
"tests/merge_manifests/no_guid_same_md5_order/input/manifest_WITHOUT_guid.tsv",
"tests/merge_manifests/no_guid_same_md5_order/input/manifest_with_guid.tsv",
]
merge_bucket_manifests(
files=file_order_1,
output_manifest_file_delimiter=None,
output_manifest="merged-output-test-manifest.tsv",
)
assert _get_tsv_data("merged-output-test-manifest.tsv") == _get_tsv_data(
"tests/merge_manifests/no_guid_same_md5_order/expected-merged-output-manifest.tsv"
)
merge_bucket_manifests(
files=file_order_2,
output_manifest_file_delimiter=None,
output_manifest="merged-output-test-manifest.tsv",
)
assert _get_tsv_data("merged-output-test-manifest.tsv") == _get_tsv_data(
"tests/merge_manifests/no_guid_same_md5_order/expected-merged-output-manifest.tsv"
)


def _get_tsv_data(manifest, delimiter="\t"):
"""
Returns a list of rows sorted by md5 for the given manifest.
Expand Down