As a user, I want validate to throw an error when a file is being referenced by more than one label #755

Closed
tbarnes4 opened this issue Nov 8, 2023 · 15 comments · Fixed by #769

Comments

@tbarnes4

tbarnes4 commented Nov 8, 2023

Checked for duplicates

Yes - I've already checked

πŸ§‘β€πŸ”¬ User Persona(s)

All the above; anyone that is validating collections and bundles or a directory of products.

💪 Motivation

According to the standards, a file should be described only once by a label, and then referenced by LID if multiple labels need to point to it. Currently validate does not check for this. We recently found a case in the dart_teleobs:data_mroraw collection where a single file was described by a File_Area_Observational_Supplemental within multiple XML labels. No one thought to check for this, but I would hope it would be easy to check, especially since we already check whether each file has a label.

Very similar in nature to #245

📖 Additional Details

Example: https://pdssbn.astro.umd.edu/holdings/pds4-dart_teleobs:data_mroraw-v1.0/mro_221011/

The file "comps_221011.png" is listed in the File_Area_Observational_Supplemental for 30 files in this directory, the first three of which are listed below:

  • m221011.0013.xml
  • m221011.0014.xml
  • m221011.0030.xml

Acceptance Criteria

Given a set of label files where >1 of them references the same file from File_Area_*
When I perform validate --target product1.xml product2.xml etc.
Then I expect validate to throw an error
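
The criteria above could be sketched roughly as follows. This is only an illustration in Python, not how validate (a Java tool) implements it: the function names and the `labels` input shape are assumptions, and it matches elements by local name so that it does not depend on a particular PDS4 namespace version.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip an XML namespace: '{http://...}file_name' becomes 'file_name'."""
    return tag.rsplit("}", 1)[-1]


def referenced_files(label_xml):
    """Return the file names found under any File_Area_* element of one label."""
    root = ET.fromstring(label_xml)
    names = set()
    for area in root.iter():
        if not local_name(area.tag).startswith("File_Area_"):
            continue
        for elem in area.iter():
            if local_name(elem.tag) == "file_name" and elem.text:
                names.add(elem.text.strip())
    return names


def duplicated_references(labels):
    """Map each file name to the labels describing it, keeping files with >1 label.

    `labels` is a dict of label name -> XML text (a hypothetical input shape).
    """
    owners = defaultdict(list)
    for label_name, xml_text in sorted(labels.items()):
        for file_name in sorted(referenced_files(xml_text)):
            owners[file_name].append(label_name)
    return {f: names for f, names in owners.items() if len(names) > 1}
```

Any non-empty result would correspond to the error case in the "Then" clause.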

βš™οΈ Engineering Details

When keeping inventory of the files that have been referenced, we can drop the inventory when we move to another directory.
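
A minimal sketch of that idea, assuming the references arrive as `(label_path, file_path)` pairs grouped by directory (the input shape and function names are hypothetical, not validate's internals):

```python
import os


def _duplicates(directory, inventory):
    """List (directory, file, labels) for files described by more than one label."""
    return [(directory, f, labels)
            for f, labels in inventory.items() if len(labels) > 1]


def check_by_directory(references):
    """Yield duplicates per directory, dropping the inventory on each change.

    `references` is an iterable of (label_path, file_path) pairs, assumed to be
    ordered so that all labels of one directory arrive together.
    """
    current_dir = None
    inventory = {}  # file path -> labels seen in the current directory
    for label_path, file_path in references:
        label_dir = os.path.dirname(label_path)
        if label_dir != current_dir:
            # Moving to another directory: report, then drop the old inventory.
            yield from _duplicates(current_dir, inventory)
            inventory = {}
            current_dir = label_dir
        inventory.setdefault(file_path, []).append(label_path)
    yield from _duplicates(current_dir, inventory)
```

The memory held at any time is bounded by the largest directory, at the cost of missing labels in different directories that describe the same file.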

I&T

TestRail Test ID: T8681202

@jordanpadams
Member

@tbarnes4 can you provide the location within the standards where this rule is described?

@tbarnes4
Author

tbarnes4 commented Nov 8, 2023

@jordanpadams Funny about that. I was trying to find it in the standards, and the closest I got was Section 3 Labels, paragraph 5, starting with "Under PDS4, all product labels ...", which says there should be only one label for each product. I cornered Anne and she confirmed to me that each object can only be described once, not multiple times. She did ponder, though, how the registry would handle multiple XML files pointing to the same file object.

@jordanpadams
Member

jordanpadams commented Nov 8, 2023

@tbarnes4 Copy. So to clarify, if you are a data provider, the better route for handling this supplemental file would have been to describe it as some other data object (e.g. Product_Browse) and then reference it from the Reference_List. We will look at how we might be able to implement this with Validate.

> She did ponder though how the registry would handle multiple xml files pointing to the same file object.

The Registry doesn't care. It will just load the metadata and the pointer, and it will have a file_ref with a URL to the supplemental file.

github-project-automation bot moved this to Release Backlog in B14.1 on Nov 10, 2023
jordanpadams changed the title from "As a user, I want to see when a file is being described by more than one label" to "As a user, I want validate to throw an error when a file is being referenced by more than one label" on Nov 10, 2023
@al-niessner
Contributor

@jordanpadams @tbarnes4

I appreciate what is being requested and can partially support it. It will require a sweeper as well to make it complete. For instance, if validate fails with a.xml and b.xml when given in a single command, there is nothing to stop the user from passing both with validate a.xml then validate b.xml - not implying maliciousness but just need to get it done to meet a deadline or just wrote the script to do one xml at a time. The only way to truly prevent two labels pointing at the same file is to do a partial check in validate, then do it again at harvest, then do it again with a sweeper for those that evade the fool-proof previous checks.

Does this apply to LID or LIDVID? In other words, can urn::1.0 point at the same file as urn::1.1?

@jordanpadams
Member

jordanpadams commented Nov 14, 2023

@al-niessner 100% agree on the sweeper. we eventually want to provide much more comprehensive referential integrity checks behind the scenes. for now, we just want to do this check on the file system.

> Does this apply to LID or LIDVID? In other words, can urn::1.0 point at the same file as urn::1.1?

Great question. @tbarnes4 any idea if new versions of a label can point to the same product? for instance, what about when someone updates a label, but doesn't touch the data? it seems silly to have to create an entirely new version of the product file.

@tbarnes4
Author

@al-niessner @jordanpadams

> Does this apply to LID or LIDVID? In other words, can urn::1.0 point at the same file as urn::1.1?

Yes, this should be fine and validate should not flag this as an issue. As @jordanpadams mentioned, label-only changes will happen a lot. But in this case, I would assume (back to this again) that within a collection version or latest bundle version, a product LID would only ever be used once. Would or should there ever be a case where a LID is used more than once (but with a different LIDVID) within a single collection version? I would imagine it should never be the case (but what is stopping that?). And I think that is where the scope of validate (at this time) should end: within a collection version. Thinking out loud here, so if I validate a collection or a bundle or folder (?), validate should expect that each file has only one LID associated with it per collection LIDVID (in case a folder or bundle contains multiple versions of a collection within its directory structure).

Thinking of other possible situations, is it possible that in a future collection version a different LID would point to a file? For instance, in collection vX.0 LID-A points to file M and file N, but in collection vY.0, LID-A no longer points to file N, and LID-B now points to file P and file N. How would validate resolve this situation if it was validating a folder or bundle that had both vX.0 and vY.0 of the collection?

Sanity question: does validate check that each LID (LIDVID too?) appears only once in a collection, in other words that there is no duplicate LID usage? Is anything stopping a bundle version from having a collection listed multiple times but with different versions (LIDVIDs)?

> For instance, if validate fails with a.xml and b.xml when given in a single command, there is nothing to stop the user from passing both with validate a.xml then validate b.xml - not implying maliciousness but just need to get it done to meet a deadline or just wrote the script to do one xml at a time. The only way to truly prevent two labels pointing at the same file is to do a partial check in validate, then do it again at harvest, then do it again with a sweeper for those that evade the fool-proof previous checks.

If validate is asked to validate a single product, I do not think it should query the universe to see if another label is pointing to the same file on my local file system (on the cloud is an interesting question though). This kind of check should really only be done as a kind of referential integrity check of a collection or bundle and perhaps only warn users (at most) if they are validating multiple products at once outside the collection/bundle checks.

@tbarnes4
Author

> Would or should there ever be a case where a LID is used more than once (but different LIDVID) within a single collection version? I would imagine it should never be the case (but what is stopping that?). And I think that that is where the scope of validate (at this time) should end, within a collection version. Thinking out loud here, so if I validate a collection or a bundle or folder (?), validate should expect that each file has only one LID associated with it per collection LIDVID (in the case a folder or bundle contains multiple versions of a collection within its directory structure).

Never mind, I think I can think of a use case (I haven't checked the standards yet to know whether this is allowed). When we are migrating data sets from PDS3, it can be important to preserve multiple versions of a document. Instead of creating multiple versions of a collection, we would just include all versions of the LID within the first released collection. So in this case we could have multiple instances of a LID, but no duplicate LIDVID. But this case is different from the thread above because here we would have multiple different document files, not a single version of a document with multiple labels.

@jordanpadams
Member

@tbarnes4

copy. so to clarify, we are saying that multiple versions of a LID can reference the same data file, but 2 different LIDs cannot reference the same file.
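
As a rough sketch of that rule (the function names and the `(lidvid, file_name)` input shape are assumptions made for illustration): strip the version from each LIDVID and flag a file only when two or more distinct LIDs claim it.

```python
from collections import defaultdict


def lid_of(lidvid):
    """Return the LID part of a LIDVID such as 'urn:nasa:pds:x:y:z::1.0'."""
    return lidvid.split("::", 1)[0]


def conflicting_files(references):
    """Return {file: sorted LIDs} for files referenced by two or more distinct LIDs.

    `references` is an iterable of (lidvid, file_name) pairs. Several versions
    of the same LID pointing at one file are not treated as a conflict.
    """
    lids_by_file = defaultdict(set)
    for lidvid, file_name in references:
        lids_by_file[file_name].add(lid_of(lidvid))
    return {f: sorted(lids) for f, lids in lids_by_file.items() if len(lids) > 1}
```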

@tbarnes4
Author

@jordanpadams

Yes, that is what I'd lean towards. Thanks for talking/thinking it through with me.

@al-niessner
Contributor

@jordanpadams @tbarnes4

One more complication that may or may not be realistic:

A collection/bundle references labels a.xml and b.xml by LID. Versions 1.0 through 2.1 of a.xml have a file area for snafu.csv. The decision is then made to move snafu.csv from a.xml to b.xml, making them versions 3.0 and 2.0 respectively. Since the bundles and collections only use the LID, the only way to do the processing is to check the latest version only. In other words, when a LID is given instead of a LIDVID, always use the latest version. Does this produce the desired behavior?

@tbarnes4
Author

@al-niessner @jordanpadams

I don't know. I would expect validate to check all files and all LID versions, as limited by what is provided to validate and what options are invoked. I cannot foresee a situation where, within a single collection version, there would be multiple VIDs for a single LID AND snafu.csv is moving from one LID to another LID. In this case we would have to ensure that each collection version has only one LID per object file.

Correct me if I'm wrong, but collections must use LIDVIDs in their inventory file. Only bundles have the option of providing LIDs or LIDVIDs for collections, though I thought we were in the process of forcing bundles to use only LIDVIDs of collections.

@jordanpadams
Member

@al-niessner @tbarnes4 if someone points at a specific bundle.xml with the pds4.bundle rule enabled, I agree if they reference a collection by LID we should just look at the latest versions of the collections and check from there. Past versions of the collections shouldn't matter in this context. If someone wants to check past versions of a collection, they can point at that collection and use the pds4.collection rule.
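
Resolving a LID to its latest version could look something like the sketch below. It assumes VIDs of the form `major.minor` and compares them numerically, so that 1.10 sorts after 1.9; the helper names are hypothetical.

```python
def version_key(lidvid):
    """Turn the VID of 'lid::major.minor' into a tuple of ints for comparison."""
    vid = lidvid.split("::", 1)[1]
    return tuple(int(part) for part in vid.split("."))


def latest_versions(lidvids):
    """Keep only the highest-versioned LIDVID for each LID."""
    latest = {}
    for lidvid in lidvids:
        lid = lidvid.split("::", 1)[0]
        if lid not in latest or version_key(lidvid) > version_key(latest[lid]):
            latest[lid] = lidvid
    return latest
```

Only the LIDVIDs returned here would then be fed into the duplicate-file check, so past versions of a collection do not influence the result.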

@smclaughlin7

@jordanpadams Should NASA-PDS/validate#245 be linked here (to NASA-PDS/validate#755)?

@al-niessner
Contributor

@jordanpadams

After plotting a strategy yesterday, it mostly worked today with some tactical adjustments. It is leaving me with a really big problem: how to report the error meaningfully. I cannot do the entire job until all file areas have been found. Since there is no way to know, while validating a label, what the last label will be, I had to make the check global (post apache chain stuff). Here is what the summary looks like for `validate -t m221011.0013.xml m221011.0014.xml m221011.0030.xml`:

Summary:

  1 error(s)
  0 warning(s)

  Product Validation Summary:
    3          product(s) passed
    1          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    1            error.label.file_areas_duplicated_reference

End of Report
Completed execution in 6327 ms

The error is this:

  FAIL: file:/home/niessner/Projects/PDS/validate/src/test/resources/github755/m221011.0014.xml
      ERROR  [error.label.file_areas_duplicated_reference]   The labels urn:nasa:pds:dart_teleobs:data_mroraw:m221011.0014::1.0, urn:nasa:pds:dart_teleobs:data_mroraw:m221011.0013::1.0, urn:nasa:pds:dart_teleobs:data_mroraw:m221011.0030::1.0 in files /home/niessner/Projects/PDS/validate/src/test/resources/github755/m221011.0014.xml, /home/niessner/Projects/PDS/validate/src/test/resources/github755/m221011.0013.xml, /home/niessner/Projects/PDS/validate/src/test/resources/github755/m221011.0030.xml respectively all contain references to the file comps_221011.png
        4 product validation(s) completed

The problem I cannot resolve is how to signal the failure. The summary says 3 passed with 1 error. That in itself seems odd. Do I go back and subtract the number of labels with dups from passed? Do we have a new summary section called "Global Validation Summary"? I like the latter because, just as referential integrity does not bleed into product, neither should global. What is your preference?

@jordanpadams
Member

@al-niessner so if we are going through the files and we read urn:nasa:pds:dart_teleobs:data_mroraw:m221011.0014::1.0 and see that it references comps_221011.png, then any products that follow and reference the same file should fail (let's assume the first one is the "right one" for reporting purposes).

So as you read urn:nasa:pds:dart_teleobs:data_mroraw:m221011.0013::1.0 and see that it references comps_221011.png, it should fail, and so on.

Is it possible for us to do that?
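
One way to picture this "first one wins" reporting, as a sketch only (the input shape and names are assumptions, and it ignores the LID vs. LIDVID distinction discussed earlier):

```python
def validate_in_order(products):
    """Return (passed, failures), failing every later claimant of a file.

    `products` is an ordered iterable of (lidvid, [file names]) pairs. The
    first product to reference a file is treated as its rightful owner.
    """
    first_owner = {}  # file name -> LIDVID that referenced it first
    passed, failures = [], []
    for lidvid, files in products:
        # Files already claimed by an earlier product, with that owner.
        clashes = [(f, first_owner[f]) for f in files if f in first_owner]
        for f in files:
            first_owner.setdefault(f, lidvid)
        if clashes:
            failures.append((lidvid, clashes))
        else:
            passed.append(lidvid)
    return passed, failures
```

With this approach each failing product can carry its own error naming the file and the product that claimed it first, so no separate global summary section is needed.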

github-project-automation bot moved this from Backlog to 🏁 Done in EN Portfolio Backlog on Nov 29, 2023
github-project-automation bot moved this from Release Backlog to 🏁 Done in B14.1 on Nov 29, 2023
jordanpadams added a commit to NASA-PDS/ldd-manager that referenced this issue Dec 19, 2023
Per [this new requirement implemented in validate](NASA-PDS/validate#755), having many labels point to the same data file will cause validate to fail.

When we are testing with LDDs, we don't care about that, so let's validate each file individually.

5 participants