As a user, I want validate to throw an error when a file is being referenced by more than one label #755
Comments
@tbarnes4 can you provide the location within the standards where this rule is described?
@jordanpadams Funny about that. I was trying to find it in the standards, and the closest I got was Section 3 Labels, par. 5 starting with "Under PDS4, all product labels ...", that says there should be only one label for each product. I cornered Anne and she confirmed to me that each object can only be described once, not multiple times. She did ponder though how the registry would handle multiple xml files pointing to the same file object.
@tbarnes4 Copy. So to clarify, if you are a data provider, the better route for handling this supplemental file would have been to describe it as some other data object (e.g. Product_Browse) and then reference it from the
The Registry doesn't care. It will just load the metadata and the pointer, and it will have a file_ref with a URL to the supplemental file.
I appreciate what is being requested and can partially support it. It will require a sweeper as well to make it complete. For instance, if validate fails with a.xml and b.xml when given in a single command, there is nothing to stop the user from passing both with validate a.xml then validate b.xml - not implying maliciousness but just need to get it done to meet a deadline or just wrote the script to do one xml at a time. The only way to truly prevent two labels pointing at the same file is to do a partial check in validate, then do it again at harvest, then do it again with a sweeper for those that evade the fool-proof previous checks. Does this apply to LID or LIDVID? In other words, can urn::1.0 point at the same file as urn::1.1?
@al-niessner 100% agree on the sweeper. we eventually want to provide much more comprehensive referential integrity checks behind the scenes. for now, we just want to do this check on the file system.
Great question. @tbarnes4 any idea if new versions of a label can point to the same product? for instance, what about when someone updates a label, but doesn't touch the data? it seems silly to have to create an entirely new version of the product file.
Yes, this should be fine and validate should not flag this as an issue. As @jordan mentioned, label-only changes will happen a lot. But in this case, I would assume (back to this again) that within a collection version or latest bundle version, a product LID would only ever be used once. Would or should there ever be a case where a LID is used more than once (but with a different LIDVID) within a single collection version? I would imagine it should never be the case (but what is stopping that?). And I think that is where the scope of validate (at this time) should end, within a collection version. Thinking out loud here, so if I validate a collection or a bundle or folder (?), validate should expect that each file has only one LID associated with it per collection LIDVID (in the case a folder or bundle contains multiple versions of a collection within its directory structure). Thinking of other possible situations, is it possible a file in a future collection version would have a different LID pointing to a file? For instance in collection vX.0 LID-A points to file M and file N, but in collection vY.0, LID-A no longer points to file N, but LID-B now points to file P and file N. How would validate resolve this situation if it was validating a folder or bundle that had both vX.0 and vY.0 of the collection? Sanity question, does validate check that there is only one case of a LID (LIDVID too?) being part of a collection, in other words a duplicate LID usage? Is anything stopping a bundle version from having a collection listed multiple times but with multiple versions (LIDVIDs)?
If validate is asked to validate a single product, I do not think it should query the universe to see if another label is pointing to the same file on my local file system (on the cloud is an interesting question though). This kind of check should really only be done as a kind of referential integrity check of a collection or bundle and perhaps only warn users (at most) if they are validating multiple products at once outside the collection/bundle checks.
Never mind, I think I can think of a use case (haven't checked the standards yet to know if this is allowed or not). In the case where we are migrating data sets from PDS3 where it is important to preserve multiple versions of a document. Instead of creating multiple versions of a collection, just include all versions of the LID within the first released collection. So in this case we could have multiple instances of a LID, but no duplicate LIDVID. But this case is different from the thread above because in this instance, we would have multiple different document files, not a single version of a document with multiple labels.
copy. so to clarify, we are saying that multiple versions of a LID can reference the same data file, but 2 different LIDs cannot reference the same file.
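The rule as summarized here can be sketched in a few lines. This is only an illustration in Python, not how validate (a Java tool) implements it; the function names and the `(lidvid, file_path)` input shape are assumptions made for the example. It treats everything before `::` in a LIDVID as the LID, so several versions of one LID sharing a file are accepted, while two different LIDs sharing a file are reported.

```python
from collections import defaultdict


def lid_of(lidvid: str) -> str:
    """Return the LID part of a LIDVID (everything before the '::' version suffix)."""
    return lidvid.split("::", 1)[0]


def files_with_conflicting_lids(references):
    """Find data files referenced by more than one distinct LID.

    references: iterable of (lidvid, file_path) pairs (hypothetical input shape).
    Returns {file_path: sorted list of LIDs} for the offending files only.
    """
    lids_by_file = defaultdict(set)
    for lidvid, file_path in references:
        lids_by_file[file_path].add(lid_of(lidvid))
    return {
        path: sorted(lids)
        for path, lids in lids_by_file.items()
        if len(lids) > 1
    }
```

With this grouping, `a::1.0` and `a::1.1` pointing at the same file produce no entry, because both collapse to the single LID `a`.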
Yes, that is what I'd lean towards. Thanks for talking/thinking it through with me.
One more complication that may or may not be realistic: Collection/Bundle reference via LID labels a.xml and b.xml. a.xml versions 1.0 -> 2.1 inclusive have a file area for snafu.csv. The decision is then to move snafu.csv from a.xml to b.xml, making them versions 3.0 and 2.0 respectively. Since the bundles and collections only use the LID, there is no way to do the processing other than checking the latest version only. In other words, when a LID is given instead of a LIDVID, always use the latest version. Does this produce the desired behavior?
I don't know. I would expect validate to check all files and all LID versions, as limited by what is provided to validate and what options are invoked with validate. I cannot foresee a situation where within a single collection version, where there would be multiple VIDs for a single LID, AND snafu.csv is moving from one LID to another LID. In this case we would have to ensure that each collection version has only one LID per object file. Correct me if I'm wrong, but collections must use LIDVIDs in their inventory file. Only bundles have the option of providing LIDs or LIDVIDs for collections, though I thought we were in the process of forcing bundles only use LIDVIDs of collections.
@al-niessner @tbarnes4 if someone points at a specific bundle.xml with the pds4.bundle rule enabled, I agree if they reference a collection by LID we should just look at the latest versions of the collections and check from there. Past versions of the collections shouldn't matter in this context. If someone wants to check past versions of a collection, they can point at that collection and use the pds4.collection rule.
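The "latest version wins" behaviour for LID-only references could look roughly like the following. This is a minimal Python sketch, not validate's code; `parse_vid` and `latest_lidvids` are hypothetical names, and it assumes every VID has the two-part `major.minor` form. Versions are compared as integer pairs so that 1.10 sorts after 1.9.

```python
def parse_vid(vid: str):
    """Turn a 'major.minor' version id into a comparable (int, int) tuple."""
    major, minor = vid.split(".")
    return int(major), int(minor)


def latest_lidvids(lidvids):
    """Map each LID to its highest-versioned LIDVID.

    lidvids: iterable of strings of the form '<LID>::<major>.<minor>'.
    """
    latest = {}
    for lidvid in lidvids:
        lid, vid = lidvid.split("::", 1)
        version = parse_vid(vid)
        if lid not in latest or version > latest[lid]:
            latest[lid] = version
    return {
        lid: f"{lid}::{major}.{minor}"
        for lid, (major, minor) in latest.items()
    }
```

A duplicate-file check that runs only over the LIDVIDs returned here would ignore older label versions, which matches the idea that past versions of a collection should not matter in this context.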
@jordanpadams Should NASA-PDS/validate#245 be linked here (to NASA-PDS/validate#755)?
After plotting a strategy yesterday, it mostly worked today with some tactical adjustments. It is leaving me with a really big problem: how to report the error meaningfully. The whole job cannot be done until all file areas have been found. Since there is no way to know, while validating a label, what the last label will be, I had to make the check global (post apache chain stuff). Here is what the summary looks like for `validate -t m221011.0013.xml m221011.0014.xml m221011.0030.xml`:
The error is this:
The problem I cannot resolve is how to signal the failure. The summary says 3 passed with 1 error. That in itself seems odd. Do I go back and subtract the number of labels with dups from passed? Do we have a new summary section called "Global Validation Summary"? I like the latter because, just as referential integrity does not bleed into product, neither should global. What is your preference?
@al-niessner so if we are going through the files and we read So as you read Is it possible for us to do that? |
Per [this new requirement implemented in validate](NASA-PDS/validate#755), having many labels point to the same data file will cause validate to fail. When we are testing with LDDs, we don't care about that, so let's validate each file individually.
Checked for duplicates
Yes - I've already checked
User Persona(s)
All the above; anyone that is validating collections and bundles or a directory of products.
Motivation
According to the standards, a file should be described only once by a label, and then referenced with a LID if multiple files need to point to it. Currently validate does not check for this. We recently found a case in dart_teleobs:data_mroraw where a single file was described by a File_Area_Observational_Supplemental within multiple xml labels. No one thought to check for this, but I would hope it would be easy to check, especially since we already check that each file has a label.
Very similar in nature to #245
Additional Details
Example: https://pdssbn.astro.umd.edu/holdings/pds4-dart_teleobs:data_mroraw-v1.0/mro_221011/
The file "comps_221011.png" is listed in the File_Area_Observational_Supplemental for 30 files in this directory, the first three of which are listed below:
Acceptance Criteria
Given a set of label files where >1 of them references the same file from File_Area_*
When I perform validate --target product1.xml product2.xml etc.
Then I expect validate to throw an error
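To make the acceptance criteria concrete, here is a rough Python sketch of the check, independent of validate's actual Java implementation. The helper names are hypothetical. It assumes each label stores its data file names in `File/file_name` elements directly under a `File_Area_*` element and that those names are relative to the label's directory. Elements are matched by local name so the sketch does not depend on a particular namespace URI. It deliberately ignores the LID/LIDVID refinement discussed in the comments and simply reports any file named by more than one label.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag: str) -> str:
    """Strip an '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def referenced_files(label_path):
    """Yield normalized paths of files named inside File_Area_* elements of a label."""
    label_dir = os.path.dirname(os.path.abspath(label_path))
    root = ET.parse(label_path).getroot()
    for area in root.iter():
        if not local_name(area.tag).startswith("File_Area_"):
            continue
        for child in area:
            if local_name(child.tag) != "File":
                continue
            for field in child:
                if local_name(field.tag) == "file_name" and field.text:
                    name = field.text.strip()
                    yield os.path.normpath(os.path.join(label_dir, name))


def find_shared_files(label_paths):
    """Return {data_file: sorted labels} for files referenced by more than one label."""
    labels_by_file = defaultdict(set)
    for label in label_paths:
        for data_file in referenced_files(label):
            labels_by_file[data_file].add(label)
    return {
        data_file: sorted(labels)
        for data_file, labels in labels_by_file.items()
        if len(labels) > 1
    }
```

A caller would treat a non-empty result from `find_shared_files(["product1.xml", "product2.xml"])` as the error condition described above.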
Engineering Details
When keeping inventory of the files that have been referenced, we can drop the inventory when we move to another directory.
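One way to picture the per-directory inventory is the Python sketch below. It is an assumption-laden illustration rather than validate's implementation: `check_by_directory` is a hypothetical name, the input is taken to be `(label_path, data_file_name)` pairs, and labels are assumed to reference only files in their own directory, which is what makes dropping the inventory between directories safe.

```python
import os
from collections import defaultdict
from itertools import groupby


def check_by_directory(references):
    """Report files referenced by more than one label, one directory at a time.

    references: iterable of (label_path, data_file_name) pairs (hypothetical shape).
    Returns a list of (directory, data_file_name, sorted label paths) tuples.
    """
    def directory_of(ref):
        return os.path.dirname(ref[0])

    errors = []
    ordered = sorted(references, key=directory_of)
    for directory, group in groupby(ordered, key=directory_of):
        # A fresh inventory per directory: the previous one is simply discarded.
        inventory = defaultdict(set)
        for label, data_file in group:
            inventory[data_file].add(label)
        for data_file, labels in sorted(inventory.items()):
            if len(labels) > 1:
                errors.append((directory, data_file, sorted(labels)))
    return errors
```

Because the inventory never spans directories, memory use is bounded by the largest single directory, at the cost of missing duplicates that cross directory boundaries.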
I&T
TestRail Test ID: T8681202