It should be clear which version is a CSIP/SIP/AIP/DIP compatible with #703

Open
3 tasks
luis100 opened this issue Jul 5, 2023 · 14 comments

@luis100

luis100 commented Jul 5, 2023

Related to comment keeps/commons-ip#166 (comment)

When validating a CSIP/SIP/AIP/DIP, it is difficult to know which version of the specification to validate against, or at least which version of the specification a package was created for.

  • The CSIP/SIP/AIP/DIP METS profiles are not versioned, so they point to the latest version. If a profile changes, a CSIP that was created for a previous profile will now point to a different profile.
  • To my knowledge, it is not stated anywhere whether backwards compatibility within a major version is guaranteed. This would be a valuable addition.
  • Even if we guarantee backwards compatibility between minor versions of the specification, sooner or later we will need to break compatibility with a new major version, as we did from E-ARK CSIP 1.0 to 2.0. But there is no clear and obvious way to state which major version an E-ARK CSIP was created for. This should be clear, and I recommend including some element/attribute or METS profile versioning in the next major version (3.0).
@luis100
Author

luis100 commented Jul 5, 2023

Use cases:

  1. A SIP package was created for specification version 2.1.0, and it is not compatible with 2.0.4. This package is to be submitted into an archive which supports 2.0.4 but not 2.1.0.
  2. A SIP package was created for specification version 2.1.0 but it is also compatible with 2.0.4. This package is to be submitted into an archive which supports 2.0.4 but not 2.1.0.

@karinbredenberg
Contributor

That is usually achieved by having the profile pointer go to the correct version of the profile. So instead of pointing to the general one, point to: https://earkcsip.dilcis.eu/profile/E-ARK-CSIP-v2-0-4.xml
The METS_Profile:@id has been used to give the version number.
For the different CITS there are version numbers included in the values used in CSIP4.

@prettybits

While that would be one way to mark the intended version this doesn't seem to be reliably possible at the moment as I can't find an equivalent versioned document URL for the profiles of v2.1.0 of the specifications? Only "archived" versions of the profiles of previous versions of the specifications get versions in their names but the current version would need to have that as well. It would be good to have that and a general requirement to use stable URLs as part of CSIP6.

On the side of the profile itself, is it documented anywhere that the @ID attribute of the METS_Profile root element should be used to hold the version? METS Profiles as currently documented don't say anything about intended values besides specifying it as a generic xml:id and there aren't any explicit provisions about versioning as stated in the section on Related profiles:

A profile may indicate its relationship with other METS profiles. METS profiles are not explicitly versioned, as implementations may exist that use older editions of METS profiles. Therefore a new version of a profile must be registered as a new profile. In this case, the RELATIONSHIP attribute should be used to indicate that a profile supersedes a profile already registered with the Library of Congress Network Development and MARC Standards Office. For each related profile, the profile should specify a URI for the related profile and the nature of the relationship between the current profile and the related profile.

The URI element in the profiles also currently doesn't use versioned URLs regardless of URL name, as a consequence of the aforementioned:

<URI LOCTYPE="URL" ASSIGNEDBY="local">https://earkcsip.dilcis.eu/profile/E-ARK-CSIP.xml</URI>

This is equally the case for the vocabularies which I think should be thought of as bound to particular versions as well?

For the different CITS there are version numbers included in the values used in CSIP4.

Some of the allowed values contain version strings, I assume the other entries at the beginning are all legacy and shouldn't be used for new packages? If additional (validation) resources are available (e.g. an XML schema for the CITS ERMS v2.1.0) the same considerations should apply I believe.

@luis100
Author

luis100 commented Jul 24, 2023

In my opinion, pointing to the patch version does not make much sense, but the specific version should be stated.
Just noting that the PREMIS schema seems to do a good job with namespace and version.

https://www.loc.gov/standards/premis/v3/premis-v3-0.xsd

The namespace is "http://www.loc.gov/premis/v3" which identifies the major version, and we assume schemas will maintain retrocompatibility within the major version. Also PREMIS states the version on an attribute.

<xs:complexType name="premisComplexType">
  <xs:sequence>
    <xs:element ref="object" maxOccurs="unbounded"/>
    <xs:element ref="event" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="agent" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="rights" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute name="version" type="version3" use="required"/>
</xs:complexType>

<!--
************** version definition
-->
<xs:simpleType name="version3">
  <xs:restriction base="xs:string">
    <xs:enumeration value="3.0"/>
  </xs:restriction>
</xs:simpleType>

The same could be accomplished in METS profile, pointing to the METS profile major version and adding an attribute to state the specific version used.

In the meantime @karinbredenberg's solution could be used, but as @prettybits states, we should publish the latest version under its own version name and add information about the intended use in the specification.

@carlwilson
Collaborator

The specifications have a fair number of "moving parts" some of which are versioned, some of which are not. Even with the existing state of play, this causes problems as:

  • There is no definitive way of indicating the version of the E-ARK specification information package used.
  • There is no versioning of the supporting schema and vocabularies, making change control impossible.

The official registry of METS profiles does not support profile versioning. See https://www.loc.gov/standards/mets/profile_docs/mets.profile.v2-0.html#related_profile. This means the profile has no convenient attribute to record the version.

E-ARK uses the optional ID attribute of the root element, which isn't used consistently:

  • CSIP <METS_Profile ID="2.1.0">
  • SIP <METS_Profile ID="SIPV2.1.0">
  • DIP <METS_Profile ID="DIPV2.1.0">

Old versions of the IP specifications are archived and made available under versioned file names. Using the CSIP as an example:

This approach has been introduced as needed and is inconsistent, at least where the current version is concerned.

The versioning of METS Profiles is made trickier by their lack of support for version numbering. Creating an extension to the profile schema to do so seems an extreme solution. Continuing to use the ID attribute is the simplest solution, but it should be used consistently. The proposal is that all specifications record ONLY the version number in the root METS_Profile element ID attribute: <METS_Profile ID="V2.1.0">. Parsers of the version number should be agnostic to case, and indeed, any prefix due to the existing instances. The version number should be a valid XML ID value.

The URI/URL for the profiles should reflect the version number in the URL path, as should every other resource. This should be consistently applied across all URL forms unless there is a compelling reason not to do so. A proposed style for the URL is to version for the major and minor numbers only.

PROPOSED CHANGES

All profiles to use the root element ID attribute consistently:

  • CSIP <METS_Profile ID="V2.1.0">
  • SIP <METS_Profile ID="V2.1.0">
  • DIP <METS_Profile ID="V2.1.0">

Treatment of the leading v should not take case into account, so for the above examples v2.1.0 and V2.1.0 should parse successfully. In practice it will be necessary to remove all alpha prefixes due to the existing longer ones that include the package type, e.g. SIPV2.1.0.
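
A minimal sketch of such a tolerant parser, assuming the ID always ends in a three-part MAJOR.MINOR.PATCH number and that any leading letters (v, V, SIPV, DIPV) can be discarded. The function name `parse_profile_id` is hypothetical and not part of any E-ARK tooling:

```python
import re

# Optional alphabetic prefix (any case), then MAJOR.MINOR.PATCH.
_PROFILE_ID_RE = re.compile(r"^[A-Za-z]*(\d+)\.(\d+)\.(\d+)$")


def parse_profile_id(profile_id: str) -> tuple:
    """Return (major, minor, patch) from a METS_Profile ID value.

    Hypothetical helper: accepts "2.1.0", "v2.1.0", "V2.1.0" and the
    existing prefixed forms such as "SIPV2.1.0".
    """
    match = _PROFILE_ID_RE.match(profile_id.strip())
    if match is None:
        raise ValueError(f"not a recognisable profile version: {profile_id!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


for value in ("2.1.0", "v2.1.0", "V2.1.0", "SIPV2.1.0", "DIPV2.1.0"):
    print(value, "->", parse_profile_id(value))
```

All five spellings yield the same tuple, so existing profiles would keep parsing if the IDs were later normalised to the V2.1.0 form.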

The URI/URL for the profiles should reflect the version number in the URL path, as should every other resource. This should be consistently applied across all URL forms unless there is a compelling reason not to do so. A proposed style for the URL is to use the explicit version in the filename part of the path and add a major version to the path part, similar to PREMIS listed by @luis100 above.

https://earkcsip.dilcis.eu/profile/v2/ will resolve to a directory listing showing all the current version 2 profiles.

Each version of a profile will reside under the appropriate major version and include the full version number in the file name, e.g. below the v2 directory above:

ALL references to a profile must include a full version number to avoid ambiguous references.

The supporting extension schema, vocabularies and vocabulary RNG schema will be versioned in an identical manner, e.g.

While we don't have different versions of these the changes will be made to retain consistency and open the possibility of future new versions. The extension schema namespace URI WILL NOT be versioned.

The vocabularies will be versioned as follows:

and so on. For completeness, the underlying vocabulary RNG schema will be versioned as well. There are currently no plans to amend these.

@karinbredenberg
Contributor

karinbredenberg commented Sep 8, 2023

The issue is going to be discussed by the DILCIS Board

@shsdev
Contributor

shsdev commented Nov 17, 2023

Additional comments received from @prettybits during the Salamanca event.

It should be clear exactly which version of METS Profiles was used. For this purpose, it would be good to have a standalone attribute for the METS Profile ID. The profiles should also always state the exact version number; a "latest" URL should not be used in a METS XML, as it will then be unclear - especially after a while - which version was used.

@jmaferreira

jmaferreira commented Nov 21, 2023

Hi everyone,

I read through all of the comments and I have to say that I'm a bit lost here. In my understanding, this topic has to do with choosing the correct package validation procedure. When processing an information package (IP) we need to parse the METS file placed in the root folder of the IP, and to accomplish that we need to know which rules it should be validated against.

This means that the METS file should clearly identify the METS Profile it adheres to so that the right validator/parser is invoked by the validation process.

The way this is done is by specifying the value of the <mets @PROFILE> attribute on the METS file and not by making changes to the METS Profile itself.

In the following example found on the SIP GitHub repository (https://github.com/DILCISBoard/E-ARK-SIP/tree/rel/v2.1.1/examples) we can see that the profile that is identified to validate this METS file is http://www.dasboard.eu/specifications/sip/v03/METS.xml (obsolete BTW).

<mets 
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
	xmlns="http://www.loc.gov/METS/" 
	xmlns:xlink="http://www.w3.org/1999/xlink" 
	OBJID="a46ab3d0-c710-4d73-b58d-e93e30b53a80" 
	TYPE="ERMS" 
	xlink:CONTENTTYPESPECIFICATION="SMURFERMS" 
	PROFILE="http://www.dasboard.eu/specifications/sip/v03/METS.xml" 
	xsi:schemaLocation="http://www.loc.gov/METS/ schemas/mets.xsd" 
	LABEL="root level METS file for an IP">
	<metsHdr CREATEDATE="2017-01-31T13:07:22.6970809+02:00" RECORDSTATUS="NEW" LASTMODDATE="2017-01-31T13:07:22.6970809+02:00">
...

I would argue that this should be used as the token to identify the right METS parser. I don't think it should be necessary to get the linked METS profile and inspect its ID to get the token that we need.
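
As a rough illustration of using the attribute as the token, the sketch below reads PROFILE from the root element with Python's standard `xml.etree.ElementTree`. The helper name `read_mets_profile` and the trimmed sample document are assumptions made for the example, not part of any existing validator:

```python
import xml.etree.ElementTree as ET
from typing import Optional

METS_NS = "http://www.loc.gov/METS/"

# Trimmed-down version of the root element shown above (illustrative only).
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      OBJID="a46ab3d0-c710-4d73-b58d-e93e30b53a80"
      PROFILE="http://www.dasboard.eu/specifications/sip/v03/METS.xml"/>"""


def read_mets_profile(mets_xml: str) -> Optional[str]:
    """Return the PROFILE attribute of the root mets element, if present."""
    root = ET.fromstring(mets_xml)
    if root.tag != f"{{{METS_NS}}}mets":
        raise ValueError(f"unexpected root element: {root.tag}")
    # PROFILE is an un-namespaced attribute, so a plain lookup is enough.
    return root.get("PROFILE")


print(read_mets_profile(SAMPLE_METS))
```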

That being said, I believe that having a well-defined @id attribute on the METS profile is a good idea and something we should work on; I just don't think it is mandatory to fulfil the original goal of this issue.

The METS file generators out there should be enhanced to clearly identify the METS profile URL. The existing guides and specifications should stress this in their text.

@luis100 Am I missing something here?

@luis100
Author

luis100 commented Nov 21, 2023

I think you are missing some aspects. One is that profiles were not being published with an explicit version; they were renamed with the version only when archived (i.e. when a new version came out), so the URL for the latest version was always versionless, which creates issues. This is the profile/spec publication part of the issue.

Also, the METS Profile version and "METS parser", or generally the Specification version and Specification validator may not be one-to-one. There are changes in the specification and in the METS Profile that should be versioned but that don't necessarily create a new validator implementation. Generally, PATCH versions should not require changes in the validator implementation.
So we could point to the MINOR version: instead of pointing to https://earkcsip.dilcis.eu/profile/v2/E-ARK-CSIP-v2-0-4.xml we could point to https://earkcsip.dilcis.eu/profile/v2/E-ARK-CSIP-v2-0.xml and have the profile itself keep its PATCH version and changelog.
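
To make the naming scheme concrete, here is a small sketch that derives both URL forms from a version string. It assumes the layout discussed in this thread (a major-version folder plus a hyphenated version in the file name); that layout is a proposal, not a confirmed publication scheme, and the function names are hypothetical:

```python
def full_profile_url(version: str,
                     base: str = "https://earkcsip.dilcis.eu/profile",
                     name: str = "E-ARK-CSIP") -> str:
    """URL carrying MAJOR, MINOR and PATCH (assumed layout)."""
    major, minor, patch = version.split(".")
    return f"{base}/v{major}/{name}-v{major}-{minor}-{patch}.xml"


def minor_profile_url(version: str,
                      base: str = "https://earkcsip.dilcis.eu/profile",
                      name: str = "E-ARK-CSIP") -> str:
    """URL carrying only MAJOR and MINOR; PATCH stays inside the profile."""
    major, minor, _patch = version.split(".")
    return f"{base}/v{major}/{name}-v{major}-{minor}.xml"


print(full_profile_url("2.0.4"))
print(minor_profile_url("2.0.4"))
```

With the minor-level form, packages created for 2.0.3 and 2.0.4 would carry the same PROFILE value, which fits the expectation that PATCH releases do not need a new validator.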

The other issue is the "flavour" we are validating: are we validating CSIP, SIP, AIP, CITS-SIARD, CITS-ERMS? Which versions of which? We want a hierarchical validation structure, but how do we identify the versions consistently? For example, CITS-SIARD needs to use the PROFILE https://citssiard.dilcis.eu/profile/E-ARK-SIARD-ROOT.xml, which does not have a version, nor does it identify whether the package is a SIP/AIP/DIP or which spec version of those it follows.

Finally, a side note: all implementation suggestions here point to including this in the METS profile URL that is inside the METS file. Please note that knowing the profiles and versions is necessary to select the piece of code that will validate this file, so we will need to read the file twice: once to get the package type and spec version (although the file could be empty or malformed XML), and again to actually do the validation. Any format detection tool (like DROID/PRONOM) will also need to use this information to identify the format and version of the file (if zipped).
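
The two-pass flow could look roughly like the sketch below: a cheap first pass that stops at the root element and tolerates empty or malformed input, followed by a lookup of the validator to run. The registry, the `Validator` type and both function names are hypothetical; only `xml.etree.ElementTree.iterparse` is a real standard-library call:

```python
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional

Validator = Callable[[bytes], bool]  # hypothetical validator signature


def peek_profile(mets_bytes: bytes) -> Optional[str]:
    """First pass: return the root PROFILE value, or None if unreadable."""
    try:
        events = ET.iterparse(io.BytesIO(mets_bytes), events=("start",))
        for _event, element in events:
            # The first "start" event is the root element; stop there.
            return element.get("PROFILE")
    except ET.ParseError:
        # Empty file or XML that is broken before the root element.
        return None
    return None


def select_validator(mets_bytes: bytes,
                     registry: Dict[str, Validator]) -> Optional[Validator]:
    """Pick the validator registered for the package's PROFILE URL."""
    profile = peek_profile(mets_bytes)
    if profile is None:
        return None
    return registry.get(profile)
```

The second pass (the actual validation) would then call the returned function on the same bytes. A package whose PROFILE is missing or unknown yields `None`, which the caller has to treat as "cannot determine version".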

@karinbredenberg
Contributor

@carlwilson please look at this

@jmaferreira

@karinbredenberg In order to vote for an approach on this topic, we really need to nail down the options available for voting. We will not have time during the DILCIS Board meeting to come up with ways of addressing this issue. The options available for voting should be clear to everyone.

@karinbredenberg
Contributor

karinbredenberg commented Jan 9, 2024

The suggestion is in Carl's description:

The METS profile file names include the version number from the outset, within a folder structure where the major version number is the folder for all its subversions. The ID attribute in the profile root element is used to give the version in all of them. The same is to be implemented for the extension schema and vocabularies.

@karinbredenberg
Contributor

karinbredenberg commented Jan 16, 2024

The suggestion is:

  • All METS profile files are named with their version number
  • All METS profiles use the @ID in their root element to give the version
  • A folder structure is created to make it easier to find a version
  • The extension schema file is named with the METS profile version it supports
  • The extension schema file is placed in a folder structure like that of the METS profiles
  • The reference to the extension schema in the schemaLocation of the METS document will need to point to the correct extension schema
  • The vocabulary files are named with the METS profile version they support
  • The vocabulary schema file is named with its version
  • The vocabulary schema is updated with an element to allow a version number to be added to a vocabulary
  • CSIP6 gets a textual update stating that the version number must be part of the profile URL that is given

Board members acknowledgment of the issue:
Tick the box in front of your name to indicate that you have looked at the suggestion.

  • Karin Bredenberg (Kommunalförbundet Sydarkivera, chair)
  • Anders Bo Nielsen (National Archives of Denmark)
  • Anja Paulič (National Archives of Slovenia)
  • Arne-Kristian Groven (National Archives of Norway)
  • Gregor Zavrsnik (Geoarh)
  • Janet Anderson (Highbury Research & Development Ltd.)
  • Maya Bangerter (Swiss Federal Archives)
  • Miguel Ferreira (KEEP Solutions)
  • Stephen Mackey (Penwern Limited)
  • Sven Schlarb (Austrian Institute of Technology)

Voting
(Decision making will be carried out on the basis of majority voting by all eligible members of the Board. In the case of a tied vote, decisions will be made at the discretion of the Chair)

Tick the box in front of your name to say yes to the suggestion.

  • Karin Bredenberg (Kommunalförbundet Sydarkivera, chair)
  • Anders Bo Nielsen (National Archives of Denmark)
  • Anja Paulič (National Archives of Slovenia)
  • Arne-Kristian Groven (National Archives of Norway)
  • Gregor Zavrsnik (Geoarh)
  • Janet Anderson (Highbury Research & Development Ltd.)
  • Maya Bangerter (Swiss Federal Archives)
  • Miguel Ferreira (KEEP Solutions)
  • Stephen Mackey (Penwern Limited)
  • Sven Schlarb (Austrian Institute of Technology)

@carlwilson carlwilson self-assigned this Jan 17, 2024
@karinbredenberg
Contributor

7 DILCIS Board members have acknowledged the issue
6 DILCIS Board members agree with the solution

The suggested solution will be included in the next version.

@karinbredenberg karinbredenberg added this to the CSIP version 2.2 milestone Feb 14, 2024