Skip to content

Latest commit

 

History

History
427 lines (317 loc) · 16.4 KB

README.md

File metadata and controls

427 lines (317 loc) · 16.4 KB

E-ARK IP validation and manipulation tool and library

This project provides a command-line interface and Java library to validate and manipulate OAIS Information Packages of different formats: E-ARK (version 1, 2.0.4, 2.1.0, 2.2.0), BagIt, and Hungarian type 4 SIP.

The E-ARK Information Packages are maintained by the Digital Information LifeCycle Interoperability Standards Board ( DILCIS Board). DILCIS Board is an international group of experts committed to maintaining and sustaining a set of interoperability specifications which allow for the transfer, long-term preservation, and reuse of digital information regardless of the origin or type of the information.

More specifically, the DILCIS Board maintains specifications initially developed within the E-ARK Project (02.2014 - 01.2017):

  • Common Specification for Information Packages
  • E-ARK Submission Information Package (SIP)
  • E-ARK Archival Information Package (AIP)
  • E-ARK Dissemination Information Package (DIP)

The DILCIS Board collaborates closely with the Swiss Federal Archives in regard to the maintenance of the SIARD ( Software Independent Archiving of Relational Databases) specification.

For more information about the E-ARK Information Packages specifications, please visit http://www.dilcis.eu/

Installation

Requirements

  • Java (>= 17)

Download the latest release to use as a tool or check below how to use it as a Java Library.

Usage

You can use the commons-ip as a command-line tool or a Java library.

Use as a command-line tool

To use the commons-ip command-line tool, need to download the latest release. This tool can validate a SIP or create a valid EARK2 SIP.

To validate an SIP have to use the following options:

  • validate, [REQUIRED] This option is for the CLI to know that is to perform the validation of a SIP.
  • -i, or --inputs [REQUIRED] Path(s) to the SIP(s) archive file or folder(s).
  • --specification-version, [OPTIONAL] E-ARK CSIP version (Default: 2.1.0).
  • -o, or --output-report-dir [OPTIONAL] Path to the Directory where you want to save the validation report.
  • -r, or --reporter-type [OPTIONAL] The type of report (possible values: commons-ip, eark-validator)
  • -v, or --verbose [OPTIONAL] Verbose option (Will print all validation steps)
  • -h, [OPTIONAL] Display this help and exit.

To create an EARK-2 SIP have to use the following options:

  • create, [REQUIRED] This option is for the CLI to know that is to perform the creation of an EARK-2 SIP.

  • -doc or --documentation, [OPTIONAL] Path(s) to folders or files to add in the documentation of SIP.

  • -p or --path, [OPTIONAL] Path to save the SIP.

  • -a or --ancestors, [OPTIONAL] ID(s) of the ancestors of the SIP.

  • -C or --checksum, [OPTIONAL] Checksum algorithm (Default: SHA-256).

  • -d or --documentation, [OPTIONAL] Path(s) to documentation file(s).

  • -T or --target-only, [OPTIONAL] Adds only the files for the representations.

  • -v or --version, [OPTIONAL] E-ARK SIP specification version (Default: 2.1.0)

  • --submitter-name, [OPTIONAL] The name of the submitter agent.

  • --submitter-id, [OPTIONAL] The identification code (ID) of the submitter agent.

  • --sip-id, [OPTIONAL] ID of the SIP.

    • --override-schema, [OPTIONAL] Overrides default schema.
    • -s or --strategy, [OPTIONAL] Write strategy to be used (Default: Zip)

This is the descriptive metadata section:

  • --metadata-file [REQUIRED] Path to descriptive metadata file.
  • --metadata-type, [REQUIRED] Descriptive metadata type.
  • --metadata-schema, [OPTIONAL] Path to descriptive metadata schema file.
  • --metadata-version, [OPTIONAL] Descriptive metadata version.

NOTE: If does not give the metadata version, the tool tries to obtain these values from the file name in the following formats (file: ead_2002.xml -> result: metadata version: 2002)

This is the representation section:

  • --representation-data, [REQUIRED] Path to representation file.
  • --representation-id, [OPTIONAL] Representation identifier. If not set a default value of rep will be used.
  • --representation-type, [OPTIONAL] Representation type

Examples:

Full create SIP command with long options:

java -jar commons-ip-cli-2.X.Y.jar create --metadata-file metadata.xml --metadata-type ead --metadata-schema ead2002.xsd \
--representation-data "dataFile1.pdf,dataFolder1,dataFile2.png,Mixed,representation1" \
--sip-id sip1 --ancestors sip2,sip3 --documentation documentation1,documentationFolder --path folder2 --submitter-name agent1 --submitter-id 123
java -jar commons-ip-cli-2.X.Y.jar validate -i sip1.zip
java -jar commons-ip-cli-2.X.Y.jar validate -i sip1.zip sip2.zip -o output/

Output Example

The report generated by the validator is in JSON format and has the following structure:

{
  "header": {
    "title": "Validation Report CSIP",
    "specifications": [
      {
        "id": "CSIP-2.0.4",
        "url": "https://github.com/DILCISBoard/E-ARK-CSIP/releases/tag/v2.0.4"
      },
      {
        "id": "SIP-2.0.4",
        "url": "https://github.com/DILCISBoard/E-ARK-SIP/releases/tag/v2.0.4"
      }
    ],
    "version_commons_ip": "2.0.0-alpha3-SNAPSHOT",
    "date": "2021-09-24T13:30:26.203+01:00",
    "path": "${HOME}/Full-EARK-SIP.zip"
  },
  "validation": [
    {
      "specification": "CSIP-2.0.4",
      "id": "CSIPSTR1",
      "name": "CSIP Information Package folder structure",
      "location": "",
      "description": "Any Information Package MUST be included within a  single physical root folder (known as the “Information Package root folder”). For packages presented in an archive format, see CSIPSTR3, the archive MUST unpack to a single root folder.",
      "cardinality": "",
      "level": "MUST",
      "testing": {
        "outcome": "PASSED",
        "issues": [],
        "warnings": [],
        "notes": []
      }
    },
    {
      "specification": "CSIP-2.0.4",
      "id": "CSIP32",
      "name": "Digital provenance metadata",
      "location": "mets/amdSec/digiprovMD",
      "description": "For recording information about preservation the standard PREMIS is used. It is mandatory to include one <digiprovMD> element for each piece of PREMIS metadata. The use if PREMIS in METS is following the recommendations in the 2017 version of PREMIS in METS Guidelines.",
      "cardinality": "0..n",
      "level": "SHOULD",
      "testing": {
        "outcome": "FAILED",
        "issues": [],
        "warnings": [
          "It is mandatory to include one <digiprovMD> element in Root METS.xml for each piece of PREMIS metadata"
        ],
        "notes": []
      }
    }
  ],
  "summary": {
    "success": 120,
    "warnings": 3,
    "errors": 5,
    "skipped": 33,
    "notes": 8,
    "result": "INVALID"
  }
}

Available properties

Several properties are available to modify default behaviours.

Properties are added to the command line like this:

... -Dproperty=value -DanotherProperty=othervalue ...

Note: in Windows, each property and value pair must be enclosed in ", example ... "-Dproperty=value" ...

Validation
  • skipChecksumCalculation (Boolean) - Ignore checksum calculation during the validation process.

Use as a Java Library

  • Using Maven
  1. Add the following repository
  1. Add the following dependency
<dependency>
<groupId>org.roda-project</groupId>
<artifactId>commons-ip2</artifactId>
<version>2.6.0</version>
</dependency>

Write some code

  • Create a full E-ARK SIP
// 1) instantiate E-ARK SIP object
SIP sip=new EARKSIP("SIP_1",IPContentType.getMIXED(),IPContentInformationType.getMIXED());
        sip.addCreatorSoftwareAgent("RODA Commons IP","2.0.0");

// 1.1) set optional human-readable description
        sip.setDescription("A full E-ARK SIP");

// 1.2) add descriptive metadata (SIP level)
        IPDescriptiveMetadata metadataDescriptiveDC=new IPDescriptiveMetadata(
        new IPFile(Paths.get("src/test/resources/eark/metadata_descriptive_dc.xml")),
        new MetadataType(MetadataTypeEnum.DC),null);
        sip.addDescriptiveMetadata(metadataDescriptiveDC);

// 1.3) add preservation metadata (SIP level)
        IPMetadata metadataPreservation=new IPMetadata(
        new IPFile(Paths.get("src/test/resources/eark/metadata_preservation_premis.xml")));
        sip.addPreservationMetadata(metadataPreservation);

// 1.4) add other metadata (SIP level)
        IPFile metadataOtherFile=new IPFile(Paths.get("src/test/resources/eark/metadata_other.txt"));
// 1.4.1) optionally one may rename file final name
        metadataOtherFile.setRenameTo("metadata_other_renamed.txt");
        IPMetadata metadataOther=new IPMetadata(metadataOtherFile);
        sip.addOtherMetadata(metadataOther);

// 1.5) add xml schema (SIP level)
        sip.addSchema(new IPFile(Paths.get("src/test/resources/eark/schema.xsd")));

// 1.6) add documentation (SIP level)
        sip.addDocumentation(new IPFile(Paths.get("src/test/resources/eark/documentation.pdf")));

// 1.7) set optional RODA related information about ancestors
        sip.setAncestors(Arrays.asList("b6f24059-8973-4582-932d-eb0b2cb48f28"));

// 1.8) add an agent (SIP level)
        IPAgent agent=new IPAgent("Agent Name","OTHER","OTHER ROLE",CreatorType.INDIVIDUAL,"OTHER TYPE","",
        IPAgentNoteTypeEnum.SOFTWARE_VERSION);
        sip.addAgent(agent);

// 1.9) add a representation (status will be set to the default value, i.e.,
// ORIGINAL)
        IPRepresentation representation1=new IPRepresentation("representation 1");
        sip.addRepresentation(representation1);

// 1.9.1) add a file to the representation
        IPFile representationFile=new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
        representationFile.setRenameTo("data.pdf");
        representation1.addFile(representationFile);

// 1.9.2) add a file to the representation and put it inside a folder
// called 'def' which is inside a folder called 'abc'
        IPFile representationFile2=new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
        representationFile2.setRelativeFolders(Arrays.asList("abc","def"));
        representation1.addFile(representationFile2);

// 1.10) add a representation & define its status
        IPRepresentation representation2=new IPRepresentation("representation 2");
        representation2.setStatus(new RepresentationStatus(REPRESENTATION_STATUS_NORMALIZED));
        sip.addRepresentation(representation2);

// 1.10.1) add a file to the representation
        IPFile representationFile3=new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
        representationFile3.setRenameTo("data3.pdf");
        representation2.addFile(representationFile3);

// 2) build SIP, providing an output directory
        Path zipSIP=sip.build(tempFolder);

Note: SIP implements the Observer Pattern. This way, if one wants to be notified of SIP build progress, one just needs to implement SIPObserver interface and register itself in the SIP. Something like (just presenting some of the events):

public class WhoWantsToBuildSIPAndBeNotified implements SIPObserver {

  public void buildSIP() {
    ...
    SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED());
    sip.addObserver(this);
    ...
  }

  @Override
  public void sipBuildPackagingStarted(int totalNumberOfFiles) {
    ...
  }

  @Override
  public void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed) {
    ...
  }
}
  • Parse a full E-ARK SIP
// 1) invoke static method parse and that's it
SIP earkSIP=EARKSIP.parse(zipSIP);

Development

In this section are some relevant notes about Commons IP development.

XML Beans

XML Beans are used by Commons IP to manipulate METS files using Java code.

Details

Some changes were made to XML Schemas to be able to compile XML Schemas into Java classes using XJC as well as to be able to validate an XML file against its XML Schema without Internet connections.

The changes are:

  • src/main/resources/schemas2/mets1_12.xsd XLink Schema location made local (and respectively file available locally)
  • src/main/resources/schemas2/mets1_12.xjb Bindings file created to deal with attribute name conflict between METS and XLink Schemas

After Java classes were created, some changes were made to produce METS XML files well-defined in terms of namespaces. Namely:

  • src/main/java/org/roda_project/commons_ip2/mets_v1_12/beans/package-info.java Annotations for correctly generate namespaces were added. Following is the before & then the after:
@jakarta.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = jakarta.xml.bind.annotation.XmlNsForm.QUALIFIED)

and the after

@jakarta.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = jakarta.xml.bind.annotation.XmlNsForm.QUALIFIED, xmlns = {
  @jakarta.xml.bind.annotation.XmlNs(prefix = "", namespaceURI = "http://www.loc.gov/METS/"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "xsi", namespaceURI = "http://www.w3.org/2001/XMLSchema-instance"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "csip", namespaceURI = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "sip", namespaceURI = "https://DILCIS.eu/XML/METS/SIPExtensionMETS"),
  @jakarta.xml.bind.annotation.XmlNs(prefix = "xlink", namespaceURI = "http://www.w3.org/1999/xlink")})

IANA Media Types

The IANA Media Types list is required to perform SIP Validation. The list is located in the folder and named as follows:

/src/main/resources/controlledVocabularies/IANA_MEDIA_TYPES.txt

How to generate/update

To update IANA media types list, in commons-ip root directory run the following command:

./scripts/ianaMediaTypes_parser.sh

The command executes a script that downloads all IANA Media Types (Application, Audio, Font, Image, Message, Model, Multipart, Text, Video) from https://www.iana.org/assignments/media-types/${iana_file}.csv . Note that this downloads different .csv files then creates a .txt file with all IANA media types appended to the new file.

Extend List of IANA Media Types

The IANA media types list from https://www.iana.org/assignments/media-types/${iana_file}.csv was extended with the following mimetypes:

Image
  • image/jpeg
  • image/gif
Text
  • text/plain
Video
  • video/mpeg

Update Extended list of IANA Media Types

  • To add/remove new values to the list add a new line with the IANA Media Type in scripts/ExtendedIANAMediaTypes.txt

  • Run the script again

Commercial support

For more information or commercial support, contact KEEP SOLUTIONS.

Further reading

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

License and Intellectual Property

All contributions to this project are licensed on LGPL v3, which includes an explicit grant of patent rights, meaning that the developers who created or contributed to the code relinquish their patent rights with regard to any subsequent reuse of the software.

Credits

  • Hélder Silva (KEEP SOLUTIONS)

License

LGPLv3