Reference-based IPs #747

Open
jmaferreira opened this issue Sep 4, 2024 · 2 comments
@jmaferreira

Reference-based IPs

The goal of this recommendation is to add the capability to build IPs that can transport metadata about content files without having to include the files themselves.

Essentially, the IP will only carry pointers or references to the actual files. This approach has been referred to as "Shallow IPs".

In theory, this functionality is already possible without altering the current implementation, but it would be beneficial to include a section in the CSIP detailing how to achieve this. This recommendation affects all information packages (SIP, AIP and DIP).

Use cases

One of the primary use cases supporting this recommendation concerns I/O logistics and performance. Pre-ingest is responsible for organising data into SIPs before submitting them to the archive. The traditional approach involves copying the data into a physically bound E-ARK SIP, i.e., a package that includes all the referenced files within its physical boundaries.

This SIP is then submitted to the archive, which generates E-ARK AIPs. After verifying that the ingest was successful, the system must then remove the E-ARK SIP and the original data. Additionally, the system must handle ingest failures and the potential need to regenerate SIPs. This process can lead to several issues, most notably the requirement for significant storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP, and the AIP copies.
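
As a rough illustration of the "three times" figure, the peak footprint can be sketched as below. This is only back-of-the-envelope arithmetic: the function name `peak_storage` and the example sizes are hypothetical, and it assumes the SIP and AIP each carry one copy of the metadata.

```python
def peak_storage(original: int, metadata: int, shallow: bool) -> int:
    """Peak storage (same unit as the inputs) while original, SIP and AIP coexist.

    Hypothetical helper: assumes one metadata copy in the SIP and one in the AIP.
    """
    if shallow:
        # Content stays where it is; SIP and AIP only carry metadata.
        return original + 2 * metadata
    # Original data + a full copy in the SIP + a full copy in the AIP.
    return 3 * original + 2 * metadata


# Illustrative numbers only: 100 units of content, 1 unit of metadata.
print(peak_storage(100, 1, shallow=False))
print(peak_storage(100, 1, shallow=True))
```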

Moreover, the I/O operations needed to move the data around can place unnecessary stress on the archival system and storage infrastructure.

Instead of duplicating the data within the archival system, Shallow IPs allow for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage and minimising the load on the repository’s I/O operations.

This approach ensures that data can be efficiently managed and accessed without unnecessary duplication, improving performance and reducing resource consumption. However, it also increases the risk of data loss, as the repository lacks full control over the data. In certain scenarios, though, this risk may be acceptable in exchange for reducing the use of expensive storage, e.g., when the pre-ingest area and the archival storage area share the same infrastructure.

Provided that the OAIS repository can retrieve a file whenever it is needed for access or preservation operations, the ability to refer to external content can greatly reduce the local resources needed to manage an OAIS system, or allow modern storage systems to be used as a backend for storing E-ARK AIPs. It can also reduce the overall storage an institution needs when the content resides both in the current information management system (e.g., ERMS) and in the OAIS archive.

For example, a TV broadcast station may have its own archive, which demands significant storage space. While backups and remote replicas are already in place, they may wish to incorporate this data into an OAIS archive to benefit from enhanced preservation features (e.g. Representation Information). However, duplicating the storage or transferring the content into the OAIS archive, and modifying the production system to rely on the OAIS archive, would not be feasible due to the prohibitive costs of such strategies.

The alternative would be for the OAIS archive to refer to the content in the production system, allowing curators to create shallow SIPs and submit them to the OAIS archive. The OAIS archive would still need to access the external content to execute ingest workflow validations.

Preservation actions such as fixity checks, file format identification, file format validation, and file format conversion could also use the external data as input. However, every outcome of these operations shall become a local file (including the output of file format conversion actions). We recommend keeping all metadata (descriptive, preservation, other) local to the AIPs, while allowing representation data to be external/remote.
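
A fixity check against external data could look roughly like the sketch below. It is a minimal illustration, not part of the proposal: the function name `verify_fixity` is hypothetical, and it assumes the reference is reachable through a scheme that Python's `urllib` supports (`http`, `https`, `file`). Other schemes, such as `nfs://`, would need a mounted path or a dedicated client.

```python
import hashlib
from urllib.request import urlopen


def verify_fixity(href: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream a referenced file and compare its SHA-256 with the METS CHECKSUM value."""
    digest = hashlib.sha256()
    # Read in chunks so large files are never loaded into memory at once.
    with urlopen(href) as response:
        for chunk in iter(lambda: response.read(chunk_size), b""):
            digest.update(chunk)
    # METS checksums are often upper-case hex; hashlib returns lower-case.
    return digest.hexdigest() == expected_sha256.strip().lower()
```

The file is only streamed through the hash, so nothing is copied into the package; the boolean result (for example, recorded as a preservation event) is the part that stays local.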

CSIP principles 1.4 and 3.6 state that:

  • "the Information Package SHOULD be scalable" and that
  • "[...] it is clear that any given technical implementation will become obsolete in time, for example, as new transfer methods and storage solutions emerge. As such this specification does not prohibit the take-up of any emerging logical or physical technical solutions."

Taking these principles into account, we suggest a modification to the CSIP text (which would apply to SIP, AIP, and DIP) to allow representation data to be non-local.

We propose altering the following sections in the Common IP specification:

Section 4. CSIP structure

Update text from:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended.

The main reason for this implementation decision is that a fixed and documented folder structure makes the package layout clear to both human users and automated tools. The main benefit this clarity is that many archival tasks (e.g. file format risk analysis), can be executed directly on the data portion of the package structure, as opposed to first processing potentially large amounts of metadata for file locations. This allows for more efficient processing which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

Many data storage solutions do not explicitly support folder structures, but use other means for structuring and storing AIP data and metadata. However, the purpose of this specification is to facilitate and support Information Package interoperability. When storage solutions do not support the implementation of the package structure for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs. While systems need to implement transformations between Common Specification IPs and internal AIPs it allows interoperability between tools that support the common specification, easy transfer of IPs to new repository systems or storage solutions and the establishment of multi-repository duplicated storage solutions.

To:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended. However, it is recognised that reference-based packages, where files can be physically separated and referred to by links, are also viable.

The main reason for preferring a fixed and documented folder structure is that it makes the package layout clear to both human users and automated tools. This clarity is crucial as it allows many archival tasks (e.g., file format risk analysis) to be executed directly on the data portion of the package structure, without the need to first process potentially large amounts of metadata to locate files. This leads to more efficient processing, which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

While many data storage solutions do not explicitly support folder structures and may use alternative methods, such as Content Addressable Storage systems (CAS) that allow files to be stored separately and retrieved by a unique identifier (and not by its name and location), the purpose of this specification remains to facilitate and support Information Package interoperability. Even when storage solutions do not support the implementation of the folder structures for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs mainly for transport reasons.

Systems that use reference-based packages should ensure that they implement the necessary processes to consolidate referenced files into a local physical structure, enabling interoperability between tools that support the common specification. This also allows for the easy transfer of IPs to new repository systems or storage solutions and the establishment of multi-repository duplicated storage solutions.
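
One possible shape for such a consolidation step is sketched below. It is an assumption-laden illustration: the function name `consolidate` and the representation name `rep1` are hypothetical, name collisions are not handled, and only schemes supported by Python's `urllib` (`http`, `https`, `file`) are covered.

```python
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def consolidate(href: str, package_root: Path,
                data_dir: str = "representations/rep1/data") -> str:
    """Copy an externally referenced file into the package folder structure.

    Returns the relative href that the METS FLocat could use afterwards.
    """
    parts = urlsplit(href)
    if not parts.scheme:
        # Already a relative reference inside the package: nothing to do.
        return href
    name = PurePosixPath(unquote(parts.path)).name
    target = package_root / data_dir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the remote content straight to disk.
    with urlopen(href) as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return f"{data_dir}/{name}"
```

Running this over every absolute reference and rewriting the corresponding `xlink:href` values would turn a shallow IP into a physically bound one, for example before transfer to another repository.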

Section 4.3 Implementing reference-based information packages (NEW SECTION)

In certain implementation contexts, it is advantageous for the files included in an information package - whether it be an SIP, AIP, or DIP - to be referenced via links rather than physically bound together in a folder structure.

One of the primary use cases supporting this recommendation concerns I/O logistics and performance. Pre-ingest is responsible for organising data into SIPs before submitting them to the archive. The traditional approach involves copying the data into a physically bound E-ARK SIP, i.e., a package that includes all the referenced files within its physical boundaries.

This SIP is then submitted to the archive, which generates E-ARK AIPs. After verifying that the ingest was successful, the system must then remove the E-ARK SIP and the original data. Additionally, the system must handle ingest failures and the potential need to regenerate SIPs. This process can lead to several issues, most notably the requirement for significant storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP, and the AIP copies.

Moreover, the I/O operations needed to move the data around can place unnecessary stress on the archival system and storage infrastructure.

Reference-based IPs, also known as Shallow IPs, offer a solution by allowing for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage and minimising the strain on the repository’s I/O operations. This approach enables data to be managed and accessed without unnecessary duplication, leading to improved performance and reduced resource consumption. However, it is important to note that this approach carries an increased risk of data loss, as the repository does not maintain full control over the physical storage of the data. Performance may also suffer when data has to be fetched over a low-bandwidth communication channel. Despite these risks, in certain scenarios the trade-off may be justified in order to reduce the use of costly resources.

Reference-based IPs can significantly optimise file storage usage, particularly in scenarios where large volumes of data remain actively used by other systems that function as both processing and archival platforms.

For instance, in environments where data is stored in Content Addressable Storage (CAS) or cloud-based systems like Amazon S3 and OpenStack Swift, the advantages of reference-based IPs are evident. These systems inherently manage data through references or identifiers, making reference-based IPs an ideal solution for efficient storage and retrieval without the need for duplicating data before and after archiving.

Examples

Typically, the METS file included in an Information Package contains relative URLs that reference files included within the package itself.

Below is an example of an IP that references files available locally within the IP.

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500"
    MIMETYPE="application/xml" SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="schemas/xlink.xsd"/>
</file>

However, the METS specification supports several options for LOCTYPE, and the URL option supports the full expressiveness of the URL standard. To create a reference-based information package one should make use of this expressiveness and specify the full location of external files.

The following excerpt is an example of an IP that references files available remotely instead of stored in a local folder structure:

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500"
    MIMETYPE="application/xml" SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="https://www.w3.org/XML/2008/06/xlink.xsd"/>
</file>

Other protocols may also be used. The following example depicts how files may be located on a shared drive, accessible via the NFS protocol:

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500"
    MIMETYPE="application/xml" SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="nfs://shared-host/schemas/xlink.xsd"/>
</file>
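
Since the three excerpts differ only in `xlink:href`, a tool can tell included files from referenced ones by checking whether the href carries a URL scheme. The sketch below is one way to do that with the Python standard library. The function name `classify_flocats` and the sample document are hypothetical, and the sample is cut down to the elements needed here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily trimmed METS document reusing the three hrefs shown above.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec><fileGrp>
    <file ID="f1"><FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="schemas/xlink.xsd"/></file>
    <file ID="f2"><FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="https://www.w3.org/XML/2008/06/xlink.xsd"/></file>
    <file ID="f3"><FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="nfs://shared-host/schemas/xlink.xsd"/></file>
  </fileGrp></fileSec>
</mets>"""


def classify_flocats(mets_xml: str) -> dict[str, str]:
    """Map each FLocat href to 'relative' (inside the IP) or 'absolute' (external)."""
    root = ET.fromstring(mets_xml)
    result = {}
    for flocat in root.iter(f"{{{METS_NS}}}FLocat"):
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        # A non-empty scheme (https, nfs, ...) marks a reference outside the package.
        result[href] = "absolute" if urlsplit(href).scheme else "relative"
    return result


for href, kind in classify_flocats(SAMPLE).items():
    print(f"{kind:9} {href}")
```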
@jmaferreira jmaferreira added the Feature request This issue is a feature which will be implemented further on. Used together with a milestone. label Sep 4, 2024
@jmaferreira jmaferreira self-assigned this Sep 4, 2024
@adamfarquhar

It is helpful to discuss this explicitly. The TV Broadcaster provides a good use case.

“referenced via links rather than physically bound together in a folder structure” – from the METS example, everything is referenced using a link (FLocat) – both files included in the IP and ones that are not. The difference is that the included files have a relative URL. Perhaps there is another way to make the difference clear.

The METS examples would be clearer and easier to understand if they referred to a real content file rather than an XML schema file. The differences would be clearer if they were highlighted or even if the text just showed the new href. As it is, the reader has to parse it carefully and confirming that the ID and CHECKSUMs are the same is not readily done by eye!

“place unnecessary stress on the archival system and storage infrastructure” – What does this mean? The text already mentions the cost of multiple copies of the data.

I think that the text may underplay the risks that can come from referencing the content. Take the TV Broadcaster, for example. Imagine that disk space is tight and a big event risks filling up available storage. It is easy to make an operational decision to delete files that seem unlikely to be used. Or a recording is embarrassing to a presenter and they put pressure on operations staff to delete it. Or the operational version is directly edited. It is also harder to ensure that reference-based copies are physically distinct.

Using remote/cloud-based storage that is part of the archival solution does not introduce the risks that you might see when using copies in an operational system.

“the package layout clear to both human users and automated tools” – as an aside, I guess that this is more about being a lowest-common-denominator technology (a file system with files and folders) and about easily using tools designed for another context.

A final editing pass should catch the minor errors in the very clear writing.

@shsdev
Contributor

shsdev commented Sep 5, 2024

I believe the proposal is reasonable. I am familiar with requirements of this kind related to archive systems that use an S3 storage backend. However, regarding the AIP, I would suggest adding a subordinate requirement that it must always be possible to generate an all-inclusive AIP, which can be divided or split for very large AIPs. The reasoning behind this is that when using external storage for AIPs, it is then still possible to safeguard all-inclusive AIPs on more affordable long-term storage options, such as tape drives, for example.
