Skip to content

acdh-oeaw/arche-metadata-crawler

Repository files navigation

Metadata Crawler

Latest Stable Version Build status Coverage Status License

Functionality

A set of scripts:

  • Merging metadata of a collection from inputs in various formats
  • Validating the merged metadata
  • Generating XLSX metadata templates based on the current ontology (see the horizontal metadata files in metadata formats description)

used for the metadata curation during ARCHE ingestions.

Installation

Locally

  • Install PHP and composer
  • Run:
    composer require acdh-oeaw/arche-metadata-crawler

As a docker image

  • Install docker.
  • Run the acdhch/arche-ingest image mounting your data directory into it:
    docker run --rm -ti --entrypoint bash -u `id -u`:`id -g` \
               -v pathToYourDataDir:/data \
               acdhch/arche-ingest
  • Run the scripts, e.g.
    arche-create-metadata-template /data all
    and
    arche-crawl-meta \
      /data/metadata \
      /data/merged.ttl \
      /ARCHE/staging/GlaserDiaries_16674/data \
      https://id.acdh.oeaw.ac.at/glaserdiaries
    
    • if you need the file-checker, you can just run it with arche-filechecker

On ACDH Cluster

Nothing to be done. It is installed there already.

Usage

(For a full walk-trough using arche-ingestion@acdh-cluster and the Wollmilchsau test collection please look here)

On ACDH Cluster

First, get the arche-ingestion workload console as described here

Then:

  • Generate and validate the metadata:
    • Run the arche-crawl-meta script:
      /ARCHE/vendor/bin/arche-crawl-meta \
        <pathToMetadataDirectory> \
       --filecheckerReportDir <pathToTheFileCheckerReportDirectory> \
        <outputTtlPath> \
        <basePathOfTheCollection> \
        <idPrefix> \
        2>&1 | tee <pathToLogFile>
      e.g.
      /ARCHE/vendor/bin/arche-crawl-meta \
        /ARCHE/staging/GustavMahlerArchiv_22334/metadata \
        --filecheckerReportDir /ARCHE/staging/GustavMahlerArchiv_22334/checkReports/2024_04_08_09_19_24 \
        /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.ttl \
        /ARCHE/staging/GustavMahlerArchiv_22334/data \
        https://id.acdh.oeaw.ac.at/GustavMahlerArchiv \
        2>&1 | tee /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.log
      • If you are want to skip the checks (which speeds up the process significantly), add the --noCheck parameter, e.g.
        /ARCHE/vendor/bin/arche-crawl-meta \
          /ARCHE/staging/GustavMahlerArchiv_22334/metadata \
          --filecheckerReportDir /ARCHE/staging/GustavMahlerArchiv_22334/checkReports/2024_04_08_09_19_24 \
          /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.ttl \
          /ARCHE/staging/GustavMahlerArchiv_22334/data \
          https://id.acdh.oeaw.ac.at/GustavMahlerArchiv \
          --noCheck \
          2>&1 | tee /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.log
        
  • Create metadata templates:
    /ARCHE/vendor/bin/arche-create-metadata-template \
      <pathToDirectoryWhereTemplateShouldBeCreated> \
      all
    e.g. to create templates in the current directory
    /ARCHE/vendor/bin/arche-create-metadata-template . all

Locally

  • Generating and validaing the metadata:
    vendor/bin/arche-crawl-meta \
      --filecheckerReportDir pathToDirectoryWithFilecheckerOutput \
      pathToInputMetadataDir \
      mergedMetadataFilePath \
      pathToCollectionData \
      pathToTargetMetadataFile
    e.g.
    vendor/bin/arche-crawl-meta \
      --filecheckerReportDir reports/2024_03_01_12_45_23 \
      metaDir \
      metadata.ttl \
      `pwd`/data \
      https://id.acdh.oeaw.ac.at/myCollection
  • Creating metadata templates:
    vendor/bin/arche-create-metadata-template \
      <pathToDirectoryWhereTemplateShouldBeCreated> \
      all
    e.g. to create templates in the current directory
    bin/arche-create-metadata-template . all

Remarks:

  • To get a list of all available parameters run:
    vendor/bin/arche-crawl-meta --help
    vendor/bin/arche-create-metadata-template --help

As a docker container

  • Generating and validaing the metadata:
    Run a container mounting directory structure inside the container and overridding the command to be run with the arche-crawl-meta:
    docker run \
      --rm -u `id -u`:`id -g`\
      -v pathInHost:/mnt \
      --entrypoint arche-crawl-meta \
      acdhch/arche-ingest \
      --filecheckerReportDir pathToDirectoryWithFilecheckerOutput \
      pathToInputMetadataDir \
      mergedMetadataFilePath \
      pathToCollectionData \
      pathToTargetMetadataFile
    e.g. to use with pahts relatively to the current working directory
    docker run \
      --rm -u `id -u`:`id -g`\
      -v `pwd`:/mnt \
      --entrypoint arche-crawl-meta \
      acdhch/arche-ingest \
      --filecheckerReportDir /mnt/reports/2024_03_01_12_45_23 \
      /mnt/metaDir \
      /mnt/metadata.ttl \
      /mnt/data \
      https://id.acdh.oeaw.ac.at/myCollection
  • Creating metadata templates:
    Run a container mounting directory where templates should be created under /mnt inside the container and overridding the command to be run with the arche-create-metadata-template:
    docker run \
      --rm -u `id -u`:`id -g`\
      -v pathToDirectoryWhereTemplateShouldBeCreated:/mnt \
      --entrypoint arche-create-metadata-template
      acdhch/arche-ingest \
      /mnt all
    e.g. to create the templates in the current directory
    docker run \
      --rm -u `id -u`:`id -g` \
      -v `pwd`:/mnt \
      --entrypoint arche-create-metadata-template \
      acdhch/arche-ingest \
      /mnt all