This is a collection of tools for extract-map-validate workflows.
- schema & documentation of the o2r metadata
- extract - collect meta information from files or session
- broker - translate metadata from o2r to third party schemas
- validate - check if metadata set is valid to the schema
- harvest - collect metadata from external sources via OAI-PMH
For their role within o2r, please refer to o2r-architecture.
o2r-meta is licensed under Apache License, Version 2.0, see file LICENSE. Copyright (C) 2016, 2017 - o2r project.
o2r meta is designed for python 3.6 and supports python 3.4+.
Installation steps
(1) Acquire python version 3.4+.
(2) Parts of o2r meta require the gdal module that is known for causing trouble when installed via PIP. Therefore it is recommended to prepare the installation like this:
sudo add-apt-repository ppa:ubuntugis/ppa -y
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable -y
sudo apt-get -qq update
sudo apt-get install -y python3-dev
sudo apt-get install -y libgdal1h
sudo apt-get install -y libgdal-dev
sudo apt-get build-dep -y python-gdal
sudo apt-get install -y python-gdal
export CPLUS_INCLUDE_PATH=/usr/include/gdal
export C_INCLUDE_PATH=/usr/include/gdal
and afterwards install gdal this way:
pip install GDAL==$(gdal-config --version | awk -F'[.]' '{print $1"."$2}')
Alternatively you can use a precompiled python wheel (note: these are inofficially provided) of the gdal module that fits your desired platform.
(3) Install the required modules:
pip install -r requirements.txt
Another way of installation is provided by the Dockerfile. Build it like this:
docker build -t meta .
And start the extractor (e.g.) like this:
docker run meta o2rmeta.py -debug extract -i extract/tests -o extract/tests -xo
- Current documentation as part of the ERC-SPEC (GitHub)
- Current structure dummy
- MD of the erc configuration file
schema draft
When calling o2r meta, you can chose from the following commands, each representing one tool of the o2r meta suite: extract, validate, broker and harvest.
python o2rmeta [-debug] extract|validate|broker|harvest <ARGS>
Options:
debug
: option to enable verbose debug info where applicable
Each tool then has a number of required arguments:
python o2rmeta.py extract -i <INPUT_DIR> -s|-o <OUTPUT_DIR> [-xo] [-m] [-xml] [-ercid <ERC_ID>]
Example call:
python o2rmeta.py -debug extract -i extract/tests -o extract/tests -xo
Explanation of the switches:
-i
<INPUT_DIR> : required starting path for recursive search for parsable files-s
: option to print out results to console. This switch is mutually exclusive with-o
. At least one of them must be given-o
<OUTPUT_DIR> : required output path, where data should be saved. If the directory does not exist, it will be created on runtime. This switch is mutually exclusive with-s
. At least one of them must be given.-xo
: option to disable http requests (the extractor will stay offline. This disables orcid retrieval, erc spec download, doi retrieval, ...)-m
: option to additionally enable individual output of all processed files.-xml
: option to change output format from json (default) to xml.-ercid
<ERC_ID>: option to provide an ERC identifier.-b
<BASE_DIR>: option to provide starting point directory for relative paths output
Feel free to open an issue for suggestions!
Current version:
file type | description | extracted part | status |
---|---|---|---|
(r session) | live extraction | memory objects | under evaluation |
.cdl/.nc | NetCDF | geometry | under evaluation |
.csv/.tsv | seperated values | column headers | planned |
.geojson/.json | GeoJSON | geometry | WIP |
.gpkg | OGC GeoPackage | geometry | planned |
.jp2 | JPEG2000 | geometry | planned |
.py | python script | all | planned |
.r | R Script | all | implemented |
.rmd | R-Markdown | all | implemented |
.shp | Esri shapefile | geometry | implemented |
.tex | LaTeX | header | planned |
.tif(f) | geo TIFF | geometry | planned |
.yml | YAML | metadata | planned |
bagit.txt | BagIt | metadata | implemented |
... | ... | ... | ... |
The broker has two modes: In mapping mode, it creates fitting metadata for a given map by following a translation scheme included in that mapping file. In checking mode it returns missing metadata information for a target service or plattform, e.g. zenodo publication metadata, for a given checklist and input data.
The broker can be used to translate between different standards for metadata sets. For example from extracted raw metadata to schema-compliant metadata. Other target outputs might DataCite XML or Zenodo JSON. Translation instructions as well as checklists are stored in json formatted map files.
python o2rmeta.py broker -i <INPUT_FILE> -c <CHECKLIST_FILE>|-m <MAPPING_FILE> -s|-o <OUTPUT_DIR>
Example calls:
python o2rmeta.py -debug broker -c broker/checks/zenodo-check.json -i schema/json/example_zenodo.json -o broker/tests/all
python o2rmeta.py -debug broker -m broker/mappings/zenodo-map.json -i broker/tests/metadata_raw.json -o broker/tests/all
Explanation of the switches:
-c
<CHECKLIST_FILE> : required path to a json checklist file that holds checking instructions for the metadata. This switch is mutually exclusive with-m
. At least one of them must be given.-m
<MAPPING_FILE> : required path to a json mapping file that holds translation instructions for the metadata mappings. This switch is mutually exclusive with-c
. At least one of them must be given.-i
<INPUT_FILE> : path to input json file.-s
: option to print out results to console. This switch is mutually exclusive with-o
. At least one of them must be given.-o
<OUTPUT_DIR> : required output path, where data should be saved. If the directory does not exist, it will be created on runtime. This switch is mutually exclusive with-s
. At least one of them must be given.
Supported checks/maps
service | checklist file | mapping file | status | comment |
---|---|---|---|---|
zenodo | zenodo-check.json | zenodo-map.json | WIP | zenodo will register MD @ datacite.org |
eudat b2share | eudat-b2share-check.json | eudat-b2share-map.json | WIP | b2share supports custom MD schemas |
... | ... | ... | ... | ... |
Additionally the following features will be made available in the future:
- Documentation of the formal map-file "minimal language" (create your own map-files).
- Governing JSON-Schema for the map files (validate map-files against the map-file-schema).
python o2rmeta.py validate -s <SCHEMA> -c <CANDIDATE>
Example call:
python o2rmeta.py -debug validate -s schema/json/o2r-meta-schema.json -c schema/json/example1-valid.json
Explanation of the switches:
-s
: required path or URL to the schema file, can be json or xml.-c
: required path to candidate that shall be validated.
Collects OAI-PMH metadata from catalogues, data registries and repositories and parses them to assist the completion of a metadata set. Note, that this tool is currently only a demo.
python o2rmeta.py harvest -e <ELEMENT> -q <QUERY>
Example call:
python o2rmeta.py -debug harvest -e"doi" -q"10.14457/CU.THE.1989.1"
Explanation of the switches:
-e
: MD element type for search, e.g. doi or creator-q
: MD content to start the search