Databus Mods

Introduction to Databus Mods

Databus Mods enhance the metadata of files registered through the Databus. These modular add-ons extend the core Databus system, conducting analyses and evaluations to provide detailed metadata. This includes data summaries, statistical information, and enriched descriptive metadata. The process of enrichment is driven by Linked Data technologies, ensuring a consistent, data-centric approach.

Adhering to the PROV-O model, Databus Mods generate additional metadata that is associated with the persistent identifiers of the Databus files, independent of their respective publishers. This metadata can be conveniently accessed via a SPARQL endpoint and an HTTP file server, enhancing the discoverability and accessibility of the enriched (meta)data.

Architectural Framework of Databus Mods

Databus Mods leverage a master-worker architectural model to automate the enrichment of Databus file metadata. The architecture consists of a Mod Master service and multiple Mod Workers.

The Mod Master service monitors the Databus SPARQL endpoint for updates, orchestrates activities for the Mod Workers, aggregates the generated metadata, and maintains a uniform storage. The Mod Workers implement the metadata model and provide an HTTP interface, which the Mod Master uses to initiate a Mod Activity for a specific Databus file. Multiple Mod Workers of the same type can run concurrently, which improves system throughput and scalability.
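As a rough illustration of this master-worker split, the following Python sketch simulates a master dispatching Mod Activities to several workers of the same type. It is not the actual implementation: the names `ModWorker`, `ModMaster`, `run_activity`, and the example file IRIs are hypothetical, and the HTTP interface and SPARQL polling are replaced by plain method calls and a fixed list of files.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class ModWorker:
    """Hypothetical stand-in for a Mod Worker (a real one exposes an HTTP interface)."""
    name: str

    def run_activity(self, file_iri: str) -> dict:
        # A real worker would fetch and analyse the file here.
        return {"file": file_iri, "worker": self.name, "status": "success"}


@dataclass
class ModMaster:
    """Hypothetical stand-in for the Mod Master service."""
    workers: list
    results: dict = field(default_factory=dict)

    def process(self, new_files: list) -> None:
        # Assign files to workers round-robin and run the activities concurrently.
        assignments = zip(new_files, cycle(self.workers))
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [pool.submit(w.run_activity, f) for f, w in assignments]
            for future in futures:
                result = future.result()
                # Aggregate all results in one place, keyed by the file identifier.
                self.results[result["file"]] = result


master = ModMaster(workers=[ModWorker("mimetype-1"), ModWorker("mimetype-2")])
master.process([
    "https://example.org/file/a.ttl",
    "https://example.org/file/b.ttl",
    "https://example.org/file/c.ttl",
])
```

Round-robin assignment is only one possible scheduling choice, used here because it keeps the example deterministic.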

Implementing Databus Mods

Databus Mods present a robust platform for creating new applications that require access and processing of enriched metadata. Potential applications range from metadata-driven Databus file searches to intricate workflows integrating Databus Mods as subprocesses. By utilizing the enriched metadata, applications can offer enhanced file filtering, comprehensive sorting options, and detailed insights into file attributes, leading to more nuanced data utilization strategies.

Key Capabilities of Databus Mods

Databus Mods offer the following distinctive features:

  1. Link: Databus Mods establish a linkage between metadata and Databus files via a unified DataID extension, offering a metadata model that enables users or computer agents to enrich Databus file metadata uniformly.
  2. Automate: Databus Mods facilitate automatic generation of new metadata, enabling statistics and quality reports. They implement a system that automates the enrichment of new Databus files and the provisioning of generated metadata.
  3. Discover: Databus Mods provide a robust system that enables users to locate relevant Databus files by searching the (meta)data, enhancing discoverability.
  4. Build: Databus Mods enable the construction of new applications and workflows. They provide a platform for showcasing applications that leverage metadata generated by Databus Mods, serving as a proof of concept.

Reference Mod Workers

The following reference Mod Workers can be used as templates for creating your own Mod Workers:

  1. MIME-Type Mod Worker: This Mod Worker detects each Databus file's MIME-Type. It decompresses the file if necessary and sniffs the file's byte stream using Apache Tika. The result is a Mod Result file containing RDF statements that describe the compression format and the (inner) MIME-Type of the analyzed Databus file. The linked MIME-Type resources are based on the IANA MIME-Type registry and described by a DataID vocabulary extension.

  2. File Metric Mod Worker: This Mod Worker counts the non-empty lines and the duplicate lines of a file, states whether the file is sorted line-wise, and determines its uncompressed byte size. These attributes serve as data selection criteria for several applications that process the files. The Mod Result files generated by this worker contain RDF describing these attributes.

  3. VoID Mod Worker: The VoID (Vocabulary of Interlinked Datasets) Mod is specifically oriented towards RDF files. The VoID Mod can only produce a subset of the possible VoID statistics, namely the property and class partitions. VoID is a popular metadata vocabulary used to describe the content of Linked Datasets. The VoID Mod worker can determine the frequency of usage of different classes and properties within the dataset. This metadata is then made available as RDF data itself, allowing it to be queried and utilized by other systems. This can be especially useful in data discovery, where one may want to find datasets containing specific classes or properties, or in data profiling, where one may want to understand the characteristics of a dataset before using it.

    The VoID Mod workers generate this metadata in a standardized form using the VoID vocabulary, ensuring it can be used and understood by any system capable of processing RDF data. In terms of implementation, the VoID Mod workers typically read the input RDF file line by line, parse each RDF triple, and increment counts associated with the used classes and properties.

    Please note that VoID provides a comprehensive vocabulary for describing Linked Datasets, and there is potential for the VoID Mod workers to be extended in the future to generate additional metadata beyond just property and class counts.

  4. SPO Mod Worker: The Subject Predicate Object (SPO) Mod is another RDF-oriented Mod. It counts how often each URI occurs, grouped by its position in the triple (subject, predicate, or object). Other processes can use these counts to calculate further metadata.
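The line-by-line counting described for the VoID and SPO Mod Workers can be sketched as follows. This is a minimal illustration, not the workers' actual code: it assumes N-Triples input with one triple per line, uses a deliberately naive splitter instead of a real RDF parser, and approximates the class partition by counting `rdf:type` statements without removing duplicates. All function names are hypothetical.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_line(line: str):
    """Naively split one N-Triples line into (subject, predicate, object).

    Returns None for blank lines and comments. Not a full N-Triples parser.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if line.endswith("."):
        line = line[:-1].rstrip()
    subject, predicate, obj = line.split(None, 2)
    return subject, predicate, obj


def iri(term: str):
    """Return the IRI inside <...>, or None for literals and blank nodes."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    return None


def count_triples(lines):
    property_partition = Counter()  # VoID-style: triples per property
    class_partition = Counter()     # VoID-style: rdf:type statements per class
    spo = {"subject": Counter(), "predicate": Counter(), "object": Counter()}
    for line in lines:
        triple = parse_line(line)
        if triple is None:
            continue
        _, p, o = triple
        # SPO-style: count each IRI by the position it occupies in the triple.
        for position, term in zip(("subject", "predicate", "object"), triple):
            value = iri(term)
            if value is not None:
                spo[position][value] += 1
        predicate_iri = iri(p)
        property_partition[predicate_iri] += 1
        if predicate_iri == RDF_TYPE and iri(o) is not None:
            class_partition[iri(o)] += 1
    return property_partition, class_partition, spo


sample = [
    '<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .',
    '<http://example.org/a> <http://example.org/name> "Alice Smith" .',
    '<http://example.org/b> <http://example.org/knows> <http://example.org/a> .',
]
properties, classes, spo = count_triples(sample)
```

A production worker would additionally serialize these counters as RDF, for example with the VoID vocabulary for the property and class partitions.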

###########################

TODO Marvin: The Databus can be customized by changing the SHACL shapes and the WebID, and by posting additional data. Please give some best practices on when to use this customization mechanism and when to use Mods. I think that if people have metadata that cannot be generated from the file but is available to the uploading agent, it could be included this way, e.g. if they have their own identifiers. They could also limit licenses to CC or a few open licenses only. Please also explain how Mods increase metadata quality (consistency is one aspect here, see e.g. the comments in byteSize).

Databus Mods

While the Databus Model is quite minimal and supports only the necessary access metadata (e.g. download URL, shasum etc.) and basic documentation (title, description), Databus Mods provide a way of automatically enhancing files on the Databus with (meta)data using Linked Data technologies. In a nutshell, Databus Mods provide a service plus a library for producing, persisting, and linking files related to Databus files (or actually running arbitrary code) when those are published. The metadata (or, if necessary, the data itself) from such a Mod is published in its own SPARQL endpoint, which makes it easy to join Databus files with their additional (meta)data, for example by using SPARQL's federated queries.
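To make the federated-query idea more concrete, here is a small Python sketch that builds such a query as a string. The endpoint URL and the graph patterns are assumptions for illustration only: `dcat:downloadURL` is assumed on the Databus side, and a PROV-O activity linking a file (`prov:used`) to a Mod result (`prov:generated`) is assumed on the Mods side. The actual vocabularies used by a given Mod may differ, and the helper `build_federated_query` is hypothetical.

```python
def build_federated_query(mods_endpoint: str, limit: int = 10) -> str:
    """Build a SPARQL 1.1 federated query joining Databus files with Mod results.

    The graph patterns are illustrative assumptions, not the confirmed Mods schema.
    """
    return f"""
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?file ?downloadURL ?modResult WHERE {{
  ?file dcat:downloadURL ?downloadURL .
  SERVICE <{mods_endpoint}> {{
    ?activity prov:used ?file ;
              prov:generated ?modResult .
  }}
}}
LIMIT {limit}
""".strip()


# Hypothetical Mods endpoint; the query itself would be sent to a Databus SPARQL endpoint.
query = build_federated_query("https://mods.example.org/sparql")
print(query)
```

The `SERVICE` clause is what delegates the inner pattern to the second endpoint, so the join on `?file` happens across both stores.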

Existing Examples

There are currently some basic examples of Databus Mods, applicable to various file types, that showcase what Databus Mods can be used for:

  1. Mimetype Mod: When any file is published, this Mod finds the corresponding MIME type and saves it.
  2. VoID Mod: Collects VoID metadata for RDF files and saves it in a SPARQL endpoint.
  3. Filemetrics Mod: Collects additional metrics not captured by the minimal model for any file, e.g. whether the file is sorted, its uncompressed size, and some more.
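As an example of what such a Mod computes, the following Python sketch derives the file metrics mentioned in this document (non-empty lines, duplicate lines, line-wise sortedness, uncompressed byte size) for a gzip-compressed input. The exact definitions are assumptions: duplicates are counted as every repeated occurrence of a non-empty line beyond the first, and sortedness is checked in byte order over non-empty lines. The function names are hypothetical.

```python
import gzip
from collections import Counter


def file_metrics(data: bytes) -> dict:
    """Compute simple line-based metrics over uncompressed file content."""
    non_empty = [line for line in data.splitlines() if line.strip()]
    counts = Counter(non_empty)
    # Duplicates: every occurrence of a line beyond its first one.
    duplicates = sum(n - 1 for n in counts.values())
    # Sorted: each non-empty line is >= its predecessor in byte order.
    is_sorted = all(a <= b for a, b in zip(non_empty, non_empty[1:]))
    return {
        "non_empty_lines": len(non_empty),
        "duplicate_lines": duplicates,
        "sorted": is_sorted,
        "uncompressed_byte_size": len(data),
    }


def metrics_for_gzip(compressed: bytes) -> dict:
    """Decompress gzip content in memory, then compute the metrics."""
    return file_metrics(gzip.decompress(compressed))


raw = b"a\nb\nb\n\nc\n"
metrics = metrics_for_gzip(gzip.compress(raw))
print(metrics)
```

This version loads the whole file into memory, which keeps the example short; a worker handling large files would more likely stream the content.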

Use Cases

TODO