
Revise data models and filter types #249

Open
jesper-friis opened this issue Mar 26, 2023 · 0 comments

The purpose of this issue is to get a clear picture of the data models and filter types provided by OTEAPI core. The intention is that it should result in a revision of the data models, strategies and documentation of OTEAPI core.

The main purpose of OTEAPI is to provide a structured way to document data sources and sinks and, based on that, to connect data consumers to data sources. The support for transformations should be seen as an added bonus.

Since OTEAPI core is agnostic to the underlying interoperability framework, it should not implement any filter type or strategy that depends on knowledge of the underlying interoperability framework (see the table below). Such strategies should be implemented in specialised OTEAPI plugins, like oteapi-dlite.

Filter types

The distinction between the terms filter type, which is conceptual, and strategy, which is a specific implementation, should be emphasised.

The generic form for a partial pipeline documenting a data source is

access -> parse -> mapping

while for a partial pipeline documenting a data sink it is

mapping -> generate -> deposit

Hence, there are 5 fundamental filter types (a combined source-to-sink pipeline is sketched after this list):

  • access fetches data from a data source and places it as-is in the data cache such that it is available for the parse strategy. It should refer to the data source or data service with downloadUrl or accessUrl/accessService, respectively. Provides accessibility. The configuration may include other metadata such as keywords (for basic findability) as well as license, description, etc. (for basic reusability).
  • parse converts the external data representation to our internal data representation (which is based on a selected interoperability framework). Use mediaType to identify the external data representation. New common keywords (defined in the pipeline ontology) should be coined to identify the internal data representation (DLite, SimPhoNy, MuPIF, ...) and the metadata scheme used within it. The parse strategy is identified by the combination of these keywords (mediaType/interoperabilityFramework/metadataScheme). Provides interoperability.
  • mapping maps the internal representation to ontological concepts. Identified by interoperabilityFramework. The mappingType could be kept optional, in case there are several mapping strategy implementations for a given interoperability framework. Provides semantic interoperability.
  • generate is the opposite of parse. It converts the internal data representation to the external data representation, storing the serialised result in the data cache. Provides interoperability.
  • deposit is the opposite of access. It writes the serialised result from generate to a data sink, like a file or an online service. Provides accessibility.
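
To make the intended composition concrete, below is a minimal sketch of a full source-to-sink pipeline, assuming an otelib-style client. The create_generate() and create_deposit() methods are hypothetical, mirroring the proposed filter types (they do not exist today), and all URLs and keyword values are placeholders.

```python
# Hypothetical sketch, assuming an otelib-style client.
# create_generate() and create_deposit() are made-up methods mirroring
# the proposed filter types; URLs and keyword values are placeholders.
from otelib import OTEClient

client = OTEClient("http://localhost:8080")

# access + parse (today provided by the combined resource filter)
source = client.create_dataresource(
    downloadUrl="https://example.org/data.json",
    mediaType="application/json",
)

# mapping: relate the internal representation to ontological concepts
mapping = client.create_mapping(mappingType="triples")

# proposed sink-side filter types (hypothetical methods)
generate = client.create_generate(mediaType="application/json")
deposit = client.create_deposit(uploadUrl="sftp://example.org/out.json")

# compose and execute the pipeline
pipeline = source >> mapping >> generate >> deposit
pipeline.get()
```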

In addition to these, we have:

  • resource, this is a composition of access and parse. Although it breaks the conceptually clean separation between access and parse, there are use cases where this combined filter makes sense, for instance when you access an online service with a Python library that represents the query result as a Python object that is not serialisable to the data cache. In this case it is easier to directly create the internal data representation (using the underlying interoperability framework), thereby eliminating the intermediate serialisation to the data cache. However, for data services that e.g. return a JSON payload, the separation into access and parse is preferable, since that improves reusability and clarity.
  • filter. Its main purpose is to update/specialise the configuration of other filters. A typical usage is to specialise query parameters for an access or deposit filter. This way, one can have a fixed pre-configured partial access->parse->mapping pipeline documenting a database, while still being able to specialise the query (see the sketch after this list). A filter strategy is typically designed to work together with a specific access or deposit strategy and should be independent of the underlying interoperability framework.
  • function, a synchronous transformation that runs directly on the server hosting the OTEAPI services. A typical use is explicit conversion between different data models of the underlying interoperability framework.
  • transformation, an asynchronous transformation running in the background. Intended for long-running transformations. It would be beneficial if we could make transformation filters agnostic to the underlying interoperability framework, while still being semantic. The best way to accomplish that is to refer to existing (interoperability-framework-dependent) partial pipelines for documenting the transformation input and output. This could borrow many concepts from the WrapperSDK, with the difference of being independent of AiiDA.
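
As a sketch of how a filter strategy could specialise a downstream access strategy, the class below pushes query parameters into the shared pipeline session, where a cooperating access strategy can pick them up. The class and field names are illustrative, not the current oteapi-core API.

```python
# Illustrative sketch (not the current oteapi-core API): a filter
# strategy that specialises the query of a downstream access strategy
# by returning configuration updates into the shared session.
from typing import Any, Dict, Optional
from pydantic import BaseModel

class QueryFilterConfig(BaseModel):
    """Hypothetical configuration for a query-specialising filter."""
    query: Dict[str, Any]  # e.g. {"element": "Al", "limit": 10}

class QueryFilterStrategy:
    def __init__(self, config: QueryFilterConfig) -> None:
        self.config = config

    def initialize(self, session: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Values returned here are merged into the pipeline session; a
        # cooperating access strategy can append them to its accessUrl
        # before fetching.
        return {"query": self.config.query}

    def get(self, session: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Nothing to do at execution time for this filter.
        return {}

# Example: specialise a pre-configured pipeline to query for aluminium.
strategy = QueryFilterStrategy(QueryFilterConfig(query={"element": "Al"}))
session_update = strategy.initialize()
```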

The different filters are summarised in the table below:

| Filter type | Identified by | IntOp. knowledge* | Level of data documentation | FAIR coverage |
|---|---|---|---|---|
| access | downloadUrl or accessUrl + accessService | no | cataloguing | accessibility (+ basic findability and reusability) |
| parse | mediaType + interoperabilityFramework + metadataScheme | yes | structural documentation | interoperability |
| mapping | interoperabilityFramework + mappingType | yes | semantic documentation | (semantic) interoperability |
| generate | mediaType + interoperabilityFramework + metadataScheme | yes | structural documentation | interoperability |
| deposit | uploadUrl or accessUrl? + accessService? | no | cataloguing | accessibility |
| resource | (downloadUrl or accessUrl + accessService) and mediaType + interoperabilityFramework + metadataScheme | yes | cataloguing + structural documentation | accessibility + interoperability (+ basic findability and reusability) |
| filter | filterType | no | - | - |
| function | interoperabilityFramework + functionType | yes | - | - |
| transformation | transformationType | no? | - | - |

*Whether the filter type has knowledge of/depends on the underlying interoperability framework.

The OTEAPI filter configurations cover three of the four levels of data documentation (cataloguing, structural documentation, contextual documentation and semantic documentation). The contextual data documentation is assumed to already exist in the associated knowledge base.

Backward compatibility

This issue suggests a few changes to OTEAPI core. These should be handled without breaking existing code, by adding deprecation warnings such that we can remove the deprecated features in a year's time.
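
For example, a deprecated strategy could emit a standard Python DeprecationWarning when instantiated instead of breaking; the strategy name and message below are illustrative only:

```python
# Illustrative sketch: warn on use of a deprecated strategy instead of
# removing it outright; the class name and message are placeholders.
import warnings

class DeprecatedUploadStrategy:
    def __init__(self, config):
        warnings.warn(
            "'Upload' strategies are deprecated and will be removed in a "
            "year's time; use the proposed 'deposit' filter type instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.config = config
```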

Pipeline Ontology (or Data Documentation Ontology?)

Common keywords that are shared between the configurations for the different filter types should be defined in the Pipeline Ontology. Should we rename it to Data Documentation Ontology (DDO)? It should use and build on DCAT as much as possible and have a clear connection to EMMO.

Examples of additional concepts and data properties that should be defined in this ontology are (a configuration sketch using some of them follows the list):

  • InteroperabilityFramework (class)
  • interoperabilityFramework (data property)
  • MetadataScheme (class)
  • metadataScheme (data property referring to the IRI identifying the metadata scheme within a given interoperability framework. For DLite this would be a data model URI)
  • mappingType
  • uploadUrl (could be a sub-property of dcat:downloadUrl)
  • depositUrl (really needed, or should we just use dcat:accessUrl?)
  • depositService (really needed, or should we just use dcat:accessService?)
  • filterType
  • functionType
  • transformationType
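
As a sketch of how the proposed keywords could surface in practice, the pydantic model below shows a hypothetical parse configuration using them; the field and class names are assumptions, not the current OTEAPI core API.

```python
# Hypothetical sketch of a parse configuration using the proposed
# keywords, assuming pydantic (which OTEAPI core builds on).  Field
# names follow this issue's proposals; values are placeholders.
from typing import Optional
from pydantic import BaseModel, Field

class ParseConfig(BaseModel):
    mediaType: str = Field(
        ..., description="External data representation, e.g. 'application/json'."
    )
    interoperabilityFramework: str = Field(
        ..., description="Internal data representation, e.g. 'dlite'."
    )
    metadataScheme: Optional[str] = Field(
        None,
        description=(
            "IRI identifying the metadata scheme within the framework; "
            "for DLite this would be a data model URI."
        ),
    )

config = ParseConfig(
    mediaType="application/json",
    interoperabilityFramework="dlite",
    metadataScheme="http://onto-ns.com/meta/0.1/MyDataModel",
)
```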
jesper-friis changed the title from “Introduce Serialise and Upload strategies” to “Revise data models” on Apr 29, 2023
jesper-friis changed the title from “Revise data models” to “Revise data models and filter types” on Apr 30, 2023