
Revise data models and filter types #249

Open
jesper-friis opened this issue Mar 26, 2023 · 0 comments

The purpose of this issue is to get a clear picture of the data models and filter types provided by OTEAPI core. The intention is that it should result in a revision of the data models, strategies and documentation of OTEAPI core.

The main purpose of OTEAPI is to provide a structured way to document data sources and sinks and, based on that, to connect data consumers to data sources. The support for transformations should be seen as an added bonus.

Since OTEAPI core is agnostic to the underlying interoperability framework, it should not implement any filter type or strategy that depends on knowledge of the underlying interoperability framework (see the table below). Such strategies should be implemented in specialised OTEAPI plugins, like oteapi-dlite.

Filter types

The distinction between the terms filter type, which is conceptual, and strategy, which is a specific implementation, should be emphasised.

The generic form for a partial pipeline documenting a data source is

access -> parse -> mapping

while for a partial pipeline documenting a data sink it is

mapping -> generate -> deposit

Hence, there are 5 fundamental filter types (a combined source-to-sink pipeline is sketched after this list):

  • access fetches data from a data source and places it as-is in the data cache such that it is available for the parse strategy. It should refer to the data source or data service with downloadUrl or accessUrl/accessService, respectively. Provides accessibility. The configuration may include other metadata such as keywords (for basic findability) as well as license, description, etc. (for basic reusability).
  • parse converts the external data representation to our internal data representation (which is based on a selected interoperability framework). Use mediaType to identify the external data representation. New common keywords (defined in the pipeline ontology) should be coined to identify the internal data representation (DLite, SimPhoNy, MuPIF, ...) and the metadata scheme used within it. The parse strategy is identified by the combination of these keywords (mediaType/interoperabilityFramework/metadataScheme). Provides interoperability.
  • mapping maps the internal representation to ontological concepts. Identified by interoperabilityFramework. The mappingType could be kept optional, in case there are several mapping strategy implementations for a given interoperability framework. Provides semantic interoperability.
  • generate is the opposite of parse. It converts the internal data representation to the external data representation, storing the serialised result in the data cache. Provides interoperability.
  • deposit is the opposite of access. It writes the serialised result from generate to a data sink, like a file or an online service. Provides accessibility.
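
To make the intended composition concrete, below is a minimal sketch of a full source-to-sink pipeline, assuming an otelib-style client. The create_generate() and create_deposit() methods are hypothetical, mirroring the proposed filter types (they do not exist today), and all URLs and keyword values are placeholders.

```python
# Hypothetical sketch, assuming an otelib-style client.
# create_generate() and create_deposit() are made-up methods mirroring
# the proposed filter types; URLs and keyword values are placeholders.
from otelib import OTEClient

client = OTEClient("http://localhost:8080")

# access + parse (today provided by the combined resource filter)
source = client.create_dataresource(
    downloadUrl="https://example.org/data.json",
    mediaType="application/json",
)

# mapping: relate the internal representation to ontological concepts
mapping = client.create_mapping(mappingType="triples")

# proposed sink-side filter types (hypothetical methods)
generate = client.create_generate(mediaType="application/json")
deposit = client.create_deposit(uploadUrl="sftp://example.org/out.json")

# compose and execute the pipeline
pipeline = source >> mapping >> generate >> deposit
pipeline.get()
```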

In addition to these, we have:

  • resource, this is a composition of access and parse. Although it breaks the conceptually clean separation between access and parse, there are use cases where this combined filter makes sense, for instance when you access an online service with a Python library that represents the query result as a Python object that is not serialisable to the data cache. In this case it is easier to directly create the internal data representation (using the underlying interoperability framework), thereby eliminating the intermediate serialisation to the data cache. However, for data services that e.g. return a JSON payload, the separation into access and parse is preferable, since that improves reusability and clarity.
  • filter. Its main purpose is to update/specialise the configuration of other filters. A typical usage is to specialise query parameters for an access or deposit filter. This way, one can have a fixed pre-configured partial access->parse->mapping pipeline documenting a database, while still being able to specialise the query (see the sketch after this list). A filter strategy is typically designed to work together with a specific access or deposit strategy and should be independent of the underlying interoperability framework.
  • function, a synchronous transformation that runs directly on the server hosting the OTEAPI services. A typical use is explicit conversion between different data models of the underlying interoperability framework.
  • transformation, an asynchronous transformation running in the background. Intended for long-running transformations. It would be beneficial if we could make transformation filters agnostic to the underlying interoperability framework, while still being semantic. The best way to accomplish that is to refer to existing (interoperability-framework-dependent) partial pipelines for documenting the transformation input and output. This could borrow many concepts from the WrapperSDK, with the difference of being independent of AiiDA.
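
As a sketch of how a filter strategy could specialise a downstream access strategy, the class below pushes query parameters into the shared pipeline session, where a cooperating access strategy can pick them up. The class and field names are illustrative, not the current oteapi-core API.

```python
# Illustrative sketch (not the current oteapi-core API): a filter
# strategy that specialises the query of a downstream access strategy
# by returning configuration updates into the shared session.
from typing import Any, Dict, Optional
from pydantic import BaseModel

class QueryFilterConfig(BaseModel):
    """Hypothetical configuration for a query-specialising filter."""
    query: Dict[str, Any]  # e.g. {"element": "Al", "limit": 10}

class QueryFilterStrategy:
    def __init__(self, config: QueryFilterConfig) -> None:
        self.config = config

    def initialize(self, session: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Values returned here are merged into the pipeline session; a
        # cooperating access strategy can append them to its accessUrl
        # before fetching.
        return {"query": self.config.query}

    def get(self, session: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Nothing to do at execution time for this filter.
        return {}

# Example: specialise a pre-configured pipeline to query for aluminium.
strategy = QueryFilterStrategy(QueryFilterConfig(query={"element": "Al"}))
session_update = strategy.initialize()
```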

The different filters are summarised in the table below:

| Filter type | Identified by | IntOp. knowledge* | Level of data documentation | FAIR coverage |
|---|---|---|---|---|
| access | downloadUrl or accessUrl + accessService | no | cataloguing | accessibility (+ basic findability and reusability) |
| parse | mediaType + interoperabilityFramework + metadataScheme | yes | structural documentation | interoperability |
| mapping | interoperabilityFramework + mappingType | yes | semantic documentation | (semantic) interoperability |
| generate | mediaType + interoperabilityFramework + metadataScheme | yes | structural documentation | interoperability |
| deposit | uploadUrl or accessUrl? + accessService? | no | cataloguing | accessibility |
| resource | (downloadUrl or accessUrl + accessService) and mediaType + interoperabilityFramework + metadataScheme | yes | cataloguing + structural documentation | accessibility + interoperability (+ basic findability and reusability) |
| filter | filterType | no | - | - |
| function | interoperabilityFramework + functionType | yes | - | - |
| transformation | transformationType | no? | - | - |

*Whether the filter type has knowledge of/depends on the underlying interoperability framework.

The OTEAPI filter configurations cover three of the four levels of data documentation (cataloguing, structural documentation, contextual documentation and semantic documentation). The contextual data documentation is assumed to already exist in the associated knowledge base.

Backward compatibility

This issue suggests a few changes to OTEAPI core. These should be handled without breaking existing code, by adding deprecation warnings such that we can remove the deprecated features in a year's time.
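
For example, a deprecated strategy could emit a standard Python DeprecationWarning when instantiated instead of breaking; the strategy name and message below are illustrative only:

```python
# Illustrative sketch: warn on use of a deprecated strategy instead of
# removing it outright; the class name and message are placeholders.
import warnings

class DeprecatedUploadStrategy:
    def __init__(self, config):
        warnings.warn(
            "'Upload' strategies are deprecated and will be removed in a "
            "year's time; use the proposed 'deposit' filter type instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.config = config
```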

Pipeline Ontology (or Data Documentation Ontology?)

Common keywords that are shared between the configurations for the different filter types should be defined in the Pipeline Ontology. Should we rename it to Data Documentation Ontology (DDO)? It should use and build on DCAT as much as possible and have a clear connection to EMMO.

Examples of additional concepts and data properties that should be defined in this ontology are (a configuration sketch using some of them follows the list):

  • InteroperabilityFramework (class)
  • interoperabilityFramework (data property)
  • MetadataScheme (class)
  • metadataScheme (data property referring to the IRI identifying the metadata scheme within a given interoperability framework. For DLite this would be a data model URI)
  • mappingType
  • uploadUrl (could be a sub-property of dcat:downloadUrl)
  • depositUrl (really needed, or should we just use dcat:accessUrl?)
  • depositService (really needed, or should we just use dcat:accessService?)
  • filterType
  • functionType
  • transformationType
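
As a sketch of how the proposed keywords could surface in practice, the pydantic model below shows a hypothetical parse configuration using them; the field and class names are assumptions, not the current OTEAPI core API.

```python
# Hypothetical sketch of a parse configuration using the proposed
# keywords, assuming pydantic (which OTEAPI core builds on).  Field
# names follow this issue's proposals; values are placeholders.
from typing import Optional
from pydantic import BaseModel, Field

class ParseConfig(BaseModel):
    mediaType: str = Field(
        ..., description="External data representation, e.g. 'application/json'."
    )
    interoperabilityFramework: str = Field(
        ..., description="Internal data representation, e.g. 'dlite'."
    )
    metadataScheme: Optional[str] = Field(
        None,
        description=(
            "IRI identifying the metadata scheme within the framework; "
            "for DLite this would be a data model URI."
        ),
    )

config = ParseConfig(
    mediaType="application/json",
    interoperabilityFramework="dlite",
    metadataScheme="http://onto-ns.com/meta/0.1/MyDataModel",
)
```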
jesper-friis changed the title from “Introduce Serialise and Upload strategies” to “Revise data models” on Apr 29, 2023
jesper-friis changed the title from “Revise data models” to “Revise data models and filter types” on Apr 30, 2023