Conceptual model as single source of truth (chapters 2.1 and 7.1) #73

Open
RiittaA opened this issue Apr 28, 2023 · 6 comments
@RiittaA
RiittaA commented Apr 28, 2023

We think that UML is not a solid enough basis for conceptual modelling. We understand that UML has been chosen e.g. because of the graphical presentation, which may be easier for business users to comprehend. But this may result in problems in later phases of the process. Therefore, we suggest an approach where the concepts are expressed in OWL, and the tool visualises them automatically or semi-automatically. The visualisation can be presented in a UML-like notation. In short, the single source of truth should be based on a formal model, from which the different representations are derived.

  • SKOS vocabularies are conceptual models, as are OWL ontologies. SKOS is not logically rigorous to the same extent as OWL (due to the nature of skos:closeMatch, skos:broadMatch etc. properties).
  • The UML specification is not meant to be a formal verified specification in the sense of being logically internally coherent. It contains conceptual gaps (open definitions) and a wide latitude for interpretation. It is also a very heavy specification and not conceptually "lightweight" at all. In addition, the majority of UML modelers are not well-versed in MOF and have a rather pragmatic OOP-like interpretation of the principal structures. Many modelers only model with a specific stereotype for e.g. producing XML Schemas and see a direct 1:1 mapping between schemas and the UML Class diagram. There are nevertheless major gaps and differences in how class diagrams (and their limits) are interpreted. For example, many assume that polyhierarchies are disallowed by base UML because in their modeling domain it is prohibited. All in all, UML is not a conceptually stable basis, except when interpreting it with a very narrow common denominator in the user-base.
  • The single source of truth should be based on a formal model, from which the different representations are derived. A single source of truth should be formal, because we do not want to rely on differing conceptual interpretations of the base model. Instead, we want axioms on top of which we can form models and consistently end up with the same interpretations. OWL is based on description logics, which provide this kind of truth. Granted, there is always interpretation in how OWL constructs are applied to a domain, but the constructs themselves are mathematical entities and thus their structural semantics are unambiguous.
  • There is a fundamental recurring misunderstanding between UML modelers and the RDF domain in what a "class" encompasses. If our formal vocabulary is RDFS/OWL (as proposed by SEMIC), then we are talking about sets, not templates for types. We cannot give the wrong impression to modelers that e.g. attributes do not have an identity, whereas in OWL each DatatypeProperty is always an atomic individual and not owned by any class.
  • We must not confuse an application paradigm (e.g. OOP programming or data quality/lifecycle rules) with what the data conceptually is (i.e. what kind of information it is). The conceptual OWL model level must not attempt to enforce instance-related constraints on the conceptual definitions of the kinds of instances. Naturally, in instance data (ABox) we should require that a room is always part of exactly 1 building and that its life-cycle depends on that of the building. But this is separate from the conceptual, essentialist definition of a room (we can model an ontology where the necessary condition for something to be a room is that it is in exactly one building). Here we must also remember that, due to the OWA, only inconsistencies can be used to "invalidate" instance data: a missing link between a specific room and a building is just that - missing data, and not a cause for concern on the ontological side.
  • The whole purpose of a formal conceptual model is to enrich the data, and this is separate from the purpose of an application profile used to validate instance data. The first is primarily descriptive, the latter prescriptive (as also said in the Style Guide). The point is that the latter also operates within the conceptual frame provided by the first one, so here as well the possibilities for rich inferencing give us opportunities to validate data that has not been explicitly tagged as belonging to a certain class or having certain properties - those classes or properties might surface during inferencing, allowing us to make the application profiles simpler and more universal.
  • Using UML as a single source of truth is also risky in the sense that the explicit assertions in UML models are not always transmitted 1:1 between modeling applications, as the XMI spec doesn't unambiguously encompass all possible use-cases, and applications interpret it in varying ways. It usually works, but it is not a reliable transfer format.
  • UML, and comprehensive, reliable XMI support in particular, relies on costly modeling applications, which raises an adoption barrier for users. On the RDF side, there is a multitude of FOSS tools and freeware versions of commercial tools available, most of which rely on well-established libraries (TQ SHACL API, RDFLib etc.). Additionally, Turtle is an easy, lightweight format for reading the models alongside the graphical representations. With UML we would be constrained purely to the graphical model, as the XML representations are not user-friendly.
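The distinction drawn in the bullets above, between an ontological inconsistency and an application-profile violation, can be sketched in a few lines of plain Python. This is only an illustration and not an OWL reasoner: the triples, the `isPartOf` property name and the explicit set of differing individuals are hypothetical. Note also that a real reasoner lacking the "different individuals" statement would infer that the two buildings are the same individual rather than report a contradiction.

```python
# Hypothetical instance data as (subject, predicate, object) triples.
TRIPLES = {
    ("room1", "isPartOf", "buildingA"),
    ("room2", "isPartOf", "buildingA"),
    ("room2", "isPartOf", "buildingB"),
    # room3 deliberately has no isPartOf link at all.
}
ROOMS = {"room1", "room2", "room3"}

# Assumption: the two buildings are explicitly stated to be different individuals.
DIFFERENT = {frozenset({"buildingA", "buildingB"})}


def buildings_of(room, triples):
    """Return the set of buildings a room is linked to."""
    return {o for s, p, o in triples if s == room and p == "isPartOf"}


def open_world_inconsistencies(rooms, triples, different):
    """Rooms that contradict 'a room is part of at most one building'.

    Under the open-world assumption only a provable contradiction counts:
    two linked buildings that are known to be different individuals.
    A missing link is simply unknown and is not reported.
    """
    bad = set()
    for room in rooms:
        linked = sorted(buildings_of(room, triples))
        for i, first in enumerate(linked):
            for second in linked[i + 1:]:
                if frozenset({first, second}) in different:
                    bad.add(room)
    return bad


def closed_world_violations(rooms, triples):
    """Rooms failing an application-profile rule 'exactly one building'.

    A validator treats the data as complete, so a missing link is a violation.
    """
    return {room for room in rooms if len(buildings_of(room, triples)) != 1}
```

With this data, only `room2` is reported on the ontological side, while the profile-style check reports both `room2` and `room3`.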
@jitsedc jitsedc added this to the future work milestone Apr 28, 2023
@jitsedc jitsedc added the enhancement New feature or request label Apr 28, 2023
@costezki costezki added the question Further information is requested label Apr 28, 2023
@albertoabellagarcia

My experience is that json schema provides 90% of the needs (100% for most users) with a very simple syntax, lots of software libraries to integrate it and applications working on real scenarios in actual clients.

@ioggstream

Fully agree with @RiittaA on UML.

UML is not solid enough basis for conceptual modelling

+1 Moreover, UML was built with a specific focus: OOP. OOP classes are not RDF classes.

the concepts are expressed in OWL, and the tool visualises them automatically or semi-automatically.

+1 We cannot introduce OOP abstractions just to have diagrams.

Moreover, this is going to confuse people using UML for generating classes: they expect every single UML bit to be reflected in the actual running code!

@ioggstream

@albertoabellagarcia

json schema provides 90% of the needs (100% for most users) with a very simple syntax, lots of software libraries to integrate it and applications working on real scenarios in actual clients

I think json schema is great for implementation, but not for conceptual models.
The idea that Italy is working on is using json-schema keywords to map properties to RDF subjects using this specification https://www.ietf.org/archive/id/draft-polli-restapi-ld-keywords-02.html

This allows for easily adapting a conceptual model written in RDF/turtle to real implementations based on OpenAPI and JSON Schema.
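A minimal sketch of how such keywords could bridge a JSON Schema and an RDF vocabulary is shown below. The keyword names `x-jsonld-type` and `x-jsonld-context` reflect my reading of the linked draft and should be checked against it; the schema, the `https://example.org/vocab/` namespace and the `to_jsonld` helper are hypothetical and only illustrate the idea of attaching the schema's context to a plain JSON instance.

```python
import json

# Hypothetical schema. The x-jsonld-* keyword names are an assumption
# based on the draft referenced above; the vocabulary URI is made up.
PERSON_SCHEMA = {
    "type": "object",
    "x-jsonld-type": "Person",
    "x-jsonld-context": {
        "@vocab": "https://example.org/vocab/",
        "given_name": "givenName",
    },
    "properties": {
        "given_name": {"type": "string"},
    },
}


def to_jsonld(instance, schema):
    """Return a JSON-LD view of a plain JSON instance.

    The instance itself is left untouched; the context and type declared
    in the schema are added to a copy.
    """
    doc = dict(instance)
    doc["@context"] = schema["x-jsonld-context"]
    doc["@type"] = schema["x-jsonld-type"]
    return doc


if __name__ == "__main__":
    print(json.dumps(to_jsonld({"given_name": "Ada"}, PERSON_SCHEMA), indent=2))
```

The API payload stays ordinary JSON, and the link to the conceptual model lives in the schema rather than in every message.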

@bertvannuffelen
Contributor

I think we should first agree on one thing in this topic, namely the separation between a technical implementation format and the representation of a semantic conceptual model.
Many of the comments mix this: JSON has everything a developer needs, UML is a programming abstraction, RDF is the format of the Linked Data engineer, etc.
For building systems, these technical discussions are important, but that is not the topic of the SEMIC style guide.

The goal is to create a data specification that is implementation agnostic, focuses only on the semantics and has the ability to connect with implementation choices.
Unless you, as a system data engineer, are prepared to take a step back from your system context and discuss information structuring in a broad, system-agnostic way, you will end up in a discussion about mismatches or about misusing a representation.

This holds for any implementation context: e.g.

  • a soap engineer should not expect that the data structure is a hierarchy
  • a json rest engineer should not expect that the attributes are max 10 characters long and in camelcase
  • a Java OOP engineer should not expect that the diagram is an encoding of the object diagram
  • an RDF linked data engineer should not expect that cherry picking is allowed and that URIs are the sole way to identify the nodes.
  • a Relational DB engineer should not expect that the model is normalizable in a way that future changes will have a limited impact on its structure.

Each and every implementation context MUST define its implementation mapping. Sometimes the effort can be limited, sometimes this is extremely complex.
This is a key premise when we are discussing data specifications according to the SEMIC style guide: every community should take a step back from the representations that are provided. Do not interpret them as ready-to-cook implementation languages, but as a means to share a common semantic view.

Now coming back to UML.
As stated in the motivation of the SEMIC style guide, the goal is to use a graphical representation language for the conceptual model, preferably one that is adopted by the business analysis community, so that it helps to bridge inter-human communication.
Besides the occasional academic alternatives, I see UML being used everywhere. Boxes for classes, lines for associations (object properties) and attributes (data properties). Instead of inventing a new graphical notation, the SEMIC style guide states: let's use UML class diagram notation so that data specifications built by distinct organisations have a similar graphical representation. That will increase the common understanding. If everyone introduces their own legend for the graphical notation, then we miss that opportunity.

This has led us to provide a common guideline on how to exploit the UML class diagram notation to make a diagram that resonates with a semantic textual description. The latter is the final goal: to express a semantic data specification. It is not the objective to prepare an OOP system implementation.

Now you can argue about what the editorial environment should be: start from the semantic RDF-style representation and have the diagram derived from it, or start from the diagram and have the RDF derived from it.
This is an editorial choice, yet a very important one. Diagrams are very powerful, but only when they are condensed and not overloaded. Since a semantic data specification is a document that expresses an agreement between humans, each part should be somehow human friendly. Diagrams with 100's of classes and 1000's of lines crossing each other are not accepted by humans.
That is the motivation for choosing the diagram-to-RDF direction. In this direction automation is possible. In the other direction, the likelihood is high that one will create both representations independently, and thus synchronisation issues appear.
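To make the diagram-to-RDF direction concrete, here is a minimal sketch of such an automated derivation. It is not the SEMIC toolchain: the dictionary standing in for a class diagram, the `ex` prefix, the `https://example.org/model#` namespace and the `diagram_to_turtle` function are all hypothetical. Properties are emitted as standalone declarations, without binding them to a class, on the assumption that domain and cardinality constraints would live in a separate application profile.

```python
# Hypothetical, heavily simplified stand-in for a UML class diagram.
DIAGRAM = {
    "classes": {
        "Room": {"attributes": {"name": "xsd:string"}},
        "Building": {"attributes": {}},
    },
    # (source class, association name, target class)
    "associations": [("Room", "isPartOf", "Building")],
}


def diagram_to_turtle(diagram, prefix="ex", base="https://example.org/model#"):
    """Derive Turtle declarations from the simplified diagram structure."""
    lines = [
        f"@prefix {prefix}: <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "",
    ]
    # Every box in the diagram becomes a class.
    for name in sorted(diagram["classes"]):
        lines.append(f"{prefix}:{name} a owl:Class .")
    # Every attribute becomes a datatype property with its own identity.
    attributes = set()
    for spec in diagram["classes"].values():
        attributes.update(spec["attributes"])
    for attr in sorted(attributes):
        lines.append(f"{prefix}:{attr} a owl:DatatypeProperty .")
    # Every association line becomes an object property.
    for name in sorted({assoc[1] for assoc in diagram["associations"]}):
        lines.append(f"{prefix}:{name} a owl:ObjectProperty .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(diagram_to_turtle(DIAGRAM))
```

Because the RDF is always regenerated from the diagram, the two representations cannot drift apart, which is the synchronisation argument made above.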

So the arguments pro or contra UML class diagram notation should not be about its binding to OOP system implementations. Every used "formal-ish" syntax will suffer from that.
The argumentation pro and contra should be about what is the common graphical language we like to use as community to document our data specifications.

Note that this graphical notation discussion does not exclude the use of other visual representations.

@ioggstream

ioggstream commented Jul 13, 2023

Thanks @bertvannuffelen for your reply. I understand the practical goal you expect from using UML.

IMHO "exploiting" a notation/specification does not scale

  • It can work inside an organization.
  • It may work for a closed ecosystem.
  • It will fail at scale.

For example, look at the (apparently trivial) work on interoperability between YAML, JSON, JSON-Schema and JSON-LD here https://github.com/ietf-wg-httpapi/mediatypes: weeks of analysis with various implementers to avoid conflicts on the fragment identifier, the standardization of JSON-Schema media type is on hold until YAML mediatype will be published, the YAML-LD work was spun off to the YAML-LD... Long story short, when you "exploit" specs there's always more than meets the eye.

The argumentation pro and contra should be about what is the common graphical language we like to use as community to document our data specifications.

Reading https://github.com/SEMICeu/style-guide/blame/c444c915841fff0befc8ccc335d0175aed9b1c12/docs/modules/ROOT/pages/arhitectural-clarifications.adoc#L80

UML conceptual models can be used as the single source of truth

I understood the problem to be that UML was the language used for defining the models, not just for rendering them. I think the problem is the above sentence. Instead, it is OK to define:

  • a constrained subset of RDF for single source of truth
  • a mapping to render the above RDF in UML, thus ensuring a consistent visualization
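A minimal sketch of these two pieces follows, assuming a constrained subset in which every property has exactly one `rdfs:domain` and one `rdfs:range`. The triples, the `ex:` terms and the `render_plantuml` function are hypothetical; PlantUML text is used here only as one convenient UML-like output.

```python
# Hypothetical model in the constrained subset: each property has
# exactly one rdfs:domain and one rdfs:range.
TRIPLES = [
    ("ex:Room", "rdf:type", "owl:Class"),
    ("ex:Building", "rdf:type", "owl:Class"),
    ("ex:isPartOf", "rdf:type", "owl:ObjectProperty"),
    ("ex:isPartOf", "rdfs:domain", "ex:Room"),
    ("ex:isPartOf", "rdfs:range", "ex:Building"),
    ("ex:name", "rdf:type", "owl:DatatypeProperty"),
    ("ex:name", "rdfs:domain", "ex:Room"),
    ("ex:name", "rdfs:range", "xsd:string"),
]


def local(term):
    """Strip the prefix from a prefixed name, e.g. 'ex:Room' -> 'Room'."""
    return term.split(":", 1)[1]


def render_plantuml(triples):
    """Render the constrained RDF subset as PlantUML class diagram text."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    domain = {s: o for s, p, o in triples if p == "rdfs:domain"}
    range_ = {s: o for s, p, o in triples if p == "rdfs:range"}

    out = ["@startuml"]
    # Classes become boxes; datatype properties are drawn as attributes.
    for cls in sorted(s for s, t in types.items() if t == "owl:Class"):
        out.append(f"class {local(cls)} {{")
        for prop in sorted(types):
            if types[prop] == "owl:DatatypeProperty" and domain[prop] == cls:
                out.append(f"  {local(prop)} : {local(range_[prop])}")
        out.append("}")
    # Object properties become labelled association arrows.
    for prop in sorted(s for s, t in types.items() if t == "owl:ObjectProperty"):
        out.append(f"{local(domain[prop])} --> {local(range_[prop])} : {local(prop)}")
    out.append("@enduml")
    return "\n".join(out)


if __name__ == "__main__":
    print(render_plantuml(TRIPLES))
```

The same mapping applied to every model gives the consistent visualization, while the RDF stays the only place where the model is edited.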

this graphical notation discussion does not exclude the use of other visual representations

I have no problem in using UML just for data visualization.

@bertvannuffelen
Contributor

Thanks @bertvannuffelen for your reply. I understand the practical goal you expect from using UML.

IMHO "exploiting" a notation/specification does not scale

  • It can work inside an organization.
  • It may work for a closed ecosystem.
  • It will fail at scale.

For example, look at the (apparently trivial) work on interoperability between YAML, JSON, JSON-Schema and JSON-LD here https://github.com/ietf-wg-httpapi/mediatypes: weeks of analysis with various implementers to avoid conflicts on the fragment identifier, the standardization of JSON-Schema media type is on hold until YAML mediatype will be published, the YAML-LD work was spun off to the YAML-LD... Long story short, when you "exploit" specs there's always more than meets the eye.

I am not sure what you want to argue here, but the complexity of aligning technical representations is out of scope. What is within (future) scope is that a semantic data specification should provide the anchors to make a YAML implementation connectable with a JSON implementation (i.e. the area of artefact generators).
For me the "implementation distance" from the semantic data specification to any implementation representation is roughly the same (whether it is implemented in XML, JSON, EDIFACT, RDF, Java, ...), because the same decisions always have to be made.

Technical formats and decisions are by definition limited to an ecosystem or organisation. The goal of this style guide is not to fix a single XSD schema that everyone has to use, but to describe the semantics in such a way that profiling is transparently documented.
So if I implement DCAT-AP in my country, then I know the rules for how to further profile it for my country: e.g. I can enforce the need for a contact point with an email even if DCAT-AP does not specify that. Preferably I will do that in such a way that another country can interpret it, for instance by using vCard as the ontology. When I publish my DCAT-AP profile in my country, it should contribute to the DCAT ecosystem in a seamless way, without the need to push my rules into the general DCAT.
If my implementation is GeoNetwork XML based and another is CKAN JSON based, then although these formats/data structures are technically incompatible, the knowledge that both adhere to common semantics means that it is possible for both my implementation and the other implementation to produce the data in a commonly agreed technical format expressing the data in the same semantics.
This common, system-agnostic technical format is often an RDF serialization. But that is actually a side effect of the choice to denote our terms unambiguously with dereferenceable URIs.
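The profiling idea can be sketched as follows. This is an illustration only: the property names are borrowed from DCAT for readability, but the minimum cardinalities in `BASE_PROFILE` and `NATIONAL_PROFILE` are made-up values and not the actual DCAT-AP rules, and `refine` and `missing_properties` are hypothetical helpers. The point is that a national profile may tighten the base profile but never loosen it, so data valid nationally stays valid in the wider ecosystem.

```python
# Hypothetical minimum cardinalities; NOT the actual DCAT-AP values.
BASE_PROFILE = {"dct:title": 1, "dcat:contactPoint": 0}
# A national profile that additionally requires a contact point.
NATIONAL_PROFILE = {"dcat:contactPoint": 1}


def refine(base, overrides):
    """Return a stricter profile; refuse any override that loosens the base."""
    merged = dict(base)
    for prop, minimum in overrides.items():
        if minimum < base.get(prop, 0):
            raise ValueError(f"{prop}: a profile may not relax the base profile")
        merged[prop] = minimum
    return merged


def missing_properties(record, profile):
    """List the properties for which the record has too few values."""
    return sorted(
        prop for prop, minimum in profile.items()
        if len(record.get(prop, [])) < minimum
    )


if __name__ == "__main__":
    record = {"dct:title": ["Air quality measurements"]}
    national = refine(BASE_PROFILE, NATIONAL_PROFILE)
    print(missing_properties(record, BASE_PROFILE))
    print(missing_properties(record, national))
```

A record that passes the national profile necessarily passes the base profile as well, whichever technical format the record was originally stored in.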
