Drop the range of dcat:keyword #1585

Open
kvistgaard opened this issue Dec 6, 2023 · 10 comments
Labels
dcat future-work issue deferred to the next standardization round

Comments

@kvistgaard

Because the range of dcat:keyword is rdfs:Literal, application profile designers turn to alternatives such as dcterms:subject, which reduces interoperability with catalogues that use dcat:keyword.

A common SHACL shape in the EU is:

:Dataset-subject
  a sh:PropertyShape ;
  sh:path dcterms:subject ;
  sh:description "The value of this property is a keyword or tag describing the Data asset. It only allows values from the EuroVoc vocabulary http://eurovoc.europa.eu/ "@en ;
  sh:name "subject"@en ;
  sh:node [
      a sh:NodeShape ;
      sh:property [
          sh:path skos:inScheme ;
          sh:hasValue <http://eurovoc.europa.eu/100141> ;
        ] ;
    ] ;
  sh:nodeKind sh:IRI .

It would be nicer to use the dedicated dcat:keyword.

@jakubklimek
Contributor

Are you suggesting a mix of literals and resources on dcat:keyword, like this?

<dataset> dcat:keyword "Keyword literal"@en , <http://eurovoc.europa.eu/100141> .

If so, I do not think this will improve interoperability.

  1. Every implementation would now have to change to expect both literals and resources, and for resources the names would have to be found somewhere else.
  2. For your use case, there is dcat:theme, which can be used with controlled vocabularies. The difference from dcat:keyword is exactly that - keywords for free text (no controlled vocabularies) and themes for controlled vocabularies.

I think the current state is fine and we should not change that.

@kvistgaard
Author

kvistgaard commented Dec 6, 2023

No, I only suggest dropping the range (in fact, I would suggest dropping almost all ranges and leaving that to application profiles).
For dcat:theme, there is a dedicated NAL, http://publications.europa.eu/resource/authority/data-theme, usually with one value. For keywords, there are always multiple values from EuroVoc, and that is what I apply and keep suggesting.

@jakubklimek
Contributor

Well, dropping the range effectively means supporting the case above, which in my opinion lowers interoperability.
For dcat:theme, the NAL is dedicated in DCAT-AP, not in DCAT. And, there are ongoing discussions about profiling dcat:theme in DCAT-AP:
SEMICeu/DCAT-AP#316
SEMICeu/DCAT-AP#314

@dr-shorthair
Contributor

The distinction between

  1. dcat:keyword - range rdfs:Literal (datatype property)
  2. dcat:theme - range skos:Concept (object property)

has been in place since DCAT v1.
If you need the value to be a term from a controlled vocabulary, denoted by a URI, use dcat:theme.
If you want a text term, use dcat:keyword.

Bad habits developed in projects can't be fixed by modifying DCAT for everyone.
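
To make that distinction concrete, here is a minimal Python sketch of a check a consumer might run. It is an illustration only: the `Literal` and `IRI` classes, the `range_violations` helper and the example theme IRI are hypothetical stand-ins, not part of DCAT or of any RDF library, and a real implementation would more likely rely on SHACL or an RDF toolkit.

```python
from dataclasses import dataclass
from typing import Optional

DCAT = "http://www.w3.org/ns/dcat#"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal (text plus optional language tag)."""
    value: str
    lang: Optional[str] = None


@dataclass(frozen=True)
class IRI:
    """Hypothetical stand-in for a resource denoted by a URI."""
    value: str


# Expected kind of value per property, following the ranges quoted above.
EXPECTED_KIND = {
    DCAT + "keyword": Literal,  # range rdfs:Literal: free text
    DCAT + "theme": IRI,        # range skos:Concept: term denoted by a URI
}


def range_violations(statements):
    """Return the (property, value) pairs whose value kind does not match."""
    return [
        (prop, value)
        for prop, value in statements
        if prop in EXPECTED_KIND and not isinstance(value, EXPECTED_KIND[prop])
    ]


statements = [
    (DCAT + "keyword", Literal("Keyword literal", "en")),
    (DCAT + "keyword", IRI("http://eurovoc.europa.eu/100141")),
    (DCAT + "theme", IRI("http://example.org/theme/environment")),  # made-up IRI
]
violations = range_violations(statements)
print(violations)
```

With the current ranges, only the URI used as a dcat:keyword value is reported; dropping the range would make that statement acceptable and move the decision to each consumer.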

@kvistgaard
Author

@dr-shorthair I'm aware that the distinction dates from v1. The intention of raising this issue was to improve DCAT, not to make it suitable for a particular case. And speaking of bad habits, over-axiomatizing ontologies is definitely a bad habit in RDFS and OWL modelling in general, and not specific to DCAT. But there is hope. A handy recent example is the range of dcterms:type, which was dropped after having been in place for a much longer time than that of dcat:keyword. So, if anything, I might be raising this issue too early, not too late.

@bertvannuffelen

I support the reaction from @jakubklimek. In this case the usage situation is clear and clean, and not restrictive.

In short:

  • When there is a need to associate a term that is not controlled by any list, in some language, and the intent is not to add further metadata about that term to a Dataset, then I want to express it as a literal. Hence, I use dcat:keyword.
  • When there is a need for additional control on the term, e.g. legally certified translations or an agreement by a group, then I use a controlled vocabulary. Hence, I want skos:Concepts (or similar) and not a literal. Hence I use a subproperty of dct:subject.

In the last case, dcat:theme is a special subproperty: namely the theme to which the Dataset is associated in the Catalogue. In this special case there is hopefully also no discussion about whether that could be a Literal. And note that for one profile, the theme of another profile can be considered just another categorisation.

So instead of calling this a bad practice, note that in this case the range Literal versus Concept corresponds to a business need. Both nicely address two distinct levels of harmonisation in associating terms with datasets to make them easier to find in a catalogue, by free-text search or by faceted browsing.

By mixing, as illustrated by Jakub, DCAT would state that implementations must accept and be able to process both at the same time. That will create more implementation friction than gain. Lifting the distinction between data property and object property must be done with care. And in this case it will not create added value, only more confusion.

Maybe what you stumble over is that the subproperty of dct:subject is not named 'keyword' when you use it in an implementation just as a keyword: that is a different discussion.

@l00mi

l00mi commented Feb 10, 2025

We intend to open the range for dcat-ap-ch to also allow skos:Concept and schema:DefinedTerm.

While we understand that there is dcat:theme for clearly closed vocabularies, which allows skos:Concept, in the use case of Switzerland, with its multilingualism, we see the need for translated keywords for search purposes without having a fixed set of terms (a CV).

(By simply allowing language tags on keywords, we cannot tell which keywords in different languages correspond to each other once a dataset has multiple keywords.)
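
The matching problem can be shown with a small Python sketch. The data is illustrative only: the French and administrative terms and the `https://example.org/term/...` identifiers are assumptions made up for this example, and `by_language` and `translate` are hypothetical helpers.

```python
# Flat, language-tagged keywords, as plain dcat:keyword literals would carry them.
flat_keywords = [
    ("Hochwasser", "de"),
    ("inondation", "fr"),
    ("Verwaltung", "de"),
    ("administration", "fr"),
]


def by_language(keywords):
    """Group keyword texts by language tag; this is all the flat form supports."""
    grouped = {}
    for text, lang in keywords:
        grouped.setdefault(lang, set()).add(text)
    return grouped


# The same keywords attached to a shared identifier (hypothetical IRIs).
concept_keywords = {
    "https://example.org/term/flood": {"de": "Hochwasser", "fr": "inondation"},
    "https://example.org/term/administration": {
        "de": "Verwaltung",
        "fr": "administration",
    },
}


def translate(label, source_lang, target_lang, concepts):
    """Find the label in the target language that shares an identifier."""
    for labels in concepts.values():
        if labels.get(source_lang) == label:
            return labels.get(target_lang)
    return None


grouped = by_language(flat_keywords)
print(grouped["de"])  # two German keywords, with no link to their French counterparts
print(translate("Hochwasser", "de", "fr", concept_keywords))
```

In the flat form nothing states that "Hochwasser" and "inondation" describe the same thing; the correspondence only exists once both labels hang off one identifier.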

We see the potential added friction for implementations, but here we place a higher value on the proper description of datasets in a multilingual context.

We would be pleased to be kept in the loop on this topic.

@bertvannuffelen

@l00mi I think you are at risk of overengineering the keywords.
If you plan to have words in multiple languages that must match each other, you are taking on a high editorial effort.

If this is a fixed, well-established list, it is better to stay in the category area, with a codelist in which you provide controlled translations (subproperties of dct:subject).

If these are keywords (i.e. arbitrary words that publishers like to attach to their catalogued resources), then an automatic translation service is very useful for that purpose.
Your objective, "trying to maintain random words with a structured mapping to another language", is very resource intensive: you ask a lot either from the publisher (to translate each term being used) or from another team that has to ensure coherency.
I would take the automatic translation route and allow users to enter any term or word in any language.
Your Elasticsearch engine will then perform in a very similar way in each language (yet possibly rank the answers differently).

@l00mi

l00mi commented Feb 10, 2025

@bertvannuffelen Thank you for your valuable feedback.

I understand your concern regarding the high editorial effort, and I understand the connection to dcat:theme for clear-cut CVs with a closed set of categories.

The goal we would like to achieve is an open set of multilingual tags. In the Swiss context, we foresee using https://termdat.bk.admin.ch for this (e.g. https://www.termdat.bk.admin.ch/entry/64371), plus links to Wikidata where entries are missing (far from a "fixed" list).

The flip side, in a multilingual context, is the editorial effort needed to properly translate keywords into the different languages (which we are legally obliged to do). A loose set of keywords in particular is not easy to translate properly with automatic translation. We would therefore much rather use the know-how of the publishers to help here.

For your reference, we have a first draft of a usage note for our dcat-ap-ch here: https://github.com/opendata-swiss/eCH-0285-Use-of-controlled-vocabularies/blob/main/document/05_keywords.md

I hope this helps to explain our goal and to distinguish it from dcat:theme and dct:subject.

@bertvannuffelen

My personal opinion is that mixing literals and structured values is generally a bad practice and should be avoided as much as possible.
Enabling this not only impacts your local data catalogue but the whole data network of data catalogues through harvesting.

Considerations to take into account:

  1. Any harvesting data portal will face the situation that some values of dcat:keyword are URIs and thus have to be resolved to obtain a string value (a mixture of values).
  2. You have to decide which string value is attached to the URI. That creates an even larger problem for the harvesting data portal, as it does not know your decision. E.g. Wikidata concepts are not all skos:Concept, so you cannot rely on the assumption that skos:prefLabel is the way to find the label.
  3. Any Swiss city data portal using current DCAT as the reference for its data schema must adapt its software to match your profile decision. Thus you impose costs: it now has to support all cases.
  4. You cannot avoid blank nodes.
  5. Every profile has to write extensive usage notes to explain its choices. Such usage notes are sources of inter-profile conflicts.

In the example below I put all possible values that could be technically provided when opening up the range. I have seen them all on the same property.

@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

<https://swisstopo/opendata/dataset/1234>
  a dcat:Dataset ;
  dcat:keyword
    <https://register.ld.admin.ch/termdat/215878> ,              # resolve externally using curl
    <https://register.ld.admin.ch/termdat/215878-humpydumpy> ,   # a non-existing value to be resolved externally using curl
    <http://www.eionet.europa.eu/gemet/concept/100> ,            # resolve internally in the provided data
    _:node123 ,                                                  # resolve internally with rdfs:label for the string value
    _:node124 ,                                                  # resolve internally with schema:name for the string value
    "hochwasser"@de ,                                            # language-tagged string
    "water" ,                                                    # plain string
    "https://register.ld.admin.ch/termdat/215878"^^xsd:anyURI .  # resolve externally, encoded as a string value

<http://www.eionet.europa.eu/gemet/concept/100>
  skos:prefLabel "administrative body"@en .

_:node123
  rdfs:label "administrative body" .      # Is the decision for rdfs:label made by W3C, a local profile or the maintainer of the codelist?

_:node124
  schema:name "administrative body"@en .  # Is the decision for schema:name made by W3C, a local profile or the maintainer of the codelist?
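
To illustrate what a harvester would have to do with such data, here is a minimal Python sketch. It uses plain tuples and dictionaries rather than an RDF library; the `keyword_text` helper, the `("literal", ...)` / `("node", ...)` encoding and the order of `LABEL_PREDICATES` are assumptions made for this example, not anything DCAT prescribes, and the `https://schema.org/` namespace is likewise an assumption.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SCHEMA_NAME = "https://schema.org/name"

# The order is an arbitrary local choice; another portal may choose differently.
LABEL_PREDICATES = (SKOS_PREF_LABEL, RDFS_LABEL, SCHEMA_NAME)


def keyword_text(value, local_labels):
    """Return the display text for one dcat:keyword value.

    value is ("literal", text) or ("node", identifier), where the identifier
    is an IRI or a blank node label. None means that no label is available
    in the harvested data, so external resolution would be needed.
    """
    kind, content = value
    if kind == "literal":
        return content
    labels = local_labels.get(content, {})
    for predicate in LABEL_PREDICATES:
        if predicate in labels:
            return labels[predicate]
    return None


# Labels available inside the harvested data, mirroring the example above.
local_labels = {
    "http://www.eionet.europa.eu/gemet/concept/100": {
        SKOS_PREF_LABEL: "administrative body"
    },
    "_:node123": {RDFS_LABEL: "administrative body"},
    "_:node124": {SCHEMA_NAME: "administrative body"},
}

values = [
    ("literal", "water"),
    ("node", "http://www.eionet.europa.eu/gemet/concept/100"),
    ("node", "_:node123"),
    ("node", "_:node124"),
    ("node", "https://register.ld.admin.ch/termdat/215878"),
]
texts = [keyword_text(value, local_labels) for value in values]
print(texts)
```

Only the literal needs no decision at all. Every other value depends on which label predicate the harvester happens to try, and the last one cannot be resolved without an external request.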

In your guidelines you make a lot of assumptions but still leave the door open for any of the above cases. Which of the cases do you not want to support?
The problem for me is that your selection is arbitrary: it will fit your needs, but another data portal would like it differently. It is already very challenging with only structured values. Adding string values into the mix makes it even more complicated.

My point of concern is that dcat:keyword is about an "uncontrolled use of range of values". That is best captured by a string/langstring approach. This is the simplest and most naive method to tag datasets to increase their findability. We should not give up simple methods if there are already valid approaches in the specification that could support a more structured approach. Observe that any (semi-)controlled approach can be mapped into this representation.

Two relatively simple approaches to meet your requirements:
a) When the only way data gets into your portal is through an editorial form controlled by the portal: you encode the codelists in the form's selection box and turn the selected values into string values on storage.
b) If harvesting from other portals is involved: create subproperties of dct:subject with the appropriate codelist as range restriction, and turn their values into literal values when DCAT is requested.
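
Approach b) could be sketched in Python as follows. This is only an illustration under stated assumptions: `to_keyword_literals` is a hypothetical helper, the codelist is represented as a plain dictionary, the German label is the one used in this thread, and the French label is made up for the example.

```python
def to_keyword_literals(concept_iris, pref_labels):
    """Turn codelist concept IRIs into (text, lang) keyword literals.

    pref_labels maps a concept IRI to its labels per language, as published
    by the codelist. Concepts without a known label yield no literal.
    """
    literals = []
    for iri in concept_iris:
        for lang, text in sorted(pref_labels.get(iri, {}).items()):
            literals.append((text, lang))
    return literals


# Labels per concept; the "fr" entry is an assumption for illustration.
pref_labels = {
    "https://register.ld.admin.ch/termdat/215878": {
        "de": "hochwasser",
        "fr": "inondation",
    },
}

# Values stored on the profile's own subproperty of dct:subject.
subject_values = ["https://register.ld.admin.ch/termdat/215878"]

# What would be exposed as dcat:keyword when plain DCAT is requested.
keywords = to_keyword_literals(subject_values, pref_labels)
print(keywords)
```

The structured values stay on the profile's own property, while a harvester that only knows DCAT sees language-tagged strings on dcat:keyword.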

Because you mention that the termdat system has a legal basis for use, I am even more in favour of the second approach. There you can make a straightforward distinction between the ID use, https://register.ld.admin.ch/termdat/215878, and the equivalent literal use, "hochwasser"@de. (From a SKOS perspective they are highly equivalent, as SKOS requires each prefLabel to uniquely identify a concept.)


7 participants