Using identifierSpace when the entity set uses a mixture of namespaces #139
Comments
That idea of using full URIs as IDs and an empty |
OpenRefine is not the main client tool being targeted. OpenRefine is just one client that follows the reconciliation spec and OpenRefine will continue to evolve as the reconciliation spec evolves as well through versioning. We hope that other client tools and services might adapt the spec and provide feedback which helps standardize concepts of record linkage, linked data, and data augmentation and knowledge merging. I hope the spec and any documentation through sites we control within W3C and GitHub are very clear about those facts where OpenRefine is not the target but merely a client using the spec. If you read otherwise, please let us know where so we can make that clear. Regarding your specific namespace questions, I'll let others chime in with their thoughts. |
Thanks for the clarification @thadguidry . Indeed I know that OR is nowadays just one client among others and my phrasing was not ideal. But I think it has a certain status above others (more equal than other clients?) because of the history of the API. Rephrasing what I was trying to say above: I don't quite understand the use case for identifierSpace in its current form. It didn't seem to matter to OpenRefine at least. Should it really be a MUST in the spec? |
@osma I personally think that namespaces should be optional. Reconciling as a process has only a few core concepts (entity, property, etc.), and a simple reconciliation service about dog breeds or house roof architecture styles might not need to say anything about its entities being classified in a formalized schema or namespace, i.e. it might simply provide an id and an entity name and nothing further in a result, with the thought being that its id MIGHT be used or referenced in another service or throughout the linked data ecosystem ... but we should not push a service to provide a namespace. It would be a recommendation for best practices if they want to be a good citizen in the RDF world or, more lightly, in the linked data world. So I think we can change our phrasing to mention some of that. Namespaces are optional, but encouraged if you intend to participate in the linked data ecosystem. Interested to hear what others think of my views here. |
@fsteeg wrote:
Thank you for the pointer! This aligns pretty well with my thinking. In my view, the problem is not the use of URIs as IDs as such, but the (apparent) requirement in the spec to put the base URI in identifierSpace and only the local part as the ID in reconciliation results. It's doable when the set of entities comes from a single known namespace, but that's not always the case in some real world settings such as a mixture of complementary vocabularies (see OP). @thadguidry Thanks for your thoughts. Regarding this:
To be clear, I didn't intend to propose scrapping linked data URIs entirely, just avoiding the need for an identifierSpace and instead putting the full URI in the ID field. In my understanding, one can be a good RDF / Linked Data citizen without having all entities in the same URI namespace! Should I propose (as a PR) an amendment to the spec that makes identifierSpace optional, or at least make it clear that an empty string is a valid value for it? |
OpenRefine has the largest user base and longest history of experience with reconciliation services, so naturally has a prominent role in future definitions.
Without a namespace to qualify it, how would one know where this unqualified ID could be used? External out-of-band communication / coordination? That seems HUGELY suboptimal to me. I think that the concept of a single namespace for all identifiers served by a given service is obsolete (and probably was never ideal). Identifiers SHOULD resolve to fully qualified IRIs, whether that be through the declaration of a default namespace or perhaps, more flexibly, through declaring one or more namespace prefixes, as is done in XML & RDF, and then using them in conjunction with the identifiers. |
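To make the prefix idea concrete, here is a minimal sketch of how a client could expand prefixed identifiers into full IRIs. Nothing in it comes from the spec: the PREFIXES table, the prefix names, the example.org namespace and the expand helper are all assumptions made up for illustration.

```python
# Hypothetical prefix declarations a service could publish; the prefix
# names and the example.org namespace are made up for this sketch.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "gnd": "https://d-nb.info/gnd/",
    "ext": "https://example.org/extension-vocab/",
}


def expand(identifier, prefixes=PREFIXES):
    """Expand a 'prefix:local' identifier into a full IRI.

    Identifiers whose leading part is not a declared prefix (for example
    identifiers that already are full IRIs) are returned unchanged.
    """
    prefix, sep, local = identifier.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return identifier


expanded = [expand(i) for i in ["wd:Q42", "gnd:118540238", "https://example.org/x"]]
print(expanded)
```

With this approach one service can mix identifiers from several namespaces, and an identifier that is already a full IRI passes through untouched because its scheme is not a declared prefix.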
I think we could remove the identifierSpace and schemaSpace altogether, and instead use the view templates. The existing view template already specifies how the entity ids are turned into URIs. We could add other view templates to turn the properties and types into URIs. If a service returns URIs from different providers, then they likely use the entire URIs as ids, which is likely reflected by their view template. So for me, the change would be:
|
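As a rough illustration of the view-template idea, the sketch below substitutes an entity id into a URL template containing an {{id}} placeholder. The entity_url helper and both template values are assumptions for this example, not text from the spec; the second template shows the case where a service already uses entire URIs as ids.

```python
def entity_url(view_template, entity_id):
    """Substitute the entity id into a view URL template.

    Hypothetical helper; it only handles the '{{id}}' placeholder.
    """
    return view_template.replace("{{id}}", entity_id)


# A service whose ids are local names within a single namespace
# (example.org is a placeholder, not a real service).
local_view = "https://example.org/entities/{{id}}"

# A service that returns full URIs as ids: the template is just the id.
full_uri_view = "{{id}}"

print(entity_url(local_view, "Q42"))
print(entity_url(full_uri_view, "http://www.wikidata.org/entity/Q42"))
```

In the second case the client gets the full URI without needing to know any namespace, which is what makes an identifierSpace redundant for URI resolution.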
Thanks @wetneb , that sounds like a nice plan. In my view that feels like two similar but independent sets of changes:
The first set deals with identifierSpace, the second one with schemaSpace. I'm mainly interested in the first one as it's the more urgent problem right now. Shall I open a PR implementing the first one only, or a PR that does both, or two separate PRs one for each? |
Great! Feel free to open a PR in any form, I'd say :) |
I started editing the spec to prepare for a PR, but soon stumbled on this under 7.4 Data Extension Responses (emphasis mine):
So removing identifierSpace is not that simple as the spec for Data Extension Responses also relies on it. Here it seems to be used not just as a mechanism for expanding local entity IDs into global URIs/IRIs, which can indeed be replaced with view templates, but as a means of directing clients to another relevant reconciliation endpoint. I am not so familiar with the use cases for Data Extension so I'm not sure how important this mechanism is and whether it could be replaced with something else, if we decide to drop the notion of identifier spaces from the spec. |
It does not seem to be a big blocker to me, as the notion of identifier space is just used as a proxy to say "those identifiers are valid for this service". How about something like this? (rough formulation, the language can certainly be improved)
|
Newbie contributor question: I see that there are many example files under draft/examples/ which more or less correspond to the current spec text. If I make a PR that changes the spec, should I also adjust the examples accordingly? And in that case, all the examples, or is it enough to change just the ones that have been included in the rendered spec HTML using the ReSpec Asking because there are naturally quite a few examples that use |
Never mind, I already did that. I opened PR #140 which drops both identifierSpace and schemaSpace. Let's see what people think about this somewhat drastic change :) |
There has been some discussion on identifierSpace and schemaSpace before, e.g. in issue #3 and PR #76. The definitions of these have shifted over time. The current definition, in both the latest draft spec and version 0.2, of identifierSpace is:
We are currently implementing reconciliation API support for Annif (see NatLibFi/Annif#734) and providing the identifierSpace information has caused some headache. Returning the service manifest is mandatory, and also the identifierSpace information is mandatory within the manifest: "A reconciliation service MUST define two URIs [...] identifierSpace ... schemaSpace"
Service manifest Example 1 given in the spec uses this identifierSpace:
(FWIW, I would like to point out that this doesn't seem to match well with the definition - IIRC this is not the URI namespace prefix for any Getty vocabulary, but a URI/URL of a web page explaining them. But that is a separate problem, maybe the example is just outdated.)
Annif uses SKOS vocabularies internally and often those vocabularies use a specific URI namespace; in my understanding, this would be the natural value for identifierSpace. But Annif is currently unaware of this namespace, and there is nothing in principle preventing a vocabulary from using a mixture of namespaces. For example, a vocabulary could consist of a mixture of Wikidata and GND entities. A perhaps more realistic example would be a mixture of YSO concepts and those of a domain-specific extension vocabulary such as KAUNO (fiction literature), JUHO (public administration) or TERO (health and welfare), all of which are extensions of YSO - you can think of them naively as additional concepts to add on top of YSO - that use their own URI namespace which is different from YSO.
So what should Annif return in the service manifest for a project that uses a vocabulary whose URI namespace it isn't aware of? Should it look at all the concept URIs and try to infer what is the longest common prefix? What if the URIs are a mixture of namespaces and there is nothing in common - say, a mixture of http and https URIs?
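The longest-common-prefix idea is easy to try out. Below is a minimal sketch, assuming plain Python and the standard library function os.path.commonprefix, which compares strings character by character. The helper name infer_identifier_space and the sample entity URIs are illustrative only; this is not how Annif works.

```python
import os.path


def infer_identifier_space(uris):
    """Return the longest common character prefix of the given URIs.

    Hypothetical helper for illustration; not part of Annif or the spec.
    """
    return os.path.commonprefix(list(uris))


# Entities from a single namespace.
single = [
    "http://www.wikidata.org/entity/Q42",
    "http://www.wikidata.org/entity/Q1339",
]

# A mixture of Wikidata (http) and GND (https) entities.
mixed = [
    "http://www.wikidata.org/entity/Q42",
    "https://d-nb.info/gnd/118540238",
]

single_prefix = infer_identifier_space(single)
mixed_prefix = infer_identifier_space(mixed)
print(single_prefix)  # http://www.wikidata.org/entity/Q
print(mixed_prefix)   # http
```

Even in the single-namespace case the character-wise prefix overshoots the namespace boundary (it swallows the leading "Q" of the local names), and for the http/https mixture it degenerates to "http", which is not a namespace at all.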
Or should the value be something more custom (somewhat like the Getty document in the example) that isn't really a URI namespace at all, but is unique to the vocabulary / entity set? For example, the reconciliation service at /rest/v1/projects/myproject/reconcile could return an identifierSpace of /rest/v1/vocabs/myvocab (i.e. the vocabulary used by myproject). That doesn't seem to match the current definition of identifierSpace, as it talks specifically about URI namespace prefixes, but would at least be a shared identifier that could also be referenced by other endpoints at the same Annif instance which use the same underlying vocabulary.
Or is it OK to return an identifierSpace of "" (the current quick-and-dirty solution in the Annif draft PR) since it seems to work fine with OpenRefine - apparently this information is not used at all. Maybe providing identifierSpace shouldn't be a MUST in the spec, if it's actually not used by the main client tool that this API is targeting.