New slot mapping_id #359
Comments
Looks like see_also can be used to hold the URI of the individual mapping, can't it? |
I renamed the issue now, and I keep stumbling over the fact that this issue has been discussed many times in the context of other issues like #91. Adding I see three options:
I don't know the answer, but we need to make a decision, so thanks for bringing this up again. There are so many advantages of having mapping_ids including:
|
Let’s first clarify what needs to be identified here: mappings proper, or mapping records? For clarity, what I call a mapping proper (or core mapping) is purely a triple {subject, predicate, object} (and optionally a predicate modifier). A mapping record is a mapping proper plus additional metadata (justification, provenance, etc.) – basically, a mapping record is a row in a SSSOM/TSV file. SSSOM primarily deals with mapping records, not core mappings. So, again, what needs to have a unique identifier here: the core mapping, or the mapping record? This needs to be clarified before any further discussion because the answer will dictate what can be done. Notably, the ticket is about identifying individual rows, so it asks for an identifier for mapping records. But Nico’s second idea (constructing an ID out of the subject-predicate-object triple) is only suitable to identify a core mapping (where several records could have the same identifier because they are about the same core mapping), so it would not be applicable if the goal is to identify records. |
For what it’s worth, I think having unique identifiers for core mappings does not make much sense. I believe in most cases it’s records that people will want to uniquely identify. That is, we don’t need an identifier for the statement “the sky is blue“ (core mapping). We might need an identifier for the fact that John Doe (author), on April 9th 2024 (date), on the basis of careful observation (justification), stated that the sky was blue (core mapping). |
Yes I should have clarified this.
Hmmm.. So far I had exactly the opposite sense, because I was thinking about:
On the other hand, I can see why you would want to have an identifier for the whole record:
Grrrrr not an easy one to solve. |
I don’t see how this requires an identifier for the core mapping. Given a mapping set, it’s very easy for an application (such as a mapping browser of some sort) to query the set to get all the mapping records that have the same subject, the same predicate, and the same object. For efficiency purposes the application could build its own pseudo-identifier for each triple (for example by concatenating and hashing the subject-predicate-object triple) and have a cache map associating records to such pseudo-identifiers (to avoid having to always query the set), but that would be an internal implementation detail that doesn’t need to (and, in my opinion, shouldn’t) appear in the model. One reason why such a core mapping identifier shouldn’t be part of the model: it opens the way for possible discrepancies. What’s to prevent someone from misusing the identifier and doing, for example, this:
where both records are about the same core mapping (EXA:123 skos:exactMatch EXB:456) but with a different justification. Yet whoever created the mapping mistakenly attributed two different identifiers, which makes it appear that those two records are about two different core mappings. What should an implementation (or another user, for that matter) do here? Believe the identifiers and treat the two records as being about different core mappings? Or believe the subject-predicate-object triples and treat them as being about the same core mapping? (You could imagine the exact same problem with two records mistakenly having the same “core mapping identifier” despite having, for example, a different predicate. Again, what to do in such a case?) And no, mandating in the spec that the core mapping identifier should be constructed in a predictable way from the SPO triple (for example by concatenating the components of the triple, as in your idea 2) would not solve the problem, because you could always get a set that was incorrectly generated by a bogus implementation. Not having the core mapping identifier in the set at all, but instead letting the applications compute such an identifier internally if they need one, avoids all those problems. |
I also like an identifier for core mapping / mapping proper / triple + mod, encoded e.g. base64. This will give us stable retroactive ids. But for full SSSOM mapping records, I worry about when the data model changes. Can stable, retroactive ids or otherwise, be guaranteed in that case? If not, then maybe prepending the SSSOM version in front of such encoded ids would do the trick. We can do ids for both.
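To illustrate the "internal pseudo-identifier plus cache map" idea described above, here is a minimal sketch. It assumes records are plain dictionaries keyed by SSSOM slot names; the function names `core_mapping_key` and `index_by_core_mapping`, the choice of SHA-256, and the separator are hypothetical illustration choices, not anything defined by the spec or by SSSOM-Py.

```python
import hashlib
from collections import defaultdict


def core_mapping_key(subject_id, predicate_id, object_id):
    """Derive an internal pseudo-identifier from a subject-predicate-object triple."""
    # Join with a control character (assumed not to occur in identifiers)
    # so that different triples cannot produce the same joined string.
    joined = "\x1f".join((subject_id, predicate_id, object_id))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def index_by_core_mapping(records):
    """Build a cache map from pseudo-identifier to the records sharing that triple."""
    index = defaultdict(list)
    for record in records:
        key = core_mapping_key(
            record["subject_id"], record["predicate_id"], record["object_id"]
        )
        index[key].append(record)
    return index


# Hypothetical records: the first two share a core mapping but differ in justification.
records = [
    {"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
     "object_id": "EXB:456", "mapping_justification": "semapv:LexicalMatching"},
    {"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
     "object_id": "EXB:456", "mapping_justification": "semapv:ManualMappingCuration"},
    {"subject_id": "EXA:123", "predicate_id": "skos:broadMatch",
     "object_id": "EXB:789", "mapping_justification": "semapv:LexicalMatching"},
]
index = index_by_core_mapping(records)
```

Because the key never leaves the application, it cannot disagree with the triple it was derived from, which is the discrepancy scenario raised in the comment.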
I don't agree it doesn't solve the problem. We can have Users are always prone to make boo-boos. We can alleviate that by adding this to Finally, I advocate for an official algorithm for this because it makes the ids globally unique / interoperable. That's another reason to have an id on triples/quads too. Interoperability between mapping specs, though concatenation alone may also suffice. |
A core mapping is already stably and uniquely identified by its subject ID, its predicate ID, and its object ID. What would we gain by having yet another identifier that is merely a derivation of those three?
SSSOM-Py is not the only software in the world that has to deal with SSSOM mapping sets. That’s actually why we try to have a standard specification.
OK. Simple question then: do we concatenate the IDs in their CURIE form or in their expanded form? The former will make SSSOM users/developers from the RDF world unhappy, because in RDF there are only IRIs, and having to condense the identifiers to a CURIE form just for the sake of deriving a triple identifier is downright ridiculous. Besides, it means the derived identifier will actually depend on the curie map used to condense the full-length identifier – so much for the stability and the interoperability of the triple identifier. The latter would make more sense, but it will make the SSSOM users coming from the bioinformatics world unhappy, because they usually prefer to deal with CURIEs and the idea of having to expand CURIEs to full-length IRIs just for the sake of deriving a triple identifier will likely seem equally ridiculous to them. By letting the applications that need a triple identifier (e.g. for performance reasons) compute their own internally without ever exposing it to the outside world, we avoid those problems entirely.
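The dependence on the CURIE map can be shown with a small sketch. It assumes a hash-of-concatenation derivation (one possible choice, not a defined algorithm) and two hypothetical prefix maps that differ only in prefix spelling; `expand` and `triple_id` are illustrative helper names.

```python
import hashlib


def expand(curie, prefix_map):
    """Expand a CURIE to a full IRI using the given prefix map."""
    prefix, local = curie.split(":", 1)
    return prefix_map[prefix] + local


def triple_id(subject, predicate, obj):
    """Hypothetical derivation: SHA-256 over the joined identifiers."""
    joined = "\x1f".join((subject, predicate, obj))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two sets describing the same mapping, condensed with different CURIE maps.
map_a = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "DOID": "http://purl.obolibrary.org/obo/DOID_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
map_b = {
    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
    "doid": "http://purl.obolibrary.org/obo/DOID_",
    "SKOS": "http://www.w3.org/2004/02/skos/core#",
}
triple_a = ("MONDO:123456789", "skos:exactMatch", "DOID:123456789")
triple_b = ("mondo:123456789", "SKOS:exactMatch", "doid:123456789")

# Derived from CURIEs: the identifier depends on the prefix spelling.
curie_ids_differ = triple_id(*triple_a) != triple_id(*triple_b)

# Derived from expanded IRIs: both sets yield the same identifier.
iri_ids_match = (
    triple_id(*(expand(c, map_a) for c in triple_a))
    == triple_id(*(expand(c, map_b) for c in triple_b))
)
```

Under these assumptions only the expanded form gives an identifier that is stable across sets, at the cost of forcing CURIE-oriented tools to expand before deriving.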
Well-designed tools and formats try to minimize the potential for boo-boos. |
As for identifiers for mapping records (which is what the author of the ticket asked for, remember):
What do we actually expect from record identifiers? Do we want/need identifiers that depend entirely on the content of the record (like some kind of hash value calculated on the record), or do we want arbitrary identifiers (like serially or randomly generated identifiers)? Let’s say I have this mapping record:
and let’s assume it has an identifier M1 (regardless of how that identifier has been obtained/generated/derived). I use that identifier to annotate a bridging axiom in an ontology as being justified by the existence of this mapping record. Now let’s say that upon refreshing the mapping, we find that the similarity score is no longer 0.87 but 0.91 (maybe because the algorithm of the matching software has changed a bit). Thus we update the mapping record accordingly:
This is no longer the same record, right (the core mapping is the same, but at least one metadata field is different)? Now, question: Should M1 point to that updated record? If yes, it means we want identifiers that are not (or at least, not completely) dependent on the contents of a record. If no, it means we can have identifiers that are entirely derived from the contents of the record, but it raises the question of the usefulness of such identifiers if a single change in a record can make them invalid. |
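The question can be made concrete with a sketch. The record below is hypothetical (it is not the record from the comment), and `content_hash` is one possible content-derived scheme assumed purely for illustration: a SHA-256 digest over a canonical JSON rendering of all fields.

```python
import hashlib
import json


def content_hash(record):
    """Hypothetical content-derived identifier: hash of a canonical JSON dump."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical record; only the similarity score changes between versions.
original = {
    "subject_id": "EXA:123",
    "predicate_id": "skos:exactMatch",
    "object_id": "EXB:456",
    "mapping_justification": "semapv:LexicalMatching",
    "similarity_score": 0.87,
}
updated = dict(original, similarity_score=0.91)

# Content-derived: the identifier changes as soon as one field changes.
hash_changed = content_hash(original) != content_hash(updated)

# Arbitrary: "M1" is assigned once and keeps pointing at the refreshed record.
records_by_id = {"M1": original}
records_by_id["M1"] = updated
```

With the content-derived scheme, an axiom annotated with the old hash no longer resolves after the refresh; with the arbitrary identifier it resolves to the updated record.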
Good point. I agree it should be expanded form.
I was thinking of that too. The main reason I can think of is that it could be less messy / confusing. Consider:
Which fields to include in record-based
|
What is the use case of an IRI pointing to a core mapping? I can understand wanting an IRI pointing to a mapping record, but to a core mapping, I can’t.
No. For now I am actually of the opinion that for record identifiers (leaving aside the question of core mapping identifiers), the identifier (if we do decide to add an identifier field to the spec) should be an opaque string about which the spec makes no assumption at all. Up to the users and/or the implementations to decide how the identifiers should be generated (serially, randomly, or by deriving a hash value from all or some of the records’ fields). The spec should not say anything about that. If we were to specify a way to derive identifiers predictably from the contents of the records, I doubt we could all agree on what should be the expected behaviour, in particular because there may not be a behaviour that is more correct than the others. There will be cases where we will want a record ID to still be valid even if 5 metadata fields have been changed, and cases where we will want a new record ID because a single metadata field has been changed. The spec should not force its view here. |
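The three generation strategies mentioned above (serial, random, hash of some fields) could look like the following sketch. All function names, the `M` prefix, and the `urn:uuid:` form are assumptions made for illustration; under the proposal, the spec would treat whatever string comes out as opaque.

```python
import hashlib
import itertools
import uuid


def make_serial_generator(prefix="M"):
    """Serial identifiers: M1, M2, ... independent of the record contents."""
    counter = itertools.count(1)

    def generate(record):
        return f"{prefix}{next(counter)}"

    return generate


def random_identifier(record):
    """Random identifiers: a fresh UUID each time, independent of the contents."""
    return f"urn:uuid:{uuid.uuid4()}"


def make_hash_generator(fields):
    """Hash identifiers: derived only from the fields the user chooses."""

    def generate(record):
        joined = "\x1f".join(str(record.get(name, "")) for name in fields)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    return generate


# Hypothetical record used to exercise the generators.
record = {"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
          "object_id": "EXB:456", "similarity_score": 0.87}
triple_only = make_hash_generator(("subject_id", "predicate_id", "object_id"))
all_fields = make_hash_generator(tuple(sorted(record)))
```

Choosing which fields feed the hash is exactly the policy decision the comment argues should stay with users: `triple_only` survives a metadata change, while `all_fields` does not.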
Never mind, whether I can see a use for it or not does not matter. The point is, such an identifier can always be derived in a completely predictable manner, so there’s no reason to have a field for it in the data model and the serialization formats. There would be no sense in doing this:
where XXXCOREMAPIDXXX would be some kind of hash computed on MONDO:123456789, skos:exactMatch, and DOID:123456789. This would merely duplicate (in a less readable way!) information that is already present on the rest of the row. Applications can compute the identifiers on the fly if they need them. (Aside: This is also why I dislike the I would have no (strong) objection if the spec defined a standard algorithm to derive such an identifier (instead of letting each application implement its own algorithm as I initially stated), so that all applications automatically derive the same identifiers for the same core mappings. But this has no place in the data model and the data formats. |
Nice discussion! This will take a while to hash out, so I hope we are all ready to be patient. I can't see a single thing being said in this thread that is completely wrong so far, so we need to accept that our judgements are all shaped by the perspective of our own use cases. Two small questions before we move on (please add the appropriate emoticon reaction):
|
Only because you want to do something that the author actually didn’t request. What is requested is a new slot to uniquely identify mapping records (“slot to identify individual rows” – they’re clearly talking about records here, not core mappings). This doesn’t need much elaboration. We can add a new
Apart from enforcing those constraints, the spec would say nothing more about this slot. It will be treated as any other identifier in a set, that is, basically as an opaque string to which no particular meaning should be attached (apart from the fact that it’s a string that can exist in two forms, full-length or CURIEfied – again as any other identifier). It’s up to the users and the implementations to decide:
About the uniqueness constraint Mandatory or optional? |
My use case requires identification of mapping records. I want to convert mappings between JSKOS and SSSOM. Each mapping in JSKOS has a URI (aka |
@gouttegd some clarifications
My example "quad" is showing an example of what an identifier might look like when concatenating the composite key of the "core mapping", or "quad" of subject, predicate, object, modifier. I agree that the question of the value of an ID for core mapping in addition to or instead is valid; good discussion so far on the merits of various options. Readability: hashing vs concatenation
My example was about a core mapping but the same applies (much more so, even) to mapping records. I think that field concatenation, e.g. |
This issue was not meant to be about core mappings, but you might be interested to know that we create identifiers for core mappings as well in our mapping infrastructure:
|
GitHub is really bad for these discussions. The issue has become too long already for any other person to enter the discussion, and it's my fault. Let's forget about identifiers for core mappings for the moment and return to the ask in the OC. Am I seeing it correctly that the primary use case for |
Exactly! |
I think you missed my point. My issue has never been about whether the identifier was made solely by concatenating or by concatenating-and-hashing. My issue is with the basic idea of deriving the identifier from other fields, regardless of how that derivation is made. Whether you merely concatenate the subject-predicate-object triple or you concatenate it and then hash it, it boils down to the same: you end up duplicating the information contained in the triple itself. I don’t mind defining a standard algorithm to derive such a content-dependent value, but I object to having it stored anywhere in the model. It’s useless since it can always be re-generated on-demand, and its presence would cause more harm than good (notably by creating the possibility of discrepancies, and making a set harder to modify by hand since the identifier will need to be updated after any change).
I object even more strongly to using a content-dependent “identifier“ for mapping records. In fact I don’t understand where this idea, that the identifier should be derived from the contents of the record, comes from. Imagine if we were doing the same for OWL classes in an ontology, if the identifier of a class was derived from the definition of the class. You do any change to the class (adding a synonym, a new relationship, amending the textual definition, whatever), and boom, the identifier of the class changes! All references to the old identifier are now invalid! I hope anyone would find that idea ridiculous, so why are we entertaining it when it comes to mapping records?
Again, no objection to defining a similar algorithm in SSSOM, but strong objection to adding a field to store it.
For mapping records, I really don’t see the point of doing that. But similarly, no strong objection to defining an algorithm to do the same for SSSOM records, provided again that we don’t reserve a field to store it. The (Well, if some users want to store in that field an identifier that they generated based on the contents of the record, they would of course be free to do so. But the spec would not mandate that.)
Unpopular opinion: Good old mailing lists are much better, if only because they naturally allow threaded discussions (provided you use a decent email client). But cool kids nowadays don’t want to use email anymore. That being said, if you think this discussion is a long and protracted one, you’ve probably never seen what’s happening on IETF mailing lists! :D |
Hello everyone. I am following the discussion here. I prefer to make a quick comment before getting into more details as it might take more time. The semantic web tells us that as soon as you start to say something about a resource you identify this resource with a URI and it becomes the subject of RDF statements. SSSOM is all about saying something about one specific mapping. This is the role of all the slots/properties defined by SSSOM. |
It looks like you’re talking about a core mapping here: the SSSOM properties that make up a record are to say something about a (core) mapping, so that (core) mapping would need an identifier so that we can talk about it. Is that indeed what you mean? If so, I (again) disagree. If I annotate an axiom in an OWL ontology, I am making a statement about the axiom. But the axiom itself doesn’t have its own identifier. It doesn’t need one, since it is already uniquely identified by its triple {source, property, target} – something that the RDF/XML serialisation makes quite obvious:
<owl:Axiom>
  <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/CL_0000014"/>
  <owl:annotatedProperty rdf:resource="http://www.w3.org/2000/01/rdf-schema#subClassOf"/>
  <owl:annotatedTarget rdf:resource="http://purl.obolibrary.org/obo/CL_0000034"/>
  <oboInOwl:is_inferred>true</oboInOwl:is_inferred>
</owl:Axiom>
So, when the thing you want to make a statement about is nothing more than a combination of identifiers (such as a triple), it’s not obvious that you need yet another identifier to identify that particular combination. So I still don’t see the case for core mapping identifiers. As for mapping record identifiers, I do agree that they are probably a good idea, they have been explicitly requested, and I’ve made a proposal to add a field for it. Can we discuss that? What we would need to agree on here is:
|
After reflection, I think that if we do end up creating a slot to hold a mapping record identifier, it should probably not be called We already have a It is likely that many people would likewise interpret I think it would be best to minimise the potential for such needless confusion. I’d suggest something like |
Just want to highlight; good point. I think this is one of the stronger cons of content-dependent GUIDs. |
But this is not a problem at all if we agree that such GUIDs, if we do define them, must always be computed on the fly and never stored. That is, the spec defines the GUID generation algorithm, and implementations provide a helper method that implements said algorithm (a method that takes a mapping and returns the derived GUID). Applications that for some reason need such a GUID can then easily obtain it, without the GUID ever having to appear in a SSSOM file. And defining such a GUID derivation algorithm does not preclude also having a |
I think this is the best way forward. Note however that:
Is only true for the TSV serialisation - the whole point of this exercise is so that the RDF serialisation does have a resource identifier to satisfy the FAIR principles. I also wonder how to enforce the logical inverse: that the field is never manually set. This may be awkward without custom validation code, and maybe some people will ignore this. Maybe a slightly less brutal approach would be that the identifier must be valid according to the GUID generation algorithm. This can be easily validated and we don't need to police as much if people for some reason wish to share their mappings with that id in it. |
This would only be needed if the content-derived “identifier“ was the only identifier – something that I strongly oppose. If we do create a field to store a record identifier, it should be for an arbitrary identifier that is not automatically generated from the contents – an opaque string, as for (as far as I know) any other identifier in the RDF world. An automatically derived content-dependent “identifier“ (what we have called in the previous messages a “GUID”, though this is misleading as there are other ways to get “globally unique identifiers“ without deriving them from the contents of what they identify) would be an additional feature of the spec, not the only identifier. And we should only define such a GUID if we have a good reason to do so, which for now we don’t – what is the use case for an “identifier“ that changes every time the contents of the record changes? |
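The "validate against the generation algorithm" approach could be sketched as below. No such algorithm has been agreed in this thread, so `derive_guid` is a stand-in (SHA-256 over a canonical JSON dump of every field except the identifier), and `record_id` is a hypothetical placeholder for whatever the slot ends up being called.

```python
import hashlib
import json

ID_FIELD = "record_id"  # hypothetical slot name, not decided in this thread


def derive_guid(record):
    """Stand-in derivation: hash every field except the identifier itself."""
    content = {key: value for key, value in record.items() if key != ID_FIELD}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def invalid_ids(records):
    """Return records whose stored identifier does not match the derived one.

    Records without a stored identifier are accepted, since the value can
    always be recomputed.
    """
    return [
        record
        for record in records
        if ID_FIELD in record and record[ID_FIELD] != derive_guid(record)
    ]


base = {"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
        "object_id": "EXB:456"}
good = dict(base, record_id=derive_guid(base))
# Edited by hand without refreshing the identifier.
stale = dict(good, predicate_id="skos:broadMatch")
unlabelled = dict(base)

problems = invalid_ids([good, stale, unlabelled])
```

A validator along these lines flags the hand-edited record while leaving sets that omit the identifier untouched.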
I bet this has been mentioned before but I did not find the corresponding issue: SSSOM lacks a slot to identify individual rows. This is required to fully map from/to JSKOS Mapping format (#249) but I'm sure there are other use cases as well.