[New metadata element]: status #345

Open
dr-shorthair opened this issue Jan 17, 2024 · 15 comments

@dr-shorthair

dr-shorthair commented Jan 17, 2024

Element id (e.g. creator_id, mapping_tool_version):

status

Value data type (e.g. URI, URL, text, xsd:boolean):

URI

Description
Indicate the status of this individual mapping. This allows a submission/review/publication/retirement lifecycle to be tracked using a set of mappings with the same subject/predicate/object.

Standard status values should be established for SSSOM. I suggest:

  • draft [1..*]
  • submitted
  • reviewed [1..*]
  • accepted
  • published
  • withdrawn
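
As a rough illustration of how a client might represent the suggested values, here is a minimal Python sketch. It assumes the `status:` prefix and the `https://w3id.org/sssom/status/` namespace used in the example file in this issue; neither is an established SSSOM vocabulary, and the names `MappingStatus` and `parse_status` are hypothetical.

```python
from enum import Enum

# Assumption: namespace taken from the `status` entry of the curie_map in
# the example file in this issue; it is a proposal, not a published vocabulary.
STATUS_NS = "https://w3id.org/sssom/status/"


class MappingStatus(Enum):
    """The six status values suggested in this issue."""

    DRAFT = "draft"
    SUBMITTED = "submitted"
    REVIEWED = "reviewed"
    ACCEPTED = "accepted"
    PUBLISHED = "published"
    WITHDRAWN = "withdrawn"

    @property
    def curie(self) -> str:
        return f"status:{self.value}"

    @property
    def uri(self) -> str:
        return STATUS_NS + self.value


def parse_status(curie: str) -> MappingStatus:
    """Turn a CURIE such as 'status:reviewed' into a MappingStatus member."""
    prefix, _, local = curie.partition(":")
    if prefix != "status":
        raise ValueError(f"unexpected prefix in {curie!r}")
    return MappingStatus(local)  # raises ValueError for unknown values


print(parse_status("status:reviewed").uri)
```

Rejecting unknown values at parse time is one way to keep the enumeration closed; whether SSSOM should close it is exactly what the list above leaves open for discussion.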

Example description.

Complete example to a SSSOM file with this element

creator_id: https://orcid.org/0000-0002-3884-3420
curie_map:
   owl: http://www.w3.org/2002/07/owl#
   rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
   rdfs: http://www.w3.org/2000/01/rdf-schema#
   semapv: https://w3id.org/semapv/vocab/
   skos: http://www.w3.org/2004/02/skos/core#
   sssom: https://w3id.org/sssom/
   orcid: https://orcid.org/
   status: https://w3id.org/sssom/status/
   get: https://global-ecosystems.org/explore/
   alum8: http://www.neii.gov.au/def/voc/ACLUMP/australian-land-use-and-management-classification/
license: https://creativecommons.org/publicdomain/zero/1.0/
mapping_set_id: http://w3id.org/env/crosswalk/alum2get/2024-01-13
subject_id predicate_id object_id mapping_justification author_id mapping_date status comment
alum8:Marsh-or-wetlandsaline skos:exactMatch get:groups/MFT1.3 semapv:ManualMappingCuration orcid:0009-0001-6090-9959 2023-12-01 status:submitted
alum8:Marsh-or-wetlandsaline skos:exactMatch get:groups/MFT1.3 semapv:ManualMappingCuration orcid:0000-0002-2934-5497 2024-01-03 status:reviewed Looks Ok but let's check with Piers
alum8:Marsh-or-wetlandsaline skos:exactMatch get:groups/MFT1.3 semapv:ManualMappingCuration orcid:0000-0002-2568-59457 2024-01-10 status:reviewed approved
alum8:Marsh-or-wetlandsaline skos:exactMatch get:groups/MFT1.3 semapv:ManualMappingCuration orcid:0000-0002-2934-5497 2024-01-13 status:published
@dr-shorthair
Author

dr-shorthair commented Jan 17, 2024

triggered by #309

Note that I have used author_id and mapping_date for all stages in lieu of more generic terms.

@gouttegd
Contributor

I like it.

I agree about the re-use of mapping_date instead of creating a generic date slot – it makes sense if you consider that the slot is about the record (the “row”) rather than the mapping itself.

I also agree that collapsing author_id, creator_id, and reviewer_id into a single agent_id (as proposed in the initial discussion in #309) would be too much of a breaking change given the existing, already in-use schema.

@gouttegd
Contributor

Of note, though, regarding your example: the mapping_justification should be repeated for every record. A record should always be usable on its own, without having to look at other records to figure out what the justification was.

@matentzn
Collaborator

I like it as well. I am wondering if we should consider a more descriptive name like mapping_status or similar?

@dr-shorthair would you like to join the SSSOM developer team and make the PR towards this field yourself? Do you have any experience with LinkML (no problem if not, I can hold hands).

@gouttegd
Contributor

The list of possible values for the status slot should probably be discussed a bit more with other people interested, though.

In particular, I’m not sure about the need for a published value or the exact meaning it should have. A mapping is “published” when the mapping set it belongs to is itself published.

What would it mean, then, to have a mapping with a status other than published in a published set? Does that mean that the mapping should not actually be considered as “published”? Does that in turn mean that I (as a consumer of the mapping set) should ignore that mapping when I am using the set to do whatever I need to do with it?

@dr-shorthair
Author

dr-shorthair commented Jan 21, 2024

@gouttegd I propose using published to indicate that the Mapping has completed the submission and review process, and is accepted as 'correct' by the gatekeepers designated by the community of interest. The only possible lifecycle stage after this is 'withdrawn' or 'retired'.

When you are looking at a mapping_set, for each set of mappings with a specific (subject_id, predicate_id, object_id) the valid status is the one with the latest mapping_date.
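
The rule just described can be sketched in a few lines of Python. This is only an illustration of the proposal under the assumption that records are available as dictionaries keyed by SSSOM column name; the function name `latest_per_mapping` is hypothetical, and the rows are abridged from the example file at the top of this issue.

```python
from datetime import date


def latest_per_mapping(records):
    """Keep, for each (subject_id, predicate_id, object_id), the newest record."""
    latest = {}
    for rec in records:
        key = (rec["subject_id"], rec["predicate_id"], rec["object_id"])
        kept = latest.get(key)
        # Parse the dates so that malformed values fail loudly.
        if kept is None or (
            date.fromisoformat(rec["mapping_date"])
            > date.fromisoformat(kept["mapping_date"])
        ):
            latest[key] = rec
    return list(latest.values())


# Rows abridged from the example file in this issue.
TRIPLE = {
    "subject_id": "alum8:Marsh-or-wetlandsaline",
    "predicate_id": "skos:exactMatch",
    "object_id": "get:groups/MFT1.3",
}
records = [
    {**TRIPLE, "mapping_date": "2023-12-01", "status": "status:submitted"},
    {**TRIPLE, "mapping_date": "2024-01-03", "status": "status:reviewed"},
    {**TRIPLE, "mapping_date": "2024-01-10", "status": "status:reviewed"},
    {**TRIPLE, "mapping_date": "2024-01-13", "status": "status:published"},
]

current = latest_per_mapping(records)
print(current[0]["status"])  # status:published
```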

@dr-shorthair
Author

> I like it as well. I am wondering if we should consider a more descriptive name like mapping_status or similar?

Fine by me.

> @dr-shorthair would you like to join the SSSOM developer team and make the PR towards this field yourself? Do you have any experience with LinkML (no problem if not, I can hold hands).

I have no experience with LinkML.

@dr-shorthair
Author

> Of note, though, regarding your example: the mapping_justification should be repeated for every record. A record should always be usable on its own, without having to look at other records to figure out what the justification was.

OK - I can buy that. Corrected above.

@gouttegd
Contributor

gouttegd commented Jan 21, 2024

> I propose using published to indicate that the Mapping has completed the submission and review process, and is accepted as 'correct' by the gatekeepers designated by the community of interest. The only possible lifecycle stage after this is 'withdrawn' or 'retired'.

But why would you want to include all the different steps of the submission and review process in the published mapping set?

Transparency and traceability are nice, but I don’t think it should go that far. Recording the history of a new mapping is the role of a version control system; the entire history doesn’t need to appear in the published set.

> When you are looking at a mapping_set, for each set of mappings with a specific (subject_id, predicate_id, object_id) the valid status is the one with the latest mapping_date.

This is quite a breaking change. Currently, it is considered perfectly normal for a mapping set to have more than one record (one row) for any given mapping, because any given mapping can have multiple justifications (e.g., one asserted by manual curation, one asserted by semantic similarity, another asserted by semantic similarity but with a different tool, etc.).

Under your proposal, only the last record would be considered valid (previous records being the history of the last record), so it would no longer be possible to have several justifications for one mapping.
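
The conflict can be made concrete with a small Python sketch. The two rows below are hypothetical (the dates and the second justification are invented for illustration), and `latest_by` is a hypothetical helper. Keying on the triple alone applies the "latest date wins" rule; keying on the justification as well is shown only for contrast and is not something proposed in this thread.

```python
def latest_by(records, key_fields):
    """Keep the newest record for each distinct combination of key_fields."""
    latest = {}
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        # ISO 8601 dates (YYYY-MM-DD) order correctly as plain strings.
        if key not in latest or rec["mapping_date"] > latest[key]["mapping_date"]:
            latest[key] = rec
    return list(latest.values())


TRIPLE_FIELDS = ("subject_id", "predicate_id", "object_id")

# Hypothetical rows: one mapping asserted with two different justifications.
TRIPLE = {
    "subject_id": "alum8:Marsh-or-wetlandsaline",
    "predicate_id": "skos:exactMatch",
    "object_id": "get:groups/MFT1.3",
}
rows = [
    {**TRIPLE,
     "mapping_justification": "semapv:ManualMappingCuration",
     "mapping_date": "2024-01-10"},
    {**TRIPLE,
     "mapping_justification": "semapv:SemanticSimilarityThresholdMatching",
     "mapping_date": "2024-01-13"},
]

by_triple = latest_by(rows, TRIPLE_FIELDS)
by_triple_and_justification = latest_by(
    rows, TRIPLE_FIELDS + ("mapping_justification",)
)

print(len(by_triple))                    # 1: the manual curation row is dropped
print(len(by_triple_and_justification))  # 2: both justifications survive
```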

@dr-shorthair
Author

reviewer_id and publication_date in SSSOM establish the precedent that some provenance and lifecycle information is expected in the mapping_set, separately from any information in a vcs. The subject-matter experts doing the mapping and review, etc, can cope with a tabular representation but are typically not familiar with Git, etc.

However, the current SSSOM element usage is ambiguous and far from complete. If we were to try to keep such information in a single mapping record, then a much wider record would be required.

The proposal is for a more scalable approach, using multiple records per mapping, as suggested by @gouttegd, where I'm adding a status element.

But there has to be some way to know which records are valid for use. This must involve some combination of status and date. I might not have it quite right, but that is my rationale for trying.

@matentzn
Collaborator

> But there has to be some way to know which records are valid for use. This must involve some combination of status and date. I might not have it quite right, but that is my rationale for trying.

Starting to work on your proposal now; my stance on this specific question is that it is out of scope for SSSOM - we provide the metadata for the client to make the decision about "which records are valid", and the mapping set provider can, simply, publish a mapping set which only contains the valid mappings.
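
To illustrate what leaving the decision to the client could look like, here is a minimal Python sketch. It assumes a `mapping_status` column holding `status:` CURIEs as in the example file; the slot name was still under discussion at this point, and the function name and the second row are hypothetical.

```python
def select_by_status(records, accepted_statuses):
    """Return only the records whose mapping_status the client accepts."""
    return [
        rec for rec in records
        if rec.get("mapping_status") in accepted_statuses
    ]


# Hypothetical rows; only the first triple comes from the example file.
rows = [
    {"subject_id": "alum8:Marsh-or-wetlandsaline",
     "predicate_id": "skos:exactMatch",
     "object_id": "get:groups/MFT1.3",
     "mapping_status": "status:published"},
    {"subject_id": "alum8:Hypothetical-term",
     "predicate_id": "skos:closeMatch",
     "object_id": "get:groups/MFT1.3",
     "mapping_status": "status:draft"},
]

# Each client chooses its own policy; this one trusts only published mappings.
usable = select_by_status(rows, {"status:published"})
print(len(usable))  # 1
```

Records without a `mapping_status` value are excluded here; a client that prefers to treat missing values as acceptable would need a different policy.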

Stay tuned on a basic PR to continue the discussion.

@matentzn
Collaborator

Here is a first PR to address the issue: #347. Let's be strict about the implementation and get this right.

@dr-shorthair
Author

Thanks @matentzn .

Perhaps the use-case I sketched above is out of scope for SSSOM. I was led toward the multiple-rows pattern by this comment from @gouttegd - but maybe I extrapolated too far. While there may be local applications of SSSOM that head in this direction, they do not need to be part of the SSSOM canon until there is more testing.

Nevertheless, I think the narrow proposal, for a mapping_status slot (better label than status, I agree) still has merit.

I find the need for a status flag in almost every bit of design work that I do, sooner or often later, and I'm forever scratching around for a suitable slot and enumeration to fit. It feels like one of those basic elements that should always be added early, like license and creator and creation_date.

@gouttegd
Contributor

gouttegd commented Feb 1, 2024

I was indeed initially favourable to the idea – until I started thinking about the practical implications.

Perhaps you can clarify one thing: is your use case about storing the history of a mapping record (a), or about storing its current status (b)?

In other words:

  • (a) Can a single mapping set contain at any given time several records for the same mapping but with a different status, representing how the record evolved over time (from initial submission to review(s), approval, and publication)?
  • (b) Or should a set contain only one record representing the current status of the record, whatever that status is at the time when the mapping set is compiled?

Your initial example clearly suggests (a), but maybe this was just for the example?

I don’t think I have any concerns with (b). But I do have concerns if the use case is about (a). I believe managing history is out of scope for a Simple Standard for Sharing Ontological Mappings, and it opens several thorny questions that must be addressed if we are ever to do that.

@dr-shorthair
Author

dr-shorthair commented Feb 2, 2024

I accept that external consumers only want to see the 'current' state of the mapping - i.e. one row per mapping, use-case (b). Having a mapping_status flag to tell the user what that status is remains important.

(Behind the scenes, we might persist with some form of (a). This would be mostly so that we can leverage the rigour of SSSOM - i.e. predicate, justification, confidence etc - without asking non-technical subject-matter experts to step outside of the tabular/spreadsheet paradigm. Keeping it all within one artefact - the spreadsheet - is important for the user-community I am working with. We can then filter on the latest-date prior to making it visible externally. But we can write our own rules for all that. )


3 participants