Possible improvements for template and converter #124

Open
2 of 20 tasks
dalito opened this issue Jul 5, 2023 · 16 comments
Labels
breaking Changes breaking backward compatibility discussion This needs more discussion
Milestone

Comments

@dalito
Member

dalito commented Jul 5, 2023

Taken over from nfdi4cat/VocExcel#2

Template

This is for discussing possible future structural changes. This is not urgent but may serve as a checklist to review before the next big template-version step (in descending priority):

  • Get rid of the "Additional Concept Features" sheet, because it is hard to work with only numeric IDs without knowing the preferred labels. The columns could be moved to the Concepts sheet. Solved in another way in "Idea: Show IRIs with optional label in xlsx IRI-columns" (#253), already released as part of v0.9.0.
  • The size of a collection is currently constrained by the maximum number of characters in an xlsx cell (32767 chars). Depending on the IRI length, this may correspond to a few thousand concepts per collection (which is probably large enough for most applications). However, the UX for adding or editing membership in a collection is rather poor. It would be much nicer if membership could be marked in the concept sheet. This could be done by adding a memberOf column to the "Concept" sheet that lists all collection URIs the concept is a member of. Then Excel filters could also be used efficiently to edit/review membership of Concepts in Collections. If we change the template like this, the limit on the number of concepts per collection goes away.
  • Support for skos:altLabel in multiple languages. Currently it is assumed that all (comma-separated) altLabels are always given in the default language "en".
  • Support different languages for collection prefLabel.
  • Make multi-language data entry easier by specifying just one language per line in the concept sheet.
  • Put version of template to a better place and add information about min. version of voc4cat required.
  • Use a different separator than comma in xlsx cells. Users often use a comma as part of the text and/or use a semicolon as separator because they are used to semicolons from Excel formulas. It is suggested to separate URLs from other URLs or text by space and/or newline. For pure text fields (alternate label), we should consider using a vertical bar | as separator.
  • Add a notes/feedback column (skos:editorialNote) for editorial purposes. It could be used by tools (or humans) to add notes which are relevant for editing the concept/collection. This column may also be used by checking tools.
  • Change the provenance column to a skos:changeNote column to store change notes including date and author. Allow multiple lines, each with <date> <gh-name> <change-note-text>. This structure will be validated so that correct DC:provenance data can be created for each concept & collection. (related: dcterms:provenance - Correctly used? #122)
  • Support two ways of giving credit to used sources:
    • (i) verbatim copies; the source should be entered in columns "Source Vocab" (dct:source), "Source Vocab license", "Source Vocab Rights holder"
    • (ii) definitions influenced by other sources; these should be entered in "Influenced by IRIs" column.
  • Add a status column with states proposed/accepted/obsolete. Auto-generate skos:historyNote with date & state upon change. Suggested states to track: created, obsoleted because ... (see next point). This information will not be present in Excel but only in Turtle.
  • Add column for reason of obsoletion and provide pre-defined reasons for obsoletion (inspired by https://wiki.geneontology.org/index.php/Obsoleting_an_Existing_Ontology_Term)
    • The term is not clearly defined and usage has been inconsistent.
    • This term was added in error.
    • More specific terms were created.
    • This term was converted to a collection.
    • The meaning of the term is ambiguous.
    • There is no evidence that this function/process/component exists.
  • (maybe) Use tables instead of hard-coded cell-positions and sheet names. Tables can be found independently of their cell position and "home" sheet. This would give users more flexibility to adjust the layout. (previously suggested here)
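Two of the items above describe concrete cell formats: the proposed separators (space/newline between IRIs, vertical bar between text entries) and the one-note-per-line skos:changeNote structure. A minimal Python sketch of how such cells might be read is given here. The function names are hypothetical and not part of voc4cat-tool, the ISO date format (YYYY-MM-DD) is an assumption because the list only says <date>, and the sample change note is invented for illustration.

```python
import re
from datetime import date


def split_iris(cell: str) -> list[str]:
    """Split a cell of IRIs separated by spaces and/or newlines."""
    # str.split() without arguments splits on any run of whitespace.
    return cell.split()


def split_labels(cell: str) -> list[str]:
    """Split a pure-text cell on the vertical bar, trimming each entry."""
    return [part.strip() for part in cell.split("|") if part.strip()]


# Assumed line shape: "<date> <gh-name> <change-note-text>" with an ISO date.
CHANGE_NOTE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\S+)\s+(.+)$")


def parse_change_notes(cell: str) -> list[tuple[date, str, str]]:
    """Validate each non-empty line and return (date, author, text) tuples."""
    notes = []
    for line in cell.splitlines():
        line = line.strip()
        if not line:
            continue
        match = CHANGE_NOTE.match(line)
        if match is None:
            raise ValueError(f"Malformed change note: {line!r}")
        day, author, text = match.groups()
        # date.fromisoformat raises ValueError for impossible dates.
        notes.append((date.fromisoformat(day), author, text))
    return notes


print(split_iris("https://example.org/0000105\nhttps://example.org/0000106 https://example.org/0000107"))
print(split_labels("IR, infrared | infra-red"))
print(parse_change_notes("2024-02-10 dalito Reworded definition"))
```

With the vertical bar as separator, a comma inside a label ("IR, infrared") survives as part of the text, which is the situation the separator item wants to handle.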

Converter

  • All users/editors of a vocabulary should use the same prefixes. That each user can adjust prefixes individually via the Prefixes sheet does not make sense. Instead the prefix sheet should be read-only. The shared prefixes for a vocabulary can already be defined in idranges.toml. Prefix-sheet was made read-only in Fix and improve handling of prefixes from config #263, released in v0.9.0.
  • The user should never change anything in the concept scheme sheet of the template. So the sheet should just be created as an info page but never read. To realize this, we need to extend the vocabulary configuration file. Some additional fields should be added (e.g. homepage-URL or issue-tracker-URL).
  • (maybe) Output SKOS-XL. Then a unique ID for each translation allows to make statements on the translated concept, e.g. about provenance of the translation.
  • (maybe) Support skos:OrderedCollection
  • (maybe) Support not-yet supported SKOS relations like skos:broaderTransitive, skos:narrowerTransitive

Profile

  • Allow prefLabel in multiple languages (see vocexcel#1). We probably need our own SHACL vocabulary profile.
  • Several changes suggested above, e.g. the use of skos:notes, require profile changes.
dalito added the "discussion" and "breaking" labels Jul 5, 2023
dalito added this to the 1.0.0 milestone Jul 15, 2023
@dalito
Member Author

dalito commented Aug 4, 2023

Here is a draft for a new xlsx template structure that includes all the changes proposed above and below.

Note that the help sheet was not yet updated. Sheets Help, Version, and About will be removed from the final template.


Previous versions:
Here is a draft for a new xlsx template structure that includes all the changes proposed above.

@dalito
Member Author

dalito commented Aug 31, 2023

The proposed new template (2nd draft) cannot express that a skos:Collection may have not only concepts but also other collections as skos:member.

Note that the 0.4.3 template could not express collection_A memberOf collection_B either.

@markdoerr
Contributor

I like esp. the first three items on the list, @dalito :)

@markdoerr
Contributor

Hi @dalito,

regarding the collection item in the checklist above:
I would suggest the following process:

  1. collections (IRI and name) are registered in the "collections" tab
  2. in the "concept" tab each collection gets its own column (column name must match the collection name)
  3. to add a concept to a collection, simply a cross ("X") needs to be set in the corresponding column
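The three steps above could be read back into collection memberships roughly as follows. This is only a sketch under stated assumptions: rows are plain dictionaries rather than real worksheet rows, the column header "Concept IRI" and the collection names are made up for the example, and treating a lower-case "x" like "X" is a choice made here.

```python
def collect_members(rows, collection_names):
    """Map each collection name to the concept IRIs marked with an X."""
    members = {name: [] for name in collection_names}
    for row in rows:
        for name in collection_names:
            # Empty cells may arrive as None or as an empty string.
            mark = str(row.get(name) or "").strip()
            if mark.lower() == "x":
                members[name].append(row["Concept IRI"])
    return members


# Hypothetical concept rows with one column per registered collection.
rows = [
    {"Concept IRI": "https://example.org/0000105", "Spectral ranges": "X", "Examples": None},
    {"Concept IRI": "https://example.org/0000106", "Spectral ranges": "x", "Examples": "X"},
    {"Concept IRI": "https://example.org/0000107", "Spectral ranges": "", "Examples": ""},
]
members = collect_members(rows, ["Spectral ranges", "Examples"])
print(members)
```

Because each membership is one cell, the per-cell character limit no longer bounds the size of a collection; the trade-off is one extra column per collection.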

@dalito
Member Author

dalito commented Dec 13, 2023

Like the green column Q of the example file (2nd draft)? There would be one column per collection in concept sheet. I only added a single column to show the idea.

@dalito
Member Author

dalito commented Dec 13, 2023

Collection in collection would be modeled in the collections sheet, just like narrower is modeled for concepts in the concept sheet. This is not yet in the 2nd draft IIRC.

@markdoerr
Contributor

Like the green column Q of the example file (2nd draft)? There would be one column per collection in concept sheet. I only added a single column to show the idea.

yes, @dalito, and then in the cells the user just adds an "X" (small or capital should be allowed) - boolean would be nicer, but most non-programmers are not so familiar with this concept of True and False ;)

@markdoerr
Contributor

Hi @dalito,
here are some further usability improvement suggestions:
Children IRIs should be referenced by preferred Label (this is more readable and less error prone).

If possible, I would omit the Concept IRI from the Concept tab completely. The Concept IRIs with the right padding could be automatically generated by the CI pipeline. As numbering for the IRIs one could then just use the line numbers of the Excel sheet. That would simplify the sheet.

@dalito
Member Author

dalito commented Dec 13, 2023

"Children IRIs should be referenced by preferred Label"

This easily breaks if a label is changed at one place but another is forgotten. In the past there were many problems with misspellings, case, white space or separator use. IDs are the solution to this.

It is possible to use indentation for expressing broader/narrower hierarchy between concepts. This requires a local install of voc4cat-tool. I would suggest installing pipx and then using it to install voc4cat-tool with pipx install voc4cat. To get help on the transformation to/from indentation, run voc4cat transform --help.

@dalito
Member Author

dalito commented Feb 9, 2024

@markdoerr In the childrenIRI field we could perhaps append the preferred label after each IRI. The label would just be present for convenience but would be stripped off when reading.

https://example.org/0000105 (infrared)
https://example.org/0000106 (visible)
https://example.org/0000107 (ultraviolet)
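
Stripping the convenience label on reading could look roughly like this. It is a sketch, not the voc4cat-tool implementation: the function name is hypothetical, and it assumes one IRI per line with the optional label in parentheses at the end of the line.

```python
import re

# An IRI (no whitespace), optionally followed by "(label)" at the end of the line.
IRI_WITH_LABEL = re.compile(r"^\s*(\S+)(?:\s+\((.*)\))?\s*$")


def read_iris(cell: str) -> list[str]:
    """Return only the IRIs from a cell; labels in parentheses are dropped."""
    iris = []
    for line in cell.splitlines():
        if not line.strip():
            continue
        match = IRI_WITH_LABEL.match(line)
        if match is None:
            raise ValueError(f"Cannot parse IRI line: {line!r}")
        iris.append(match.group(1))
    return iris


cell = (
    "https://example.org/0000105 (infrared)\n"
    "https://example.org/0000106 (visible)\n"
    "https://example.org/0000107"
)
print(read_iris(cell))
```

Since the label is discarded, a stale or misspelled label cannot break the hierarchy; only the IRI is used.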

@dalito
Member Author

dalito commented Feb 10, 2024

I updated the first message and put a new (3rd) draft for the template "1.0" in the 2nd message, which addresses all issues/ideas that came up until now.

@markdoerr
Contributor

Thanks @dalito,
sounds good, I will have a look ...

@dalito
Member Author

dalito commented Jan 7, 2025

Julia @schumannj proposed in nfdi4cat/voc4cat#113 to replace ChildrenIRI by ParentIRI. For vocabularies where most concepts have parents (like in voc4cat) this makes a lot of sense. But flat concept schemes, in which most concepts are top-level concepts and only a few narrower concepts exist, are better expressed with ChildrenIRIs as it is now.

So should it be configurable at vocabulary level to use either or? Is the added complexity justified?
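
For weighing the added complexity, it may help to note that both column styles carry the same information and can be converted into each other. A small sketch of the parent-to-children direction, with a hypothetical function name and made-up example IRIs (0000100 is an invented parent); each concept maps to a list of parent IRIs so that more than one broader concept would also work:

```python
from collections import defaultdict


def children_from_parents(parents_of: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a concept -> parent IRIs mapping into parent -> children IRIs."""
    children = defaultdict(list)
    for concept, parents in parents_of.items():
        for parent in parents:
            children[parent].append(concept)
    return dict(children)


# Hypothetical ParentIRI-style input: top-level concepts have no parents.
parents_of = {
    "https://example.org/0000100": [],
    "https://example.org/0000105": ["https://example.org/0000100"],
    "https://example.org/0000106": ["https://example.org/0000100"],
}
print(children_from_parents(parents_of))
```

If the converter normalized to one internal form like this, the choice between ParentIRI and ChildrenIRI would mostly affect the template and its reader, not the rest of the pipeline.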

@markdoerr
Contributor

I personally prefer the parent relation in building hierarchies, because each child can have at most one parent (which is simpler than parents having multiple children - like in real life ;) ). For flat hierarchies, the difference is of course not big, as @dalito pointed out. But will we stay "flat" in the future? Since one never knows, I would opt for the simpler (=ParentIRI) solution suggested by @schumannj, since in the worst case one would only need to add one parent to a child (poor child).

@dalito dalito moved this from New to Backlog (>2 weeks) in Voc4Cat cross-repo view Jan 23, 2025
@dalito
Member Author

dalito commented Jan 26, 2025

Following the thinking that entering broader (parent) is easier than narrower (children), we should also add a column "member of" to the concept sheet in order to enter membership in collections directly at the concept instead of the current way of adding all collection members in the collection sheet.

With these two changes, I am no longer convinced that the original idea of merging the "Concepts", "Additional Concept Features", and "Collections" sheets into one sheet should be pursued. Instead, I suggest keeping the 3-table split, because:

  • It matches with how people contribute. Typically you don't add concepts, new mappings and new collections all at once. But if the contributions are split (and they indeed have been split in voc4cat), then having one huge sheet will be less convenient to work with than 3 small sheets.
  • Keeping 3 sheets would allow us to take a gentler approach to the goal of a significantly improved xlsx template. Both the code and the user experience could change incrementally avoiding problematic breaking changes.

Update: I made a new (5th) draft for the next xlsx template above that integrates these changes.

@nmoust

nmoust commented Jan 27, 2025

On a slightly related topic: A single Excel cell fits 32767 characters. A URI in voc4cat is 41 characters long. If we add a comma and a blank space between the URIs, a Collection or a Children cell can fit up to 762 members (780 without the blank space).
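
The arithmetic can be checked with a few lines of Python. This sketch only uses the numbers from the comment (32767 characters per cell, 41 characters per URI) and assumes the last URI in a cell needs no trailing separator; the function name is hypothetical.

```python
CELL_LIMIT = 32767  # maximum number of characters in one xlsx cell
URI_LENGTH = 41     # length of a voc4cat URI as stated above


def max_members(separator: str) -> int:
    """Largest n such that n URIs joined by `separator` fit into one cell."""
    # n URIs need n * URI_LENGTH + (n - 1) * len(separator) characters.
    per_item = URI_LENGTH + len(separator)
    return (CELL_LIMIT + len(separator)) // per_item


print(max_members(", "))  # comma plus blank space
print(max_members(","))   # comma only
```

A single space or newline as separator, as proposed in the first message, gives the same capacity as the comma-only case because the separator length is the same.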
