Sanitize languages controlled vocabulary values #10197

Closed
10 changes: 5 additions & 5 deletions scripts/api/data/metadatablocks/citation.tsv
@@ -138,7 +138,7 @@
authorIdentifierScheme DAI 5
authorIdentifierScheme ResearcherID 6
authorIdentifierScheme ScopusID 7
- language Abkhaz 0
+ language Abkhaz 0 abk ab
language Afar 1 aar aa
language Afrikaans 2 afr af
language Akan 3 aka ak
@@ -220,7 +220,7 @@
language Khmer 79 khm km
language Kikuyu, Gikuyu 80 kik ki
language Kinyarwanda 81 kin rw
- language Kyrgyz 82
+ language Kirghiz, Kyrgyz 82 kir ky
Contributor

I'm almost certain that this is not what we want to do if the goal is to have both the "Kirghiz" and "Kyrgyz" spellings accepted as valid. The above would mean that neither spelling is valid by itself, and only the full literal string "Kirghiz, Kyrgyz" is acceptable. So we should make the other spelling an alternate here as well, as in:
language Kyrgyz 82 kir ky Kirghiz

Just as I typed this, I realized that we already have quite a few such comma-separated entries in the block ("Navajo, Navaho", etc.).

We will need to fix them all, and there is no way to do that other than via a database update.

Contributor

(To correct myself, there is definitely a way to address this without a Flyway update. For example, we can keep "Navajo, Navaho" as the main name, but add each of the two spellings as a separate alternate, as in:
language Navajo, Navaho 109 nav nv Navajo Navaho)
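
The matching behavior described in these two comments can be illustrated with a minimal Python sketch. It is not Dataverse code: the helper names `parse_row` and `accepted_values` are hypothetical, the column order (field name, main name, display order, then alternates) is assumed from the rows shown in this diff, and exact-string matching is assumed from the reviewer's description.

```python
def parse_row(line):
    """Split one tab-separated vocabulary row into (main_name, alternates).

    Assumed column order: field, main name, display order, alternates...
    """
    _field, main_name, _order, *alternates = line.rstrip("\n").split("\t")
    return main_name, alternates


def accepted_values(line):
    """Return every string that would match this row under exact matching."""
    main_name, alternates = parse_row(line)
    return {main_name, *alternates}


# The row as proposed in the pull request: one combined main name.
combined = "language\tKirghiz, Kyrgyz\t82\tkir\tky"
# The reviewer's first suggestion: keep "Kyrgyz", add "Kirghiz" as an alternate.
with_alternate = "language\tKyrgyz\t82\tkir\tky\tKirghiz"
# The reviewer's second suggestion: keep the combined name, list each spelling.
both_spellings = "language\tNavajo, Navaho\t109\tnav\tnv\tNavajo\tNavaho"

print("Kyrgyz" in accepted_values(combined))        # False
print("Kirghiz" in accepted_values(with_alternate)) # True
print("Navaho" in accepted_values(both_spellings))  # True
```

Under these assumptions, a combined main name only matches the full literal string, while listing each spelling as its own alternate makes every spelling match on its own.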

language Komi 83 kom kv
language Kongo 84 kon kg
language Korean 85 kor ko
@@ -249,7 +249,7 @@
language Nauru 108 nau na
language Navajo, Navaho 109 nav nv
language Northern Ndebele 110 nde nd
- language Nepali 111 nep ne
+ language Nepali (macrolanguage) 111 nep ne
language Ndonga 112 ndo ng
language Norwegian Bokmål 113 nob nb
language Norwegian Nynorsk 114 nno nn
@@ -284,12 +284,12 @@
language Shona 143 sna sn
language Sinhala, Sinhalese 144 sin si
language Slovak 145 slk slo sk
- language Slovene 146 slv sl
Contributor

@landreev, Feb 12, 2024

What's the rationale for dropping "Slovene" altogether? ISO 639-2 lists both, and it looks like "Slovene" may still be the preferred name scientifically (the Wikipedia article lists it first, for example: https://en.wikipedia.org/wiki/Slovene_language). I'll just keep "Slovene" as one of the alternate forms.

Contributor Author

We proposed replacing the main name, but yes, it is possible to add an alternate name instead.
These documents also list "Slovene" as a secondary name:

https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

Contributor

Any chance we could leave "Slovene" as the main name, and simply add "Slovenian" as an alternate? - i.e., have this in citation.tsv:
language Slovene 146 slv sl Slovenian

The end result will be the same, both names will be valid and acceptable. It's just that changing the main name makes the block update so much more complicated (the block update API gets confused, so a Flyway database update becomes necessary).
I have a couple of similar questions about the other fields.
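
To make the distinction in this comment concrete, here is a small Python sketch that classifies a row change as either an alternates-only change or a main-name change. The function `classify_change` is hypothetical and not part of Dataverse; it assumes the same tab-separated column order as the rows in this diff (field name, main name, display order, then alternates), and it only labels the change. Whether a Flyway script is actually needed is the reviewer's call, as described above.

```python
def classify_change(old_row, new_row):
    """Label how a vocabulary row changed between two versions of the TSV.

    Assumed column order: field, main name, display order, alternates...
    """
    old_field, old_name, old_order, *old_alts = old_row.split("\t")
    new_field, new_name, new_order, *new_alts = new_row.split("\t")

    # Rows are paired by field name and display order in this sketch.
    if (old_field, old_order) != (new_field, new_order):
        return "different entry"
    if old_name != new_name:
        return "main name changed"
    if set(old_alts) != set(new_alts):
        return "alternates changed"
    return "unchanged"


old = "language\tSlovene\t146\tslv\tsl"
renamed = "language\tSlovenian\t146\tslv\tsl"
extended = "language\tSlovene\t146\tslv\tsl\tSlovenian"

print(classify_change(old, renamed))   # main name changed
print(classify_change(old, extended))  # alternates changed
```

In this sketch, the pull request's Slovene change falls into the first category, while the variant suggested in the comment above falls into the second.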

Contributor

To be clear, if there is an objective reason to want to change the main entry for a specific language, it's not that big of a problem to add a Flyway script to the release.

+ language Slovenian 146 slv sl
language Somali 147 som so
language Southern Sotho 148 sot st
language Spanish, Castilian 149 spa es
language Sundanese 150 sun su
- language Swahili 151 swa sw
+ language Swahili (macrolanguage) 151 swa sw
language Swati 152 ssw ss
language Swedish 153 swe sv
language Tamil 154 tam ta