Sanitize languages controlled vocabulary values #10197

stevenferey · 2023-12-20T16:33:00Z

What this PR does / why we need it:

This is a first proposal, open to suggestions, meant to settle the desired modifications before work starts on the Flyway script.

Which issue(s) this PR closes:

Closes #8243

Special notes for your reviewer:

Provide your suggestions for modifications directly in the PR review

Additional documentation:

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Languages/List_of_ISO_639-3_language_codes_(2019)

https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

DS-INRAE · 2023-12-20T16:45:16Z

This PR's content can be used as a basis for discussing the following issue (which has been taken into account in the PR):

Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

landreev · 2024-02-12T15:54:12Z

scripts/api/data/metadatablocks/citation.tsv

@@ -284,12 +284,12 @@
 	language	Shona		143	sna	sn
 	language	Sinhala, Sinhalese		144	sin	si
 	language	Slovak		145	slk	slo	sk
-	language	Slovene		146	slv	sl


What's the rationale for dropping "Slovene" altogether? ISO 639-2 lists both, and it looks like "Slovene" may still be the preferred name scientifically (the Wikipedia article lists it first, for example: https://en.wikipedia.org/wiki/Slovene_language). I'll just keep "Slovene" as one of the alternate forms.

We proposed replacing the main language name, but yes, it is possible to add the other name as an alternate instead.
These documents also list "Slovene" as a secondary name:

https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

Any chance we could leave "Slovene" as the main name, and simply add "Slovenian" as an alternate? - i.e., have this in citation.tsv:
language Slovene 146 slv sl Slovenian

The end result will be the same, both names will be valid and acceptable. It's just that changing the main name makes the block update so much more complicated (the block update API gets confused, so a Flyway database update becomes necessary).
I have a couple of similar questions about the other fields.

To be clear, if there is an objective reason to want to change the main entry for a specific language, it's not that big of a problem to add a flyway script to the release.
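
The alternate-name approach suggested for Slovene can be sketched roughly as follows. This is only an illustration, assuming the column layout visible in the diff (an empty first column, the field name, the main value, an identifier, the display order, then any number of alternate values). The `CvvEntry`, `parse_cvv_row` and `accepted_names` names are hypothetical and are not Dataverse's actual import code.

```python
from dataclasses import dataclass, field


@dataclass
class CvvEntry:
    """One controlled vocabulary row (hypothetical model, for illustration)."""
    dataset_field: str
    value: str            # main name, e.g. "Slovene"
    identifier: str
    display_order: int
    alternates: list[str] = field(default_factory=list)


def parse_cvv_row(line: str) -> CvvEntry:
    """Split a tab-separated row; columns after the display order are alternates."""
    cols = line.rstrip("\n").split("\t")
    # cols[0] is the empty leading column seen in the diff.
    alternates = [c for c in cols[5:] if c.strip()]
    return CvvEntry(cols[1], cols[2], cols[3], int(cols[4]), alternates)


def accepted_names(entry: CvvEntry) -> set[str]:
    """Assumption: a name is accepted if it equals the main value or an alternate."""
    return {entry.value, *entry.alternates}


row = "\tlanguage\tSlovene\t\t146\tslv\tsl\tSlovenian"
entry = parse_cvv_row(row)
print(entry.value)
print(sorted(accepted_names(entry)))
```

With this row, "Slovene" stays the main value while "Slovenian" is accepted alongside the ISO codes, which is the outcome described above without renaming the main entry.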

landreev · 2024-02-15T00:17:52Z

scripts/api/data/metadatablocks/citation.tsv

@@ -220,7 +220,7 @@
 	language	Khmer		79	khm	km
 	language	Kikuyu, Gikuyu		80	kik	ki
 	language	Kinyarwanda		81	kin	rw
-	language	Kyrgyz		82											
+	language	Kirghiz, Kyrgyz		82	kir	ky									


I'm almost certain that this is not what we want to do if the goal is to have both the "Kirghiz" and "Kyrgyz" spellings accepted as valid, because the above would mean that neither spelling is valid by itself, and only the full literal string "Kirghiz, Kyrgyz" is acceptable. So we should make the other spelling an alternate here as well, as in:
language Kyrgyz 82 kir ky Kirghiz

Just as I typed this, I realized that we already have quite a few such comma-separated entries in the block (Navajo, Navaho, etc.)!

We will need to fix them all. And there is no way to do that, other than via a database update.

(To correct myself, there is definitely a way to address this without a Flyway update. For example, we can keep Navajo, Navaho as the main name, but add each of the two spellings as a separate alternate, as in:
language Navajo, Navaho 109 nav nv Navajo Navaho)
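
To locate every entry affected by the comma problem described above, a small check along these lines could be run against the block file. It is a sketch that assumes the column layout shown in the diff; `find_unsplit_names` is a hypothetical helper, and the sample rows are illustrative. It cannot tell a list of spellings from a single name that happens to contain a comma, so its output would still need manual review.

```python
def find_unsplit_names(tsv_lines, field_name="language"):
    """Yield (main_value, missing_parts) for comma-separated main values
    whose individual parts are not all listed as alternates."""
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip headers, short lines, and rows for other fields.
        if len(cols) < 5 or cols[1] != field_name:
            continue
        main_value = cols[2]
        if "," not in main_value:
            continue
        # Everything after the display order column is treated as an alternate.
        alternates = {c.strip() for c in cols[5:] if c.strip()}
        parts = [p.strip() for p in main_value.split(",")]
        missing = [p for p in parts if p not in alternates]
        if missing:
            yield main_value, missing


sample = [
    "\tlanguage\tKyrgyz\t\t82\tkir\tky\tKirghiz",
    "\tlanguage\tKirghiz, Kyrgyz\t\t82\tkir\tky",
    "\tlanguage\tNavajo, Navaho\t\t109\tnav\tnv\tNavajo\tNavaho",
]
for main_value, missing in find_unsplit_names(sample):
    print(f"{main_value!r} is missing alternates: {missing}")
```

Only the "Kirghiz, Kyrgyz" row is reported here, since the Navajo row already lists each spelling as its own alternate. In practice the lines would be read from scripts/api/data/metadatablocks/citation.tsv.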

landreev · 2024-02-15T17:33:07Z

And to further state the obvious, I was focusing on how these changes may affect metadata imports. I'm assuming that the intent behind the proposed changes to the main language names ("Slovenian", "Swahili (macrolanguage)", etc.) was to change how they appear in the UI menus (?). Both are important concerns, and it should be possible to reconcile them.

landreev · 2024-03-29T21:18:35Z

@stevenferey I was waiting for some feedback, but then got distracted by working on other things, so I never finished looking into this (apologies). I would still like to know whether it is really necessary to change the main controlled vocabulary value, such as changing
Swahili
to
Swahili (macrolanguage), along with a couple of other similar changes proposed in this PR. As I was saying, this can be done if there is a real need, but since changes like this cannot be handled by our normal metadata block update procedure, a direct database update via Flyway would be needed, and we generally try to avoid that.
Could you please tell me what your primary use case is, and the main reason for wanting to make these changes (to the Swahili, Nepali and Slovene languages):
a) metadata exports
b) metadata imports
c) what is shown in the CVV menu on the edit metadata page?
Adding missing ISO codes and alternative spellings on the other hand is not controversial at all.

stevenferey · 2024-04-10T14:05:17Z

Hello @landreev,

We have no real need to change the main names of the Swahili, Nepali and Slovenian languages. The goal was to be in agreement with the ISO standard, but the sources of information sometimes differ.
We can indeed keep the language names unchanged and limit the changes to ISO codes and alternative spellings,

like the proposal for the Slovenian language:
language Slovene 146 slv sl Slovenian

Thanks a lot

landreev · 2024-04-11T18:25:26Z

As I mentioned earlier, in place of this PR, I created my own branch and opened a new PR: #10481.

DS-INRAE · 2024-04-16T07:24:38Z

@landreev as the new PR has been reviewed, should we close this one already :) ?

landreev · 2024-04-16T21:52:39Z

@landreev as the new PR has been reviewed, should we close this one already :) ?

Yes, we can close it now, or we can wait until #10481 is merged - I don't have a strong preference.

stevenferey · 2024-04-17T09:33:10Z

Thank you for your feedback,

Reviews of this PR are reflected in PR #10481
I propose to close this PR.
Thanks.

Initial proposal without script

91a950a

DS-INRAE mentioned this pull request Dec 20, 2023

Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Closed

landreev mentioned this pull request Feb 12, 2024

Feature Request/Idea: Sanitize languages controlled vocabulary values #8243

Closed

landreev added a commit that referenced this pull request Feb 12, 2024

pushing the language controlled vocab additions suggested in #10197 i…

8796a1d

…nto a local branch; #8243

landreev reviewed Feb 12, 2024

landreev reviewed Feb 15, 2024

pdurbin added the "Size: 3" label (a percentage of a sprint; 2.1 hours) Feb 28, 2024

Merge branch 'IQSS:develop' into 8243-sanitize-languages-controlled-v…

25cbe9f

…ocabulary-values

landreev mentioned this pull request Apr 11, 2024

8243 improve language controlled vocab #10481

Merged

stevenferey closed this Apr 17, 2024

cmbz added the GREI 2 Consistent Metadata label May 2, 2024

luddaniel deleted the 8243-sanitize-languages-controlled-vocabulary-values branch September 20, 2024 09:19
