Support the full ISO 639-3 list of languages #10578
Just some high-level, doc, and test questions.
src/main/java/edu/harvard/iq/dataverse/api/DatasetFieldServiceApi.java
Force-pushed from 0d8e44d to e3bfe6c.
@@ -0,0 +1,12 @@
The Controlled Vocabulary Values list for the metadata field Language in the Citation block has been extended.
Roughly 300 ISO 639 languages added.
Why only 300? Where did that subset come from? Are these 639-2 codes (guessing from zho which is in 639-2 and as far as I can see not in 639-3)?
The list came from the Library of Congress. Yes, they are 639-2 codes. Once I merged all the codes and removed the duplicates the list increased by ~300.
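The merge-and-dedupe step described here can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `LanguageMerge` class, the code-to-name map layout, and the sample codes are all assumptions for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageMerge {
    // Merge incoming (code, name) pairs into the existing code->name map,
    // keeping insertion order and skipping codes that are already present,
    // so duplicates between the two sources are dropped.
    public static Map<String, String> merge(Map<String, String> existing,
                                            List<String[]> incoming) {
        Map<String, String> merged = new LinkedHashMap<>(existing);
        for (String[] entry : incoming) {
            merged.putIfAbsent(entry[0], entry[1]);
        }
        return merged;
    }
}
```

Under this scheme, only codes not already in the existing list grow the total, which matches the observation that merging the LoC list into the current set added only ~300 net entries.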
The PR says that it "closes #8578". That issue is specifically about figuring out how to support the ISO 639-3 list (thousands of languages). All the recent discussion in that issue centered on that - whether it was feasible to just add that many entries to the CVV, whether the pulldown menu would still be usable in the UI, and/or whether we wanted to consider using an external CV instead, etc.
I can understand it if we decide to handle this via incremental improvements and first extend the list to the full 639-2 before proceeding with 639-3. But I'm not seeing any discussion of this approach in either #8578 or this PR. And, at the very least, I don't think this PR should be closing that issue.
Please note that there is real interest in the full ISO 639-3 support, see, for example, #10481, where a user is specifically requesting it.
I can pull this feature back and load the full 639-3 languages. The file seems like a lot, but if I remember correctly it's one line per language, and we combine the language codes, so the final list would be shorter. Let me take another look at this.
Moving back to in progress
To be added to the 6.4 release instructions:

Update the Citation block to incorporate the additional controlled vocabulary for languages:
This should be in the guides somewhere as well, not just in a release note.
added more
@stevenwinship There are 3 extra .tab files in
They were part of the original zip file. I don't think we need them, so I'll delete them.
Glad to hear that the extended list of languages does not affect the performance of the UI.
I am ok with/like the idea of not adding the entire CV to the citation block .tsv as distributed, and instead treating it as an "extension pack" that can be added optionally.
As for the implementation, specifically the fact that extending this CV is handled in a very unique and customized way (using the Lib. of Congress tabular file format instead of our normal .tsv format; and adding a dedicated API for extending just this specific CV), I am ok with it, I think. (I can actually see the advantage in being able to use the LoC-distributed list directly whenever they update it going forward). However, I do have a reputation within this team for being fairly cavalier in my willingness to adopt non-standard solutions and hacks. So I would like to ask for a second opinion. @qqmyers What do you think?
Honestly, I'm not a fan of having a new mechanism and an API call dedicated to parsing this particular language-specific file format, particularly since it relies on one citation.properties file that has to cover both (plus the language-specific properties file variants). I also think the current code has problems with the implementation of the merge of existing and new entries (see comments). Some of those are probably fixable in code (not sure about all - see the Pular example), but wouldn't exist if we didn't try an auto-merge.

It may also be a pain when the citation block is updated. If that happens and you reload the block (say due to some non-language change to some unrelated metadata field), then reloading the block will restore just the alternates from the block and drop any from the ISO file. So unless we add the step of using the new API to update from the ISO file to the release notes as well, we'd have unintended changes.

I do like the idea of this being optional/not a change to the citation block that everyone has to adopt, but I'm not sure doing it with a separate API and merging is worth it (versus, for example, a Javascript that might allow filtering to common/complete lists for selection).
See inline comments.
cvv = datasetFieldService.findControlledVocabularyValueByDatasetFieldTypeAndStrValue(dsv, codesIterator.next(), true);
}

// if it is found we need to merge the alternate codes since the next step will delete all existing alternate codes before adding the new ones
With entries in the citation block, the code automatically deletes old alternate values and just uses the new list. Is there a reason that this code should try to preserve old alternates? Is this to keep the additional non-code alternates we have in some cases (like Haitian and Haitian Creole, which we have for hat because the name is "Haitian, Haitian Creole")? Looks like this could be a problem for cases like ful, where we have Pular as one of the alternates, whereas Pular is a separate entry as fuf in the ISO file.
I'm also confused - it looks like this code would look for existing cvv entries that are alternates of an entry in the ISO file and then add its alternates as more alternates of the new ISO entry, without deleting the original cvv entry? Are there any such cases?
The merge does not remove the original codes since they may not be in the new file. This is why the original alts get saved and merged to the new list
OK - so this is a problem for Pular for example.
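The Pular hazard flagged in this thread can be illustrated with a tiny sketch. This is a hypothetical helper, not Dataverse code (the `owners` method and the data layout are assumptions): if the old alternate "Pular" on ful is preserved while the ISO file introduces fuf, whose primary name is Pular, the same label ends up resolving to two codes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AlternateMergeDemo {
    // Return every code whose alternate set contains the given label.
    // More than one owner means the label is ambiguous after the merge.
    public static List<String> owners(Map<String, Set<String>> alternates, String label) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : alternates.entrySet()) {
            if (e.getValue().contains(label)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With "Pular" kept as an alternate of ful and also present as the fuf entry, `owners` finds two codes for one label, which is the ambiguity an auto-merge that never drops stale alternates cannot avoid.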
String line = null;
String splitBy = "\t";
int lineNumber = 0;
int offset = 200; // number of existing languages // TODO: get the number from db
there aren't 200 entries in the current develop branch citation.tsv
I think there are 187; this just gives wiggle room. Also, the display order is actually redundant if you always want to order the list alphabetically. A simple SQL script can be written to re-order the list.
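The re-ordering suggested here could look roughly like this (sketched in Java rather than SQL; the `ReorderSketch` name and the rule of pinning "Not applicable" last are assumptions drawn from the surrounding discussion):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReorderSketch {
    // Sort display names alphabetically, pinning "Not applicable" to the end;
    // each entry's index in the result would become its new displayOrder.
    public static List<String> reorder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparing((String n) -> n.equals("Not applicable"))
                              .thenComparing(Comparator.naturalOrder()));
        return sorted;
    }
}
```

A one-time script like this sidesteps the need to keep displayOrder meaningful during incremental merges, since the order can always be regenerated from the names.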
FWIW: although I don't think it is necessary given the current code, we've so far tried to keep these in sequence without gaps. I still think the logic has a problem, though - if the ISO file adds a line in the middle, you'll try to add the new entry at offset + line number, and there will already be an entry with that displayOrder number, and it won't change, so two will have the same displayOrder.
// Now call parseControlledVocabulary to create/update the Controlled Vocabulary Language
// values: unused, type, displayName, identifier, display order, alt codes...
String displayName = cvv != null ? cvv.getStrValue() : name;
String displayOrder = String.valueOf(cvv != null ? cvv.getDisplayOrder() : lineNumber + offset);
This doesn't seem to give a linear order from 0 to max. Suppose the first 200 lines match existing entries and those have display order 0-199. The next entry gets displayOrder = 200 (offset) + 200 (line number).
Also - how would this work with future changes? I.e., if the standard changes to split one language into two, the new entry will end up being last (because there are matching entries for everything else in the db) - is that desirable? (Not so sure displayOrder is so important with 8K entries, but it seems like the approach here will result in out-of-alphabetical-order entries over time.)
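The gap and collision behavior discussed above can be made concrete with a toy sketch. The class and method names and the sample numbers are assumptions for illustration, not the PR's code:

```java
public class DisplayOrderDemo {
    // Scheme under review: an entry that matches an existing CVV keeps its
    // displayOrder; a new entry gets lineNumber + offset. With existing orders
    // 0-199 and offset 200, the first genuinely new line lands at 400 (a gap),
    // and an ISO line inserted mid-file can land on an order already in use.
    public static int naiveOrder(Integer existingOrder, int lineNumber, int offset) {
        return existingOrder != null ? existingOrder : lineNumber + offset;
    }

    // Alternative: append new entries just past the current max, keeping orders dense
    // (though new entries still sort after all existing ones, not alphabetically).
    public static int denseOrder(Integer existingOrder, int currentMax) {
        return existingOrder != null ? existingOrder : currentMax + 1;
    }
}
```

Neither variant keeps the list alphabetical as the ISO file evolves; that is the trade-off weighed in the rest of this thread.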
It's hard to set the display order to remain alphabetical, especially if the Dataverse installation isn't using English. There are 180-ish languages in the original tsv file. A SQL script could be created to re-order them, forcing "Not applicable" last. The ISO 639-3 file actually has a "zxx No linguistic content" entry at the end. It really isn't that noticeable in the UI, since the list is filtered on the partial string you enter. I did reorder the tab file from the LoC, since that too was ordered by the code and not the display string. I thought it was best not to mess with the display order, since that could be set by the admins of the installation. Maybe it would be better to bump up the offset to 1000.
The ISO file itself seemed to be alphabetical. After one run of the code here, it looks like all of the current citation.tsv entries come first and then everything from the ISO file is in the order from that file. If there are changes in the ISO file, though, the current code won't maintain its internal order (either a duplicate displayOrder number as-is, or the new entry will come after all existing entries if you update the code to query the db for the number of entries (or max displayOrder) as in the todo). Having the display order depend on when you run the API, or whether you ran it for intermediate changes in the ISO file, doesn't seem optimal. (Versus, for example, having a separate script to generate an always-in-sequence version of what would go in the citation.tsv file, or even just using only the line number from the ISO file as the displayOrder - or do we have citation.tsv entries that are not in the ISO file?)
In light of having confirmed that the full list in the CV does not make the UI unusable, I would at least consider adding it in full to citation.tsv (and just giving up on the optional aspect of the expansion). My $0.02?
This is being replaced by #10762, so this should close.
What this PR does / why we need it: Some codes are still not managed; in the cases encountered, frm (Medieval French) and fro (Old French).
Which issue(s) this PR closes:
Closes #8578
Special notes for your reviewer:
Suggestions on how to test this: See setup instructions in ISO639IT.java test
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No. Only new languages in the list of languages
Is there a release notes update needed for this change?: Yes, to be included in this PR.
Additional documentation: