Can't harvest when Dublin core field language is set #8139

tcoupin · 2021-10-12T15:24:29Z

I am trying to harvest a record from an OAI-PMH server. The record is formatted in the oai_dc schema and has its language field set to fr (oai_dc specifies that language must be a two-letter ISO 639-1 code).

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
	<responseDate>
		2021-10-12T15:14:19+00:00
	</responseDate>
	<request verb="GetRecord" identifier="https://doi.org/10.23708/herbier-guyane-ird" metadataPrefix="oai_dc">
		http://doi2pmh.ird.fr/oai/
	</request>
	<GetRecord>
		<record>
			<header>
				<identifier>
					https://doi.org/10.23708/herbier-guyane-ird
				</identifier>
				<datestamp>
					2021-10-12T20:21:00+00:00
				</datestamp>
				<setSpec>
					Doi2Pmh
				</setSpec>
				<setSpec>
					UMR-AMAP
				</setSpec>
			</header>
			<metadata>
				<dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
					<identifier>
						https://doi.org/10.23708/herbier-guyane-ird
					</identifier>
					<publisher>
						UMR AMAP. CIRAD, CNRS, INRAE, IRD, Univ. Montpellier (France)
					</publisher>
					<title>
						L'herbier IRD de Guyane
					</title>
					<creator>
						Gonzalez, Sophie
					</creator>
					<creator>
						Bilot-Guérin, Véronique
					</creator>
					<creator>
						Delprete, Piero
					</creator>
					<creator>
						Geniez, Chantal
					</creator>
					<creator>
						Molino, Jean-François
					</creator>
					<creator>
						Smock, Jean-Louis
					</creator>
					<creator>
						Théveny, Frédéric
					</creator>
					<creator>
						IRD
					</creator>
					<creator>
						CIRAD
					</creator>
					<creator>
						INRAE
					</creator>
					<creator>
						Université de Montpellier
					</creator>
					<creator>
						Herbier de Guyane, Cayenne, Guyane française
					</creator>
					<creator>
						CNRS
					</creator>
					<description>
						L’Herbier IRD de Guyane (CAY), joue un rôle central dans l’acquisition et la diffusion des connaissances sur la flore de la Guyane française, et plus largement du Bouclier Guyanais et de l'Amazonie. Il a été créé en 1965 par R.A.A. Oldeman, et abrite aujourd’hui près de 200 000 spécimens collectés pour la plupart en Guyane française, mais aussi au Surinam, au Guyana, au Brésil (notamment dans l’État de l’Amapá) et au Vénézuela (État d'Amazonas).
					</description>
					<subject>
						FOS: Biological sciences
					</subject>
					<language>
						fr
					</language>
					<type>
						article
					</type>
				</dc>
			</metadata>
		</record>
	</GetRecord>
</OAI-PMH>

But the harvest fails with the following error:

Exception processing getRecord(), oaiUrl=https://doi2pmh.ird.fr/oai/, identifier=https://doi.org/10.23708/herbier-guyane-ird, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'fr' does not exist in type 'language')

Language is a controlled vocabulary field and values are human readable: see https://github.com/IQSS/dataverse/blob/develop/scripts/api/data/metadatablocks/citation.tsv#L186

I think the controlled vocabulary should use ISO 639-1 codes as its values, and the human-readable display values should come from translation files.

Removing the language field from the record fixes the harvesting.

doigl · 2022-03-09T14:05:19Z

Same problem here with oai_dc and language "en":
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'en' does not exist in type 'language')

pdurbin · 2022-03-09T14:33:31Z

In citation.tsv I see lines like this:

	language	English		40	eng
	language	French		47	fra

I assume that what's needed is some way to map "en" to "eng" and "fr" to "fra".
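
To make the idea concrete, here is a minimal sketch of such a mapping. It is written in Python purely for illustration (Dataverse itself is Java) and is not how Dataverse handles this. The names `ISO_639_1_TO_3` and `to_three_letter` are hypothetical, and the table only covers the two codes reported in this issue.

```python
# Hypothetical lookup table: 2-letter code -> 3-letter code as listed in citation.tsv.
# Only the two codes mentioned in this thread are included.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
}


def to_three_letter(code: str) -> str:
    """Map a 2-letter code to its 3-letter form; return anything else unchanged.

    The input is stripped and lowercased first, because the harvested
    <language> element may carry surrounding whitespace.
    """
    normalized = code.strip().lower()
    return ISO_639_1_TO_3.get(normalized, normalized)
```

Values that are not in the table (including ones that are already 3-letter codes) pass through untouched, so the controlled vocabulary check would still be the place where truly unknown values get rejected.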

It looks like the 3 letter ISO-639-3 codes were added in pull request #7690 because of an inability to harvest "eng" datasets from Zenodo in #7638.

This issue seems to be related:

Feature Request/Idea: Sanitize languages controlled vocabulary values #8243

doigl · 2022-03-09T16:05:54Z

@pdurbin : Yes, this seems to be the case. I tried to find a place in the code where such a mapping could take place, but wasn't successful (perhaps by adding and handling a new further-processing column in the foreignmetadatafieldmapping table?). I thought about just removing the language entry from the foreignmetadatafieldmapping table as an ugly hack (language is not really an important field for the harvested datasets), but I am unsure about the side effects.

qqmyers · 2022-03-09T17:58:43Z

#7638 (comment) indicates that we can have multiple alternates - could en be added into the tsv without removing eng, etc?

landreev · 2022-04-20T20:27:35Z

Yes, this is just a matter of adding more alternative variants to the list of controlled vocabulary values in citation.tsv.
So yes, to add "fr" as a legitimate value you can change the following line in the citation.tsv that we distribute:

	language	French		47	fra

to

	language	French		47	fra	fr

and update the block (curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv)

But yes, we should add all these standard 2-letter codes to the block in the next release.
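
If you need to add several codes, the edit above can be scripted. The following Python sketch is only an illustration, under the assumption that each controlled vocabulary row is tab-separated with a leading empty column, then the field name, the display value, an identifier, the display order, and any alternate values (as in the lines quoted above). The function name `add_alternate` is hypothetical.

```python
def add_alternate(line: str, field: str, value: str, alternate: str) -> str:
    """Append `alternate` to the row for (field, value); return other rows as-is.

    Any trailing newline is dropped from the returned row.
    """
    cols = line.rstrip("\n").split("\t")
    # Assumed layout: ["", field, display value, identifier, display order, alternates...]
    if len(cols) >= 5 and cols[1] == field and cols[2] == value:
        # Skip the append if the alternate is already listed, so reruns are harmless.
        if alternate not in cols[5:]:
            cols.append(alternate)
    return "\t".join(cols)


# Example: turn the French row from this comment into the edited version.
row = "\tlanguage\tFrench\t\t47\tfra"
updated = add_alternate(row, "language", "French", "fr")
```

After rewriting the file this way, the block still has to be reloaded with the curl command shown above.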

mreekie mentioned this issue Mar 10, 2023

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

3 tasks

pdurbin added the Feature: Harvesting label Apr 12, 2022

pdurbin mentioned this issue Apr 21, 2022

Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Closed

mreekie mentioned this issue Mar 10, 2023

Collection: Keep track of list of issues that we want to address as part of 1.4.1 IQSS/dataverse-pm#25

Closed

20 tasks

mreekie added pm.epic.nih_harvesting NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons labels May 9, 2022

tcoupin added a commit to tcoupin/dataverse that referenced this issue May 11, 2022

Fix IQSS#8139 : add iso-639-1 code for language as oai_dc specification

8ec50c8

tcoupin mentioned this issue May 11, 2022

Fix #8139 : add iso-639-1 code for language as oai_dc specification #8689

Merged

kcondon closed this as completed in #8689 May 23, 2022

kcondon added a commit that referenced this issue May 23, 2022

Merge pull request #8689 from tcoupin/citation-adlanguage-2letters

ec83f9b

Fix #8139 : add iso-639-1 code for language as oai_dc specification

pdurbin added this to the 5.11 milestone Jun 2, 2022

mreekie added the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Nov 4, 2022

mreekie added the pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues label Mar 20, 2023
