Special characters in dataset metadata prevent Dataverse repositories from harvesting from Harvard Dataverse #72
Comments
This might be related to invisible characters in the metadata of datasets. I was asked to try to find special characters that are breaking harvesting, and I've found that datasets with the hexadecimal character 0x0C in their description fields are breaking metadata export in general (in the UI, over the API, and over OAI-PMH) and making it impossible to edit the dataset metadata, at least in the UI: nothing happens when I click the dataset's Edit Metadata button. You can find 35 datasets with this character in their description fields by querying the database.
Three of the 35 datasets were published earlier this year.
I can reproduce this problem in Demo Dataverse, so this issue should probably be moved to the general Dataverse GitHub repo, or maybe the information in this issue should be moved into an existing issue there. This doesn't seem to affect dataset publishing, so it doesn't seem to fit the scope of IQSS/dataverse#3328. |
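For illustration, a minimal sketch in Java of the kind of check involved (this is not the database query referenced above, and the example string is made up; 0x0C is the form feed control character):

```java
public class FormFeedCheck {
    public static void main(String[] args) {
        // Hypothetical description value; "\f" is the form feed character, 0x0C.
        String description = "A dataset description with a hidden\fform feed in it.";

        // 0x0C is not a legal character in XML 1.0, so exporters that produce XML
        // (oai_dc, oai_datacite, DDI) fail on metadata containing it.
        boolean hasFormFeed = description.indexOf('\f') >= 0;
        System.out.println("Contains 0x0C: " + hasFormFeed);
    }
}
```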
@jggautier I harvested the IQSS set from the Harvard repo too using dataverse_json and it succeeded using version 4.20. I'm investigating the 2668 failures, but I'm wondering what's different about demo dataverse compared to my local version. Do you know? (Dashboard statuses: FAILED, INPROGRESS.) |
Thanks for investigating! =) I don't know what's different about demo dataverse compared to your local Dataverse installation, or at least I can't imagine what difference might help explain why your local installation is able to harvest more datasets than demo dataverse is.

Earlier today I told demo dataverse to harvest Harvard Dataverse's IQSS set again. It's at about 12,000 so far and chugging along. I'll check tomorrow to see how much it gets. I don't know how to get a log message from demo dataverse. @djbrooke, would a developer be available to get @JingMa87 this info?

I wrote in an earlier comment that I think special characters in the metadata of some datasets are causing at least some of these errors. I think that because when you view one of Harvard Dataverse's smaller sets, https://dataverse.harvard.edu/oai?verb=ListRecords&set=Princeton_Authored_Datasets&metadataPrefix=oai_datacite (or any metadataPrefix, like oai_dc), my Firefox browser reports the first error, which I think involves a dataset with metadata that contains the hexadecimal character I mentioned. I'd be interested to know if the 37 datasets I found with that hexadecimal character are among the 2,668 datasets that your local installation couldn't harvest. If you think that'll be helpful, I can send the DOIs to you (or you could send the 2,668 DOIs to me).

I realize that this issue is also larger than harvesting, since as I've mentioned the special character is preventing the metadata of these problem datasets from being exported in any way: through the UI, with Dataverse's APIs, or over OAI-PMH. Also, these datasets' metadata can't be edited through the UI - people aren't able to update their metadata. So I wonder if this GitHub issue should be reframed so that it gets prioritized a little higher. (More of a question for @djbrooke and @scolapasta). |
Special characters / 2668 IQSS set failures / Dublin Core |
Ah, I see. When testing harvesting from one Dataverse repository to another, I usually use the dataverse_json prefix since it has the least metadata loss. So I tried harvesting the Princeton_Authored_Datasets set again using dataverse_json, and Demo Dataverse failed to harvest all records. But we'd also like non-Dataverse-based repositories to harvest this set, so I tested harvesting using oai_dc (81 records harvested/186 failed) and oai_datacite (0 records harvested/all failed).

Since I can't tell what differences between Demo Dataverse and the local Dataverse installation you're using might be causing our different testing results, I'll ask @djbrooke to see if a developer who can get more info about Demo Dataverse can help. Maybe using Demo Dataverse isn't the right approach anyway, since it can't really be used to predict how successfully Dataverse repositories can distribute metadata records to other systems, which we have no control over.
Trying to make sure I understand this :) Since the format is JSON, and OAI-PMH requires XML, the XML includes an API call, which the harvesting Dataverse repository then uses to try to get the JSON metadata, right? And the problem is that the harvesting Dataverse, e.g. your local testing Dataverse, doesn't like these custom keys in the JSON? "ARCS1" is a metadata field from a metadatablock that's only enabled in Harvard Dataverse, so I wouldn't expect other Dataverse installations to know what to do with it. But instead of ignoring it, the harvest fails. Is that right?

I can try to test this, too, though I'm becoming less confident in how much help I can provide. Since we promote using dataverse_json when Dataverse repositories harvest from each other (to reduce metadata loss), this sounds like a big problem for harvesting between Dataverse repositories, but not immediately more urgent than getting harvesting to work when using the Dublin Core and DataCite standards (my immediate need being to help the library from Princeton harvest the Princeton_Authored_Datasets set into their non-Dataverse system). Thanks again for helping troubleshoot. I know you're working on other Dataverse issues and I hope this has been somewhat helpful to you :) |
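If that reading is correct, the behavior needed on the harvesting side is roughly "skip what you don't recognize." A minimal sketch of that idea, not Dataverse's actual import code (the field names and the known-type set below are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class UnknownFieldSketch {
    // Field types this hypothetical harvesting installation understands.
    private static final Set<String> KNOWN_TYPES = Set.of("title", "author", "dsDescription");

    public static void main(String[] args) {
        // Stand-in for fields parsed out of a harvested dataverse_json record;
        // "ARCS1" comes from a metadata block enabled only on Harvard Dataverse.
        Map<String, String> harvestedFields = new LinkedHashMap<>();
        harvestedFields.put("title", "Example harvested dataset");
        harvestedFields.put("ARCS1", "value the harvesting installation does not recognize");

        for (Map.Entry<String, String> field : harvestedFields.entrySet()) {
            if (!KNOWN_TYPES.contains(field.getKey())) {
                // Log and skip the unknown field instead of failing the whole record.
                System.out.println("Skipping unknown field type: " + field.getKey());
                continue;
            }
            System.out.println("Importing " + field.getKey() + " = " + field.getValue());
        }
    }
}
```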
Since I'm running the newest version of Dataverse, I think that other installations will harvest correctly when they update to this version.
This is 100% correct and there's a good chance I can fix this one so I'll look into it. Is there a GitHub issue for this on the main dataverse project? Otherwise I can make one. |
@jggautier I made an issue and fix for the unknown types in the dataverse_json: IQSS/dataverse#7056 |
Thanks so much @JingMa87! |
@jggautier
Something like a simple replace function might be enough: https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java |
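A minimal sketch of that kind of cleanup in Java, assuming the goal is just to drop ASCII control characters like 0x0C (the input string is made up, and whether newlines or tabs should be preserved depends on the field):

```java
public class StripControlChars {
    public static void main(String[] args) {
        // "\f" is the form feed character (0x0C) that breaks the exports.
        String description = "First part of a description\fsecond part.";

        // \p{Cntrl} matches ASCII control characters (0x00-0x1F and 0x7F), including 0x0C.
        // Replacing with a space avoids gluing words together; add exceptions for \n or \t if needed.
        String cleaned = description.replaceAll("\\p{Cntrl}", " ");

        System.out.println(cleaned);
    }
}
```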
Thanks for following up on this, and for discovering and working on IQSS/dataverse#7056, which addresses what seems like part of the cause of Dataverse repos not being able to harvest from Harvard Dataverse (this broader issue). I think it'll be helpful to rename this GitHub issue to be specifically about failures caused by special characters in the metadata, so I'll do that. |
Related to #39 |
Specifically, UNC Dataverse and Demo Dataverse are unable to harvest all records in Harvard Dataverse's "IQSS" set, and Demo Dataverse is unable to harvest all records in Harvard Dataverse's default set. Both the IQSS set and the default set contain metadata records of all datasets deposited in Harvard Dataverse.
UNC's repository (Dataverse version 4.16) has harvested about 20,000 of Harvard Dataverse's 32,000+ datasets from the IQSS set using the oai_ddi metadata format. Don Sizemore let me know that the repository's superuser dashboard reports that the last harvesting attempt was Apr 29, 2018 and that it's been "INPROGRESS" since.
Demo Dataverse (version 4.20) harvested fewer than 14,000 of Harvard Dataverse's datasets from the IQSS set using the dataverse_json format; its superuser dashboard reported that the attempt on May 11, 2020 FAILED. Then I set Demo Dataverse to harvest from the default set, also using the dataverse_json format, and that attempt failed too, again harvesting fewer than 14,000.