HD SO documents incorrectly rejected for no identifier
#47
Comments
Turns out this is an issue with mnlite. The code that should handle this is in |
The problem mnlite had with these datasets is that there are two Dataset nodes in the graph, and the first one has a null identifier:
{
"@context": "http://schema.org/",
"@graph": [
{
"id": "_:b7",
"type": "Dataset",
"creator": null,
"identifier": null,
"text": "Content made available under CC BY-NC-ND 4.0 license. "
},
{
"id": "https://doi.org/10.7910/DVN/ZTWAFQ",
"type": "Dataset",
...
"identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
...
}
]
}
I added code in |
This code has been tested and works on a sample sitemap file containing one of the offending datasets. I have restarted the scrape and will monitor it, closing this issue if the fix holds. |
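For readers following along, here is a minimal sketch of the kind of check described above. It is not the actual mnlite code: the function name `pick_dataset_identifier` is hypothetical, and it assumes a compacted document whose nodes use the `type` and `identifier` keys as in the excerpt, with `identifier` given as a plain string. It walks every Dataset node in the `@graph` and returns the first non-empty identifier instead of stopping at the first Dataset node.

```python
import json


def pick_dataset_identifier(doc):
    """Return the first non-empty identifier among Dataset nodes, or None.

    Hypothetical helper for illustration only; not taken from mnlite.
    """
    # A document without "@graph" is treated as a single node.
    nodes = doc.get("@graph", [doc])
    for node in nodes:
        node_type = node.get("type", node.get("@type"))
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Dataset" not in types:
            continue
        identifier = node.get("identifier")
        # Skip Dataset nodes whose identifier is null or empty,
        # such as the blank node "_:b7" in the excerpt.
        if identifier:
            return identifier
    return None


# Sample document modeled on the excerpt, with the elided fields left out.
doc = json.loads("""
{
  "@context": "http://schema.org/",
  "@graph": [
    {"id": "_:b7", "type": "Dataset", "creator": null, "identifier": null},
    {"id": "https://doi.org/10.7910/DVN/ZTWAFQ", "type": "Dataset",
     "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ"}
  ]
}
""")

print(pick_dataset_identifier(doc))  # https://doi.org/10.7910/DVN/ZTWAFQ
```

A check that only inspects the first Dataset node would see the null identifier on `_:b7` and reject the document, which matches the behavior reported in this issue.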
I’m noticing an issue when parsing some Harvard Dataverse metadata. For some reason sonormal rejects this document as having no identifier when it clearly does. There are many others like it that are being rejected as well (about 12,000 out of a total of 27,070 scraped so far). Unfortunately, I think this means that I will have to debug and restart the scrape.
Even weirder, I wrote a few lines to debug a similar issue a while back, and that code does detect the identifier. From adjacent lines in the log:
Copy of JSON-LD for the playground in case the link doesn't work:
Copy of context: