HD SO documents incorrectly rejected for no identifier #47

Closed
iannesbitt opened this issue Nov 9, 2023 · 3 comments

iannesbitt commented Nov 9, 2023

I’m noticing an issue when parsing some Harvard Dataverse metadata. For some reason sonormal rejects this document as having no identifier when it clearly does. There are many others like it that are being rejected as well (about 12,000 out of a total of 27,070 scraped so far). Unfortunately I think this means that I will have to debug and restart the scrape.

Even weirder, I wrote a few lines to debug a similar issue a while back, and that code is detecting the identifier. From adjacent lines in the log:

2023-11-09 22:14:31 [sonormal] DEBUG: Found entry under http://schema.org/identifier:
{
  "@list": [
    {
      "@value": "https://doi.org/10.7910/DVN/4NLGYN"
    }
  ]
}
2023-11-09 22:14:31 [scrapy.core.scraper] WARNING: Dropped: JSON-LD no identifier: https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/4NLGYN/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20231109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231109T221428Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=b9d1e1693a878992bd190cab58b625f95bb2b2fe00f6f5923245135939b6bd6c

Copy of JSON-LD for the playground in case the link doesn't work:

{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/10.7910/DVN/ZTWAFQ","identifier":"https://doi.org/10.7910/DVN/ZTWAFQ","name":"A11_19237.JPG","creator":[{"name":"Master, Daniel M.","affiliation":"(Wheaton College)"},{"name":"Stager, Lawrence E.","affiliation":"(Harvard University)"}],"author":[{"name":"Master, Daniel M.","affiliation":"(Wheaton College)"},{"name":"Stager, Lawrence E.","affiliation":"(Harvard University)"}],"datePublished":"2021-10-19","dateModified":"2021-10-19","version":"1","description":["Link to OCHRE database: http://pi.lib.uchicago.edu/1001/org/ochre/b5450665-5d9a-4c91-933b-8d4b84982639"],"keywords":["Arts and Humanities","Archaeology"],"license":{"@type":"Dataset","text":"Content made available under CC BY-NC-ND 4.0 license. "},"includedInDataCatalog":{"@type":"DataCatalog","name":"Harvard Dataverse","url":"https://dataverse.harvard.edu"},"publisher":{"@type":"Organization","name":"Harvard Dataverse"},"provider":{"@type":"Organization","name":"Harvard Dataverse"},"funder":[{"@type":"Organization","name":"The Leon Levy Foundation"}],"spatialCoverage":["Ashkelon"],"distribution":[{"@type":"DataDownload","name":"A11_19237.JPG","fileFormat":"image/jpeg","contentSize":4492640,"description":"http://pi.lib.uchicago.edu/1001/org/ochre/b5450665-5d9a-4c91-933b-8d4b84982639"}]}

Copy of context:

{
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "identifier": {},
    "creator": {}
}
@iannesbitt iannesbitt added the bug Something isn't working label Nov 9, 2023
@iannesbitt iannesbitt self-assigned this Nov 9, 2023
iannesbitt commented

Turns out this is an issue with mnlite. The code that should handle this is in soscan.sonormalizepipeline.SoscanNormalizePipeline.process_item. Transferring this issue.

@iannesbitt iannesbitt transferred this issue from DataONEorg/sonormal Nov 10, 2023
iannesbitt added 4 commits that referenced this issue Nov 10, 2023
iannesbitt commented

The problem mnlite had with these datasets is that the framed JSON-LD contains two Dataset groupings, because the license is incorrectly tagged as a Dataset. After framing, the first Dataset has an empty identifier and the second has the field filled in, and we were looking only at the identifier in the first grouping:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "_:b7",
      "type": "Dataset",
      "creator": null,
      "identifier": null,
      "text": "Content made available under CC BY-NC-ND 4.0 license. "
    },
    {
      "id": "https://doi.org/10.7910/DVN/ZTWAFQ",
      "type": "Dataset",
      ...
      "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
      ...
    }
  ]
}
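
To see why selecting on type Dataset picks up two groupings, one can walk the original record and list every node typed as Dataset. The following is a minimal sketch in plain Python. It is not the JSON-LD framing algorithm and not mnlite code; the helper name `find_typed_nodes` is hypothetical, and the record is trimmed to the relevant fields.

```python
def find_typed_nodes(node, wanted_type, path="$"):
    """Yield (path, node) for every dict in the tree whose @type matches."""
    if isinstance(node, dict):
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if wanted_type in types:
            yield path, node
        # Recurse into every value; nested objects may carry their own @type.
        for key, value in node.items():
            yield from find_typed_nodes(value, wanted_type, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_typed_nodes(value, wanted_type, f"{path}[{index}]")


# Trimmed copy of the Harvard Dataverse record quoted earlier in this issue.
doc = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.7910/DVN/ZTWAFQ",
    "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
    "license": {
        "@type": "Dataset",
        "text": "Content made available under CC BY-NC-ND 4.0 license. ",
    },
}

matches = list(find_typed_nodes(doc, "Dataset"))
for path, node in matches:
    # Report where each Dataset-typed node sits and whether it has an identifier.
    print(path, "identifier" in node)
```

Both the top-level record and the nested `license` object match, and only the top-level one carries an `identifier`. That is consistent with the two groupings in the framed output above.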

I added code in process_item that handles multiple Dataset entries in item['jsonld'] and logs a warning when the first identifier is empty. Technically the source metadata is incorrect (the license should not be typed as a Dataset), but we should be able to handle multiple Dataset items anyway. I will let the folks at Harvard Dataverse (DataONEorg/member-repos#52) know about this issue.
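
For readers who want a feel for the approach, the handling described above might look roughly like the following. This is a sketch only, not the actual mnlite change: the function name `first_dataset_identifier`, the logger name, and the assumption that the framed document lists its Dataset nodes under `@graph` with compacted `id`/`type`/`identifier` keys (as in the framed output shown earlier) are all illustrative.

```python
import logging

logger = logging.getLogger("soscan.example")  # hypothetical logger name


def first_dataset_identifier(framed):
    """Return the first non-empty identifier among Dataset nodes, or None."""
    for position, node in enumerate(framed.get("@graph", [])):
        node_type = node.get("type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Dataset" not in types:
            continue
        identifier = node.get("identifier")
        # After framing, identifier may be null, a single value, or a list.
        # For a list, take its first non-empty entry; other shapes are
        # returned unchanged.
        if isinstance(identifier, list):
            identifier = next((item for item in identifier if item), None)
        if identifier:
            return identifier
        logger.warning(
            "Dataset grouping %d (%s) has no identifier; trying the next one",
            position,
            node.get("id"),
        )
    return None


# Framed output from this issue, trimmed to the fields shown above.
framed = {
    "@context": "http://schema.org/",
    "@graph": [
        {
            "id": "_:b7",
            "type": "Dataset",
            "creator": None,
            "identifier": None,
            "text": "Content made available under CC BY-NC-ND 4.0 license. ",
        },
        {
            "id": "https://doi.org/10.7910/DVN/ZTWAFQ",
            "type": "Dataset",
            "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
        },
    ],
}

identifier = first_dataset_identifier(framed)
```

With the framed document from this issue, the first grouping (the mis-typed license) is skipped with a warning and the identifier of the second grouping is returned. A document in which no Dataset grouping has an identifier yields `None`, and the caller can still drop it as before.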

iannesbitt commented

This code has been tested and is working for a sample sitemap file with one of the offending datasets in it. I have restarted the scrape and will monitor and close this issue if it seems to be working.

@iannesbitt iannesbitt added the v0.1.1 Version 0.1.1 item label Nov 13, 2023
@iannesbitt iannesbitt added this to the 0.1.1 milestone Nov 13, 2023