HD SO documents incorrectly rejected for `no identifier` #47

iannesbitt · 2023-11-09T22:44:45Z

I’m noticing an issue when parsing some Harvard Dataverse metadata. For some reason sonormal rejects this document as having no identifier when it clearly does. There are many others like it that are being rejected as well (about 12,000 out of a total of 27,070 scraped so far). Unfortunately I think this means that I will have to debug and restart the scrape.

Even weirder is that I wrote a few lines to debug a similar issue a while back, and that code is detecting the identifier. From adjacent lines in the log:

2023-11-09 22:14:31 [sonormal] DEBUG: Found entry under http://schema.org/identifier:
{
  "@list": [
    {
      "@value": "https://doi.org/10.7910/DVN/4NLGYN"
    }
  ]
}
2023-11-09 22:14:31 [scrapy.core.scraper] WARNING: Dropped: JSON-LD no identifier: https://dvn-cloud.s3.us-east-1.amazonaws.com/10.7910/DVN/4NLGYN/export_schema.org.cached?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20231109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231109T221428Z&X-Amz-Expires=7200&X-Amz-SignedHeaders=host&X-Amz-Signature=b9d1e1693a878992bd190cab58b625f95bb2b2fe00f6f5923245135939b6bd6c

Copy of JSON-LD for the playground in case the link doesn't work:

{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/10.7910/DVN/ZTWAFQ","identifier":"https://doi.org/10.7910/DVN/ZTWAFQ","name":"A11_19237.JPG","creator":[{"name":"Master, Daniel M.","affiliation":"(Wheaton College)"},{"name":"Stager, Lawrence E.","affiliation":"(Harvard University)"}],"author":[{"name":"Master, Daniel M.","affiliation":"(Wheaton College)"},{"name":"Stager, Lawrence E.","affiliation":"(Harvard University)"}],"datePublished":"2021-10-19","dateModified":"2021-10-19","version":"1","description":["Link to OCHRE database: http://pi.lib.uchicago.edu/1001/org/ochre/b5450665-5d9a-4c91-933b-8d4b84982639"],"keywords":["Arts and Humanities","Archaeology"],"license":{"@type":"Dataset","text":"Content made available under CC BY-NC-ND 4.0 license. "},"includedInDataCatalog":{"@type":"DataCatalog","name":"Harvard Dataverse","url":"https://dataverse.harvard.edu"},"publisher":{"@type":"Organization","name":"Harvard Dataverse"},"provider":{"@type":"Organization","name":"Harvard Dataverse"},"funder":[{"@type":"Organization","name":"The Leon Levy Foundation"}],"spatialCoverage":["Ashkelon"],"distribution":[{"@type":"DataDownload","name":"A11_19237.JPG","fileFormat":"image/jpeg","contentSize":4492640,"description":"http://pi.lib.uchicago.edu/1001/org/ochre/b5450665-5d9a-4c91-933b-8d4b84982639"}]}

Copy of context:

{
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "identifier": {},
    "creator": {}
}

iannesbitt · 2023-11-10T22:08:46Z

Turns out this is an issue with mnlite. The code that should handle this is in soscan.sonormalizepipeline.SoscanNormalizePipeline.process_item. Transferring this issue.

iannesbitt · 2023-11-11T01:31:19Z

The problem mnlite had with these datasets is that the framed JSON-LD contains two Dataset groupings, because the license is incorrectly typed as a Dataset in the source document. After framing, the first Dataset has an empty identifier and the second has the field filled, and we were looking only at the identifier in the first grouping:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "_:b7",
      "type": "Dataset",
      "creator": null,
      "identifier": null,
      "text": "Content made available under CC BY-NC-ND 4.0 license. "
    },
    {
      "id": "https://doi.org/10.7910/DVN/ZTWAFQ",
      "type": "Dataset",
      ...
      "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
      ...
    }
  ]
}

I added code in process_item that handles multiple Datasets in the item['jsonld'], and added warnings to the logger when the first identifier is empty. Technically this is incorrect JSON-LD formatting, but we should be able to handle multiple Dataset items anyway. I will let the folks at Harvard Dataverse (DataONEorg/member-repos#52) know about this issue.
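For anyone following along, here is a minimal sketch of the idea, not the actual code in process_item. The function name `first_dataset_identifier` is hypothetical, and it assumes the framed document looks like the example above (nodes under `@graph`, compacted `type` and `identifier` keys, string identifiers):

```python
from typing import Any, Optional


def first_dataset_identifier(framed: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty string identifier among Dataset nodes.

    Hypothetical helper for illustration; assumes a framed document
    shaped like the example in this issue.
    """
    # Nodes normally sit under "@graph"; otherwise treat the document as one node.
    nodes = framed.get("@graph", [framed])
    for node in nodes:
        node_type = node.get("type", node.get("@type"))
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Dataset" not in types:
            continue
        identifier = node.get("identifier")
        # Normalize a single value to a list so both shapes are handled alike.
        candidates = identifier if isinstance(identifier, list) else [identifier]
        for candidate in candidates:
            if isinstance(candidate, str) and candidate.strip():
                return candidate
        # Empty identifier on this Dataset: keep looking instead of rejecting.
    return None


framed = {
    "@context": "http://schema.org/",
    "@graph": [
        {
            "id": "_:b7",
            "type": "Dataset",
            "creator": None,
            "identifier": None,
            "text": "Content made available under CC BY-NC-ND 4.0 license. ",
        },
        {
            "id": "https://doi.org/10.7910/DVN/ZTWAFQ",
            "type": "Dataset",
            "identifier": "https://doi.org/10.7910/DVN/ZTWAFQ",
        },
    ],
}

print(first_dataset_identifier(framed))  # https://doi.org/10.7910/DVN/ZTWAFQ
```

The point is only that an empty identifier on the first Dataset should not end the search. The real pipeline also logs a warning in that case, which this sketch leaves out.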

iannesbitt · 2023-11-11T01:33:29Z

This code has been tested and is working for a sample sitemap file with one of the offending datasets in it. I have restarted the scrape and will monitor and close this issue if it seems to be working.

iannesbitt added the bug label Nov 9, 2023

iannesbitt self-assigned this Nov 9, 2023

iannesbitt referenced this issue Nov 10, 2023

updating test sitemap (#45, DataONEorg/sonormal#4)

ed7b18b

iannesbitt transferred this issue from DataONEorg/sonormal Nov 10, 2023

iannesbitt added a commit that referenced this issue Nov 10, 2023

fixing no identifier bug (#47)

b6d98e5

iannesbitt added a commit that referenced this issue Nov 10, 2023

changing logging levels for #47

836ad07

iannesbitt added a commit that referenced this issue Nov 10, 2023

changing logging levels for #47

606b46a

iannesbitt added a commit that referenced this issue Nov 10, 2023

Merge pull request #48 from DataONEorg/bugfix-47

7588204

iannesbitt added the v0.1.1 label Nov 13, 2023

iannesbitt added this to the 0.1.1 milestone Nov 13, 2023

iannesbitt closed this as completed Nov 13, 2023

iannesbitt mentioned this issue Nov 16, 2023

Release v0.1.1 #44

Merged
