feat: use author identifiers in import API #10110

pidgezero-one · 2024-12-03T02:10:48Z

This should be squash merged

Corresponding model update PR: internetarchive/openlibrary-client#419

This strictly expands the import schema.
It is not a breaking change.
Import records that don't include author IDs will continue to work as they currently do.

Closes #9448
Closes #9411

Technical

Adds support for author identifiers in import records.
- If the import record for a book includes an author that has an "ol_id" property, the import API will attempt to find an author that matches that OL ID.
- If the import record does not include an ol_id field, or includes one that doesn't match any existing author, and the author has a "remote_ids" property, the import API will attempt to find the existing author that matches the most remote IDs in the record.
  - Q: Should specifying an ol_id that doesn't exist in our DB be an error that rejects the import?
- If the record doesn't include or match any of the above, it will continue to be author-matched based on name and birth/death date, which is how the import api already operates in production.
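
The matching order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Author`, `find_author`, and the in-memory `authors` list are hypothetical stand-ins for the real database lookups, and keeping the first author on a tie is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Author:
    # Hypothetical in-memory stand-in for a stored author record.
    ol_id: str
    name: str
    remote_ids: dict = field(default_factory=dict)


def find_author(record: dict, authors: list) -> Optional[Author]:
    """Return the best existing match for an author import record, or None."""
    # 1. An explicit OL ID wins if it matches an existing author.
    ol_id = record.get("ol_id")
    if ol_id:
        for author in authors:
            if author.ol_id == ol_id:
                return author

    # 2. Otherwise pick the author sharing the most remote IDs with the record.
    #    Assumption: on a tie, the first author encountered is kept.
    remote_ids = record.get("remote_ids") or {}
    best, best_count = None, 0
    for author in authors:
        count = sum(
            1 for k, v in remote_ids.items() if author.remote_ids.get(k) == v
        )
        if count > best_count:
            best, best_count = author, count
    if best is not None:
        return best

    # 3. Fall back to name matching (the real importer also compares
    #    birth/death dates; that part is omitted here for brevity).
    name = record.get("name")
    for author in authors:
        if author.name == name:
            return author
    return None
```

A record whose ol_id is unknown still falls through to step 2, which is the behaviour the open question above asks about.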
Wikisource script updates:
- Fixes incorrect birth/death date parsing.
- Books with no identified author, title, or publish date will not be included in the jsonl output.
- The name formatting helper function is only used when the author's name came specifically from wikisource and not from wikidata.
  - The majority of the time, WS import records produced by this script will strictly use author info from WD. However, not every WD item corresponding to a WS book is properly linked to an author. In those cases, the script falls back to attempt getting author information from the WS api response instead. WS data is highly unstructured, so only in those cases will the name formatter be used.
- Moved dependencies specific to the wikisource script into a separate requirements.txt file that is intended to be installed only temporarily, since they're not required for OL to run. Instructions are included for how to run the script with this consideration.
- Adds author identifiers to its output records, since it uses the Wikidata API, which includes OL IDs and most other remote_ids.
  - WS was the easiest source for me to use for generating records that had enough information to test these additions with. Nothing in the updated author matching logic is actually specific to WS, except for the next bullet point:
- Wikisource records are exempt from being rejected for having a 1900 publish date. I don't know whether this is a good idea and am seeking feedback on it.
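
The rule that books with no identified author, title, or publish date are left out of the jsonl output could look roughly like this. It is only a sketch: the field names (`authors`, `title`, `publish_date`) and the helpers `is_importable` and `to_jsonl` are assumptions for illustration, not the script's actual code.

```python
import json

# Assumed field names; the real script may use different keys.
REQUIRED_FIELDS = ("authors", "title", "publish_date")


def is_importable(record: dict) -> bool:
    """True only if the record has an author, a title, and a publish date."""
    return all(record.get(f) for f in REQUIRED_FIELDS)


def to_jsonl(records: list) -> str:
    """Serialize importable records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records if is_importable(r))
```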

Issues:

Importing books succeeds, and matching authors are found and used as expected. However, navigating to the author's page from the new book's page did not initially show the new book there. This turned out to be Solr updater delay: the book appeared after a while!

Testing

I put the entire output of the wikisource script into /import/batch/new.

Stakeholders

@cdrini @Freso

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

for more information, see https://pre-commit.ci

…ver hardcoded IDs


…r own file


cdrini

Open questions:

key vs ol_id in author import record
remote_ids vs identifiers in author import record
- ^ For both of these, since there are subtle differences between e.g. remote_ids (authors, Dict[str, str]) and identifiers (works/editions, Dict[str, list[str]]), I think it might be easiest if we re-use the shape of our existing Open Library records. So remote_ids: dict[str, str] for authors, and key to hold the Open Library key.
Should any identifier conflicts cause import error?
- As a first stab, let's err on the side of caution and raise an error on any identifier conflict.
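
A rough sketch of what that could look like, purely for discussion. `AuthorIdentifierConflict` and `check_remote_ids` are hypothetical names, and treating "conflict" as "same identifier type, different value" is an assumption about what should count.

```python
class AuthorIdentifierConflict(Exception):
    """Hypothetical error: an import record disagrees with a stored author."""


def check_remote_ids(incoming: dict, existing: dict) -> None:
    """Raise if any identifier present on both sides has a different value."""
    for id_type, value in incoming.items():
        if id_type in existing and existing[id_type] != value:
            raise AuthorIdentifierConflict(
                f"{id_type}: record has {value!r}, author has {existing[id_type]!r}"
            )


# Proposed author shape: `key` holds the Open Library key, and
# `remote_ids` maps each identifier type to a single string.
author_record = {
    "name": "Example Author",
    "key": "/authors/OL1A",
    "remote_ids": {"wikidata": "Q1"},
}
```

Identifiers that appear on only one side are not treated as conflicts in this sketch.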

cdrini · 2025-02-20T17:37:37Z