Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use author identifiers in import API #10110

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

pidgezero-one
Copy link
Contributor

@pidgezero-one pidgezero-one commented Dec 3, 2024

This should be squash merged

Corresponding model update pr: internetarchive/openlibrary-client#419

This strictly expands the import schema.
It is not a breaking change.
Import records that don't include author IDs will continue to work as they currently do.

Closes #9448
Closes #9411

Technical

  • Adds support for author identifiers in import records.
    • If the import record for a book includes an author that has an "ol_id" property, the import API will attempt to find an author that matches that OL ID.
    • If the import does not include an ol_id field OR includes an ol_id field that doesn't match any existing authors, then if the import record for a book includes an author that has a "remote_ids" property, the import API will attempt to find an existing author that matches the most remote IDs within the record.
      • Q: Should the case of specifying an ol_id that doesn't exist our DB be an error that should reject the import?
    • If the record doesn't include or match any of the above, it will continue to be author-matched based on name and birth/death date, which is how the import api already operates in production.
  • Wikisource script updates:
    • Fixes incorrect birth/death date parsing.
    • Books with no identified author, title, or publish date will not be included in the jsonl output.
    • The name formatting helper function is only used when the author's name came specifically from wikisource and not from wikidata.
      • The majority of the time, WS import records produced by this script will strictly use author info from WD. However, not every WD item corresponding to a WS book is properly linked to an author. In those cases, the script falls back to attempt getting author information from the WS api response instead. WS data is highly unstructured, so only in those cases will the name formatter be used.
    • Moved dependencies specific to the wikisource script into a separate requirements.txt file that is intended to be installed only temporarily, since they're not required for OL to run. Instructions are included for how to run the script with this consideration.
    • Adds author identifiers to its output records, since it uses the Wikidata API, which includes OL IDs and most other remote_ids.
      • WS was the easiest source for me to use for generating records that had enough information to test these additions with. Nothing in the updated author matching logic is actually specific to WS, except for the next bullet point:
    • Wikisource records are exempt from being rejected for having a 1900 publish date. I don't know if this is a good idea or not, seeking feedback on that.

Issues:

  • Importing books is successful and matching authors are being found and used as expected, however navigating to the author's page from that new book's page does not show that new book on the author's page. Solr updater delay, it appeared after a while!

Testing

I put the entire output of the wikisource script into /import/batch/new.

Stakeholders

@cdrini @Freso

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

@pidgezero-one pidgezero-one marked this pull request as ready for review February 11, 2025 17:40
@pidgezero-one pidgezero-one changed the title [WIP] feat: use author known identifiers in import API feat: use author identifiers in import API Feb 11, 2025
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open questions:

  • key vs ol_id in author import record
  • remote_ids vs identifiers in author import record
    • ^ For both of these, since there are subtle differences between eg remote_ids (authors, Dict[str, str]) and identifiers (works/editions, Dict[str, list[str]]), I think it might be easiest if we re-use the shape of our existing open library records. So remote_ids: dict[str,str] for authors, and key to hold the open library key.
  • Should any identifier conflicts cause import error?
    • As a first stab, let's err on precaution, and error on any identifier conflicts.

return authors

# Look for OL ID first.
if (key := author.get("ol_id")) and (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to name this one as key to be consistent with our book/thing records. Having the import endpoint mirror the shape of our core book records is convenient.

Suggested change
if (key := author.get("ol_id")) and (
if (key := author.get("key")) and (

)
):
# Always match on OL ID, even if remote identifiers don't match.
return get_redirected_authors([web.ctx.site.get(k) for k in reply])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use get_many here to get them all in one db request. You might have to wrap the result in list().

Suggested change
return get_redirected_authors([web.ctx.site.get(k) for k in reply])
return get_redirected_authors(list(web.ctx.site.get_many(reply))

matched_authors = []
# Get all the authors that match any incoming identifier.
for identifier, val in identifiers.items():
queries.append({"type": "/type/author", f"remote_ids.{identifier}~": val})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this piece we want the exact match for the identifier; the ~ will parse things like * as wildcards. eg identifier~=foo* will find all authors with IDs starting with foo.

Suggested change
queries.append({"type": "/type/author", f"remote_ids.{identifier}~": val})
queries.append({"type": "/type/author", f"remote_ids.{identifier}": val})

for query in queries:
if reply := list(web.ctx.site.things(query)):
matched_authors.extend(
get_redirected_authors([web.ctx.site.get(k) for k in reply])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another spot for get_many

Suggested change
get_redirected_authors([web.ctx.site.get(k) for k in reply])
get_redirected_authors(list(web.ctx.site.get_many(reply))

matched_authors.extend(
get_redirected_authors([web.ctx.site.get(k) for k in reply])
)
matched_authors = uniq(matched_authors)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how Thing implements hashable (since uniq puts the items in a set), but this might be slightly more performant.

Suggested change
matched_authors = uniq(matched_authors)
matched_authors = uniq(matched_authors, key=lambda thing: thing.key)

"death_date": author.death_date,
**(
{"birth_date": author.birth_date}
if author.birth_date is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if author.birth_date is not None
if author.birth_date

),
**(
{"death_date": author.death_date}
if author.death_date is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if author.death_date is not None
if author.death_date

{
"identifiers": author.identifiers,
}
if len(author.identifiers.keys()) > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(author.identifiers.keys()) > 0
if author.identifiers

if len(author.identifiers.keys()) > 0
else {}
),
**({"ol_id": author.ol_id} if author.ol_id is not None else {}),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**({"ol_id": author.ol_id} if author.ol_id is not None else {}),
**({"ol_id": author.ol_id} if author.ol_id else {}),

val = obj[id]["value"]
if id == "youtube" and val[0] != "@":
val = f'@{val}'
contributor.identifiers[id] = obj[id]["value"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
contributor.identifiers[id] = obj[id]["value"]
contributor.identifiers[id] = val

@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed]
Projects
None yet
2 participants