Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update OpenAlex translators #3379

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

thebluepotato
Copy link
Contributor

In OpenAlex JSON.js:

  • Fixed the JSON translator expecting keys that were sometimes null (such as title or primary_location).
  • Saves the OpenAlex identifier in the Extra field instead of the whole URL
  • Saves the item's URL as well
  • Adds a DOI fallback in case the item's data was seriously lacking

In OpenAlex.js:

  • Add support for detecting and parsing multiple identifiers at once
  • Improve OpenAlex ID matching and cleaning

@@ -114,7 +114,7 @@ module.exports = {
}
else if (testCase.type === 'search') {
// console.log(JSON.stringify(testCase.input))
const expected = ['DOI', 'ISBN', 'PMID', 'arXiv', 'identifiers', 'contextObject', 'adsBibcode', 'ericNumber', 'openAlex'];
const expected = ['DOI', 'ISBN', 'PMID', 'arXiv', 'identifiers', 'contextObject', 'adsBibcode', 'ericNumber', 'OpenAlex'];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I'd align with the way it's usually written and displayed. For instance, arXiv is not lowercase, DOI is uppercase, etc. I'm fine with anything in the end, but wouldn't it be better if the label was case insensitive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-camelcase key names for DOI, ISBN, and PMID have caused us some grief, which is why we haven't done that with adsBibcode (ADS Bibcode) or ericNumber (ERIC number).

Comment on lines +62 to +70
if (items[i].OpenAlex && (oaID = cleanOpenAlexID(items[i].OpenAlex))) {
oaIDs.push(oaID);
}
else if (items[i].openAlex && (oaID = cleanOpenAlexID(items[i].openAlex))) {
oaIDs.push(oaID);
}
else if (typeof items[i] == 'string' && (oaID = cleanOpenAlexID(items[i]))) {
oaIDs.push(oaID);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely no need to support multiple capitalizations or string inputs. Callers just need to pass valid input, which is of the form { openAlex: '...' }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to support only one capitalisation (preferably OpenAlex). However for just the string, I believe the arXiv translator at least also supports strings in addition to objects.

Copy link
Member

@AbeJellinek AbeJellinek Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However for just the string, I believe the arXiv translator at least also supports strings in addition to objects.

Some search translators that have been around for a very long time do, but arXiv doesn't, at least as of the current version. Zotero will always pass a single object, so there's no need to handle anything but single objects anymore.

(The only exception is if we want to allow another translator to call the search translator in some specific nonstandard way, but I think there's usually a way to design around that.)

OpenAlex.js Outdated Show resolved Hide resolved
Comment on lines +197 to +199
let openAlexID = data.ids.openalex.match(/W\d+$/i)[0];
item.setExtra("OpenAlex", openAlexID);
item.libraryCatalog = "OpenAlex";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasoning for this? I think we settled on using openalex.org/ URLs as canonical identifiers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what you settled on in the past, but my suggestion is to focus on the identifier. Sure, OpenAlex stores them with the URL, but even their documentation shows that the actual identifier for works is just the part that starts with W. In the end, it's the same reason why we canonically store just the identifier for arXiv or DOI, even though you can build the URLs with them. Lastly, it's just more terse and legible IMO. In any event, the translator supports searching with both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we used the version with domain is that that's defined as the OpenAlex ID by OpenAlex:
https://docs.openalex.org/how-to-use-the-api/get-single-entities#the-openalex-id

The OpenAlex ID is the primary key for all entities. It's a URL shaped like this: https://openalex.org/<OpenAlex_key>. Here's a real-world example:

https://openalex.org/W2741809807

I'm not supper opposed to handling this differently, and I can see this looks like it's inconsistent, but for DOIs, PMIDs, and arXiv IDs, the ID explicitly does not include the URL schema, so this isn't the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it help if we internally agreed that we're not storing the OpenAlex ID per se but the OpenAlex key?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what we would be doing, but I'm not convinced going with the key is the right choice. Legibility is obviously a matter of taste, but storing identifiers in a format in which they're self-identifying seems like a good idea: DOIs and ISBNs are to by virtue of their format; arXiv IDs canonically include the arxiv: namespace (yes, we do store the openAlex: label in Extra, but that's as part of a key:value pair, not as part of the identifier). PMIDs are a constant problem. OpenAlex keys are marginally better because of the W, but only marginally so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with the URLs for backwards compatibility and disambiguation. We can allow searching by bare W[...] IDs, of course.

@AbeJellinek
Copy link
Member

Thanks! Most looks good, but I'm confused about a few changes (above).

@thebluepotato
Copy link
Contributor Author

Regarding the label: to give more context: I contributed to the diegodlh/zotero-cita#300 extension, where we fetch and store identifiers and references from multiple sources, including OpenAlex. There, we display the identifiers (both "officially" supported ones like arXiv, DOI, ISBN and unofficial ones like OpenAlex, Semantic Scholar's CorpusID, etc.) in editable fields, with a button to open the corresponding URL (just like the standard DOI field). These extra identifiers are stored in the Extra field, and there, capitalization should be somewhat uniform. My eyes bleed I'm slightly put off seeing openAlex: xxx there instead of OpenAlex: xxx.

Regarding storing the URL, the documentation is sometimes a little vague about the key/ID distinction. As they do consider the Key the "unique primary key that identifies a given resource in our database", I think the value stored should definitely be that one. For ease-of-use of the URL, as I mentioned in our use case, the extension constructs the link automatically (just like Zotero does for the DOI). As to the "recognizability" of the identifier, I see your point, but I'm not sure that I agree with regards to "disregarding" the field label. If I know nothing of English literature, seeing the field value "Jane Eyre" I would identify it as an author, and not a book title. Not to go all too hermeneutical, but context is not only helpful to give meaning, but necessary. In other words, I believe the label to be essential to give meaning to most field values, and not just some internal key in a dictionary. Lastly, while it's true that W isn't much of a distinguishing factor, it is one and as soon as you see one or two (with the field label), the association is made in your head that this is an OpenAlex key.

Sorry if I'm sounding/being a bit anal, in the end it won't change the world whatever you/we decide, I just wanted to clarify my reasoning. I'll perform the other changes requested ASAP

@AbeJellinek
Copy link
Member

AbeJellinek commented Nov 22, 2024

These extra identifiers are stored in the Extra field, and there, capitalization should be somewhat uniform. My eyes bleed I'm slightly put off seeing openAlex: xxx there instead of OpenAlex: xxx.

Totally fine - we can store the ID in Extra as OpenAlex: [...] while still using openAlex as the search field. We use Title Case in Extra and camelCase in the field name itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants