Author normalization for last-name only searches overly greedy #194

Open
aaccomazzi opened this issue Nov 22, 2022 · 4 comments

Comments

@aaccomazzi
Member

A search for author:"Gaia Collaboration" ends up finding all papers with "Collaboration" in their author field. This appears to be caused by the author-name normalization that is applied when the query string does not contain a comma. The intent of this normalization is for the parser to rearrange the tokens so that a search for author:"First Last" also returns results which match author:"Last, First".

Here is the output from the solr console in debug mode:

author:gaia collaboration, | author:gaia collaboration,* | author:collaboration, gaia | author:collaboration, gaia * | author:collaboration, g | author:collaboration, g * | author:collaboration, | author:collaboration,*

Note the term author:collaboration,* in this output; it should not be included in the search.
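
To make the expansion easier to reason about, here is a minimal Python sketch that reproduces the eight clauses from the debug output for a comma-less, multi-token name. It is a reconstruction inferred from that output only, not the parser's actual code; the function name `expand_author_query` is hypothetical.

```python
def expand_author_query(name: str) -> list[str]:
    """Hypothetical reconstruction of the comma-less author expansion.

    Inferred from the debug output quoted above; not the real parser.
    """
    tokens = name.lower().split()
    if len(tokens) < 2:
        raise ValueError("this sketch only covers multi-token names")
    *given, last = tokens
    first = " ".join(given)   # everything before the final token
    initial = given[0][0]     # first letter of the first given token
    whole = " ".join(tokens)  # the full string treated as a last name
    return [
        f"author:{whole},",
        f"author:{whole},*",
        f"author:{last}, {first}",
        f"author:{last}, {first} *",
        f"author:{last}, {initial}",
        f"author:{last}, {initial} *",
        f"author:{last},",    # bare last name
        f"author:{last},*",   # bare last name plus wildcard (the greedy clause)
    ]


print(" | ".join(expand_author_query("Gaia Collaboration")))
```

The final clause is the one that matches every author string beginning with "collaboration,".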

@kelockhart
Member

Possible option: heuristics based on known keywords (e.g. collaboration) - would need to ask curators for a list.
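
A rough sketch of what such a heuristic might look like, in Python. The keyword set below is a placeholder assumption for illustration, not a curated list, and `looks_like_collaboration` is a hypothetical helper name.

```python
# Placeholder keywords; a real list would have to come from the curators.
COLLABORATION_KEYWORDS = frozenset({"collaboration", "consortium", "team"})


def looks_like_collaboration(name: str) -> bool:
    """Return True if any token of the name is a known collaboration keyword."""
    tokens = name.lower().replace(",", " ").split()
    return any(token in COLLABORATION_KEYWORDS for token in tokens)


for candidate in ("Gaia Collaboration", "anna kelbert"):
    print(candidate, "->", looks_like_collaboration(candidate))
```

A parser could use a check like this to skip the First/Last rearrangement for names flagged as collaborations, at the cost of maintaining the keyword list.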

@aaccomazzi
Member Author

Relevant to this: we have been considering properly indexing collaborations in a separate field (although we haven't done anything about this in years). If that were the case, maybe this problem would partly go away.

But as an alternative interim solution, I'd consider dropping the last two search tokens (author:collaboration, | author:collaboration,*) which are inherited from the properly fielded author searches (Last, First) and don't apply here.
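
As an illustration of that interim idea, the Python sketch below filters the bare last-name clauses out of the expansion quoted in the debug output. The helper name `drop_surname_only_clauses` is hypothetical, and the clause list is copied from the debug output rather than produced by the real parser.

```python
def drop_surname_only_clauses(clauses: list[str], last_name: str) -> list[str]:
    """Remove the bare 'last,' and 'last,*' clauses from an expanded query."""
    unwanted = {f"author:{last_name},", f"author:{last_name},*"}
    return [clause for clause in clauses if clause not in unwanted]


# Clauses as reported by the Solr debug output for author:"Gaia Collaboration".
debug_clauses = [
    "author:gaia collaboration,",
    "author:gaia collaboration,*",
    "author:collaboration, gaia",
    "author:collaboration, gaia *",
    "author:collaboration, g",
    "author:collaboration, g *",
    "author:collaboration,",
    "author:collaboration,*",
]

print(" | ".join(drop_surname_only_clauses(debug_clauses, "collaboration")))
```

The remaining six clauses still cover both the "Gaia Collaboration" and the rearranged "Collaboration, Gaia" readings.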

@aaccomazzi
Member Author

Another example which is problematic: author:"JWST Transiting Exoplanet Community Early Release Science Team" does not find the paper 2023Natur.614..649J which has it as an author, presumably for similar reasons.
(Note: this query actually finds the paper: author:"JWST Transiting Exoplanet Community Early Release Science Team*").

@aaccomazzi
Member Author

Another case where this bug is biting us in the behind: author:"anna kelbert" returns papers written by "Mark Kelbert" because of the wildcard search (kelbert,*)
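
To show which clause is responsible, here is a small Python simulation. It approximates Solr's trailing-wildcard matching with `fnmatch.fnmatchcase` against lowercased "last, first" strings, which is an assumption about how the indexed names look; the clause list follows the same pattern as the debug output earlier in this thread, and `matching_clauses` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Expansion of author:"anna kelbert", following the pattern in the debug output.
CLAUSES = [
    "anna kelbert,",
    "anna kelbert,*",
    "kelbert, anna",
    "kelbert, anna *",
    "kelbert, a",
    "kelbert, a *",
    "kelbert,",
    "kelbert,*",
]


def matching_clauses(indexed_name: str) -> list[str]:
    """Return the clauses whose pattern matches the given indexed author name."""
    return [clause for clause in CLAUSES if fnmatchcase(indexed_name, clause)]


for name in ("kelbert, anna", "kelbert, mark"):
    print(name, "matched by", matching_clauses(name))
```

Under these assumptions, "kelbert, mark" is matched only by the kelbert,* clause, so removing that clause would exclude it without affecting the match on "kelbert, anna".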
