Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes miss-parsed names in AutomaticPersonsGroup #7228

Merged
merged 25 commits into from
Jan 18, 2021

Conversation

k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020

Fixes #5833.

This fixes the parsing and matching of authors' last names. It is much slower than the previous solution, and uses more RAM, but I think that is hard/impossible to avoid. However, it is possible that the caching solution can be made smarter, but I leave the decision if it should, to the reviewers 😁

An alternative solution is moving the caching to AuthorList, it will make it easier to modify/improve at a later point in time. It will make this PR take quite a bit of more time as I'd like to make AuthorList extend List in that case. See #6552 (comment) for a much longer discussion on that topic.
Given that the only thing that should be cached are the last names (and always latex-free), perhaps it has more in common with the autocomplete functionality. I have not looked into that.

  • Change in CHANGELOG.md described (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 marked this pull request as ready for review December 29, 2020 17:31
@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 changed the title [WIP] Fixes miss-parsed names in AutomaticPersonsGroup Fixes miss-parsed names in AutomaticPersonsGroup Dec 29, 2020
@tobiasdiez tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 29, 2020
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what was the problem in the end? It looks like the code is more or less equivalent to the one before, except that now the author parsing and latex2unicode conversion are in a different order (which does make sense!). Is this the only change?

Concerning the caching, I think this is not needed. The parsing of a string to an authorlist is already cached, and getting the last names with latex converted is also cached:

public String getAsLastNamesLatexFree(boolean oxfordComma) {

However, this code probably needs to be changed a bit to be usable for our purposes. I would suggest to add a getLastNameLatexFree in Author, which caches it's result. This method then can be used in the groups parsing, as well as in AuthorList.getAsLastNamesLatexFree (there is no need to have an additional cache there again).

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020

So, what was the problem in the end? It looks like the code is more or less equivalent to the one before, except that now the author parsing and latex2unicode conversion are in a different order (which does make sense!). Is this the only change?

The other change is that the previous solution relied on the existing WordKeywordGroup SearchStrategies which did not parse authors. Hence, "Turing, Allan" ended up attempting to match the group "Turing" with the last name "Turing," (note the ,). There might be other differences as well but my "contribution" is to use the same code that parses the names for the groups to check if an author matches.
Note that among the test cases it is only createSubgroupsContainingCommaSeparatedLastNames that fails without these changes.

Now that I am looking at it again, I'll see if I can extend KeywordGroup instead. It should make the changes more readable.

Concerning the caching, I think this is not needed. The parsing of a string to an authorlist is already cached, and getting the last names with latex converted is also cached:

public String getAsLastNamesLatexFree(boolean oxfordComma) {

However, this code probably needs to be changed a bit to be usable for our purposes. I would suggest to add a getLastNameLatexFree in Author, which caches it's result. This method then can be used in the groups parsing, as well as in AuthorList.getAsLastNamesLatexFree (there is no need to have an additional cache there again).

In some cases the getAsLastNamesLatexFree contain "and" between the two last authors. Given that a latex-free string is allowed to contain "and" for other reasons (e.g., {Barnes and Noble}) I'd rather avoid using a plain String.
I don't have an opinion regarding adding an Author#getLastNameLatexFree method. Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree" in which case the AUTHOR_CACHE would cache latexfree AuthorLists, but in my opinion, it is a bit of a nightmare to implement.
Should I add the Author#getLastNameLatexFree method? At the very least it would allow

.map(Author::getLastNameLatexFree)

and removing the caching for the AuthorList method.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020

Just a note... It seems that none of the groups use BibDatabase which means they all use entry.getField instead of entry.getResolvedFieldOrAlias.
I'd argue it would be out of scope for this issue, but it could lead to problems (perhaps it is unlikely to be used for the author/editor fields).

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020

I just realized that AuthorList.getAsLastNamesLatexFree uses getLastOnly, not getLast. So I can't reuse the last name cache easily in AuthorList. I think that for the AutomaticPersonGroup it makes sense that the subgroup is named Neumann and not von Neumann.

@tobiasdiez
Copy link
Member

tobiasdiez commented Dec 31, 2020

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

I completely agree! The Author class should be responsible to calculate the latex-free version of names etc, and cache its result. Maybe, the most readable version is a method Author latexfree() which returns the complete latex2unicode converted representation of the author, so you can write author.latexfree().getLast(). That's something of value even beyond this ticket!
Would also make it pretty easy to cache in a flexible way by having only Author latexfree as a cached variable. It's also pretty extensible you get automatic caching for methods like getLastOnly.

Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree"

That would be indeed a good solution, but is sadly not possible. {Barnes and Noble} and Author is sadly different to Barnes and Noble and Author. The author specification of bibtex is just a mess, and relies on so many assumptions. The best would be if people would only rely on the explicit name format of biber/biblatex: AUTHOR = {given=Arnar, family=Vigfusson}, but that's a very long way to go.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 31, 2020

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

I completely agree! The Author class should be responsible to calculate the latex-free version of names etc, and cache its result. Maybe, the most readable version is a method Author latexfree() which returns the complete latex2unicode converted representation of the author, so you can write author.latexfree().getLast(). That's something of value even beyond this ticket!
Would also make it pretty easy to cache in a flexible way by having only Author latexfree as a cached variable. It's also pretty extensible you get automatic caching for methods like getLastOnly.

Perhaps this is the good way out of this... I just have this itch to kill the Author class and make it a Map, preferably an EnumMap (thought that might not be possible)... I might be tunnel-visioning on that 😁
I'd argue it is out of scope of this issue. Rewriting the AuthorList#getAs... would be (one) of the benefits but it would be quite time consuming. Additionally, if someone wants to track down possible performance bottlenecks, it would be better in its own PR. You think opening up a new PR with Author.latexFree (Author.resolveLatex?) is the way to go (rather than continuing it here)?

Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree"

That would be indeed a good solution, but is sadly not possible. {Barnes and Noble} and Author is sadly different to Barnes and Noble and Author. The author specification of bibtex is just a mess, and relies on so many assumptions. The best would be if people would only rely on the explicit name format of biber/biblatex: AUTHOR = {given=Arnar, family=Vigfusson}, but that's a very long way to go.

In my opinion, that solution is possible. However, it would require changes to the author parser so that delimiters can be reconstructed. The surrounding braces are removed by Author, internal braces might be a bigger challenge. It would also tie in to the discussion about the mess outside the bibtex/biber/biblatex definitions of authors, if #2205 is still an issue (I have been eyeing it but not found the time to look at the details).

And thank you for multiple code review on new year as well! 🎆

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current changes are fine with me. The author caching of latexfree things can be done in a follow-up PR.

tobiasdiez
tobiasdiez previously approved these changes Jan 4, 2021
@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 dismissed tobiasdiez’s stale review via edd24bd

Oups. Sorry, I did not know that was a thing. I just wanted to play around with what went wrong with the equality and hashCode (barring my thinking 🤡 ), since it should not hurt to have more "standard" equality and hashCode methods.

tobiasdiez
tobiasdiez previously approved these changes Jan 13, 2021
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also only recently noticed this. I guess it's coming from the recent changes that PRs need to be reviewed before they are merged (that was always a standing requirement, but now github is configured for it is well). @koppor

@koppor
Copy link
Member

koppor commented Jan 18, 2021

k3KAW8Pnf7mkmdSMPHz27 dismissed tobiasdiez’s stale review via edd24bd

This is some automatic thing. Any change will dismiss a review. We want to be sure that approvals are "done right". The final approver can check whether the changes from the dismissed approval to the current one are OK.

Thus, no worries! 🎉 - And thank you for continuing on working on this.

@koppor koppor merged commit 31ced14 into JabRef:master Jan 18, 2021
Siedlerchr added a commit that referenced this pull request Jan 22, 2021
…dtask

* upstream/master:
  Upgrade citeproc to 3.x snapshot without graal (#7370)
  Fix Exception if no AzureInstrumentationKey is available (#7373)
  Update snapcraft source url (#7372)
  Fix checkstyle and adjust language
  GitBook: [master] 3 pages and 32 assets modified
  Add migration to special field (#7368)
  GitBook: [master] 5 pages and 29 assets modified
  Modify message at the duplicates found dialog (#7231)
  Fixes miss-parsed names in `AutomaticPersonsGroup` (#7228)
  Fix an issue where the password for a shared SQL database was only remembered if it was the same as the username (#7364)
  Fix harvard exporter by changing AuthorsFormatter (#7355)
  Bump styfle/cancel-workflow-action from 0.6.0 to 0.7.0 (#7363)
  Bump mockito-core from 3.7.0 to 3.7.7 (#7360)
  Bump org.beryx.jlink from 2.23.1 to 2.23.2 (#7361)
  Bump libreoffice from 7.0.3 to 7.0.4 (#7362)
  Export full month name instead of number in ms office (#7358)
Siedlerchr added a commit to koppor/jabref that referenced this pull request Jan 26, 2021
* upstream/master: (217 commits)
  Fix handling of URL in file field (JabRef#7347)
  Fix expansion of bracketed expressions in RegExpBasedFileFinder (JabRef#7338)
  Refactor unlinked files (JabRef#7209)
  Add pressing enter when the search field is focused as a way to trigger search (JabRef#7377)
  Upgrade citeproc to 3.x snapshot without graal (JabRef#7370)
  Fix Exception if no AzureInstrumentationKey is available (JabRef#7373)
  Update snapcraft source url (JabRef#7372)
  Fix checkstyle and adjust language
  GitBook: [master] 3 pages and 32 assets modified
  Add migration to special field (JabRef#7368)
  GitBook: [master] 5 pages and 29 assets modified
  Modify message at the duplicates found dialog (JabRef#7231)
  Fixes miss-parsed names in `AutomaticPersonsGroup` (JabRef#7228)
  Fix an issue where the password for a shared SQL database was only remembered if it was the same as the username (JabRef#7364)
  Fix harvard exporter by changing AuthorsFormatter (JabRef#7355)
  Bump styfle/cancel-workflow-action from 0.6.0 to 0.7.0 (JabRef#7363)
  Bump mockito-core from 3.7.0 to 3.7.7 (JabRef#7360)
  Bump org.beryx.jlink from 2.23.1 to 2.23.2 (JabRef#7361)
  Bump libreoffice from 7.0.3 to 7.0.4 (JabRef#7362)
  Export full month name instead of number in ms office (JabRef#7358)
  ...

# Conflicts:
#	external-libraries.md
#	src/main/java/module-info.java
#	src/main/java/org/jabref/gui/shared/SharedDatabaseLoginDialogViewModel.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"New Group" > "Specified keywords" > "Generate groups for author last names" not working properly
5 participants