examples: add author_comparator simple demo #27

Open
wants to merge 1 commit into
base: master

Conversation

chris-asl

Signed-off-by: Chris Aslanoglou <chris.aslanoglou@gmail.com>


I've created this example so that we can see the behaviour of the AuthorComparator on top of inspire-json-merger.
To make it easier to identify whether the comparator matched the names, I changed the merger_config for this example.

@michamos
Contributor

michamos commented Oct 6, 2017

For context: @ksachs @jmartinm the AuthorComparator is something that runs deep inside the merger and cannot currently be used easily on its own. Here @chris-asl and @ammirate configured the json-merger, which uses the comparator internally, to demonstrate in isolation that it works.

@chris-asl chris-asl force-pushed the examples-author-comparator branch from 007aeaa to 8775173 Compare October 8, 2017 18:06
@ksachs

ksachs commented Oct 9, 2017

Do we agree on what should match and what shouldn't?
Do you take name-variants into account?
What about initials vs full first name?
Can we deal with (UTF8) special characters?

E.g.
'Zwikel, C.' should match
(can we make 'Zwikel, Celine' match)

"full_name": "Zwikel, C\u00e9line", 
"name_variations": [
    "C Zwikel", 
    "C\u00e9line Zwikel", 
    "Zwikel", 
    "Zwikel C", 
    "Zwikel C\u00e9line", 
    "Zwikel, C", 
    "Zwikel, C\u00e9line"
]

'Gaspari, Massimo' should match (it should be better than not matching)

"full_name": "Gaspari, M.", 
"name_variations": [
    "Gaspari", 
    "Gaspari M", 
    "Gaspari, M", 
    "M Gaspari"
]

'Gaspari, Massimo' should not match

"full_name": "Gaspari, M.", 
"name_variations": [
    "Gaspari", 
    "Gaspari M", 
    "Gaspari Maria", 
    "Gaspari, M", 
    "Gaspari, Maria", 
    "M Gaspari",
    "Maria Gaspari"
]
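
On the special-characters question, here is a minimal sketch of how accented and unaccented spellings could be folded to a common form before comparison. This is not what the AuthorComparator does internally (that is not shown in this thread); `fold_name` is a hypothetical helper that relies only on the standard library.

```python
import json
import unicodedata


def fold_name(name):
    """Lower-case a name and strip combining accents (é -> e, ü -> u)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


# JSON escapes such as \u00e9 decode to ordinary characters on load.
record = json.loads('{"full_name": "Zwikel, C\\u00e9line"}')

print(fold_name(record["full_name"]))  # zwikel, celine
print(fold_name("Zwikel, Celine") == fold_name(record["full_name"]))  # True
```

Note that this only removes diacritics. A transliteration such as "Suenje" for "Sünje" would still need separate handling.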

@kaplun

kaplun commented Oct 9, 2017

"full_name": "Gaspari, M.", 
"name_variations": [
    "Gaspari", 
    "Gaspari M", 
    "Gaspari Maria", 
    "Gaspari, M", 
    "Gaspari, Maria", 
    "M Gaspari",
    "Maria Gaspari"
]

How can the full_name sport initials, when the name_variations have Maria?

@michamos
Contributor

michamos commented Oct 9, 2017

AFAIK, name_variations are used for searching on author as opposed to exactauthor but are not used by the AuthorComparator. They are derived from full_name automatically, so do not contain additional information: https://github.com/inspirehep/inspire-next/blob/f64f0eee6a5a2c0eb996352f1f10371eb64cf530/inspirehep/modules/records/receivers.py#L369-L390.
So only the full_name needs to be considered. And Gaspari, Massimo should match Gaspari, M. but not Gaspari, Maria indeed.
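
To illustrate why the variations carry no extra information, here is a rough sketch that rebuilds the lists quoted above from `full_name` alone. It is not the code behind the link; `derive_name_variations` is a hypothetical stand-in that assumes a single "Last, First" name.

```python
def derive_name_variations(full_name):
    """Return sorted spelling variants of a 'Last, First' name (illustrative only)."""
    last, _, first = (part.strip() for part in full_name.partition(","))
    variations = {last}
    if first:
        # Use both the spelled-out first name and its initial.
        for given in (first.rstrip("."), first[0]):
            variations.update({
                f"{last} {given}",
                f"{last}, {given}",
                f"{given} {last}",
            })
    return sorted(variations)


print(derive_name_variations("Gaspari, M."))
print(derive_name_variations("Gaspari, Maria"))
```

Under these assumptions, "Gaspari, M." can only ever yield variants with the initial, so a list containing "Maria" implies that the `full_name` was "Gaspari, Maria".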

@ksachs

ksachs commented Oct 9, 2017

That was a bad made-up example. Most likely it would be:

"full_name": "Gaspari, Maria",   
"name_variations": [
    "Gaspari", 
    "Gaspari M", 
    "Gaspari Maria", 
    "Gaspari, M", 
    "Gaspari, Maria", 
    "M Gaspari",
    "Maria Gaspari"
]

@ksachs

ksachs commented Oct 9, 2017

If name variants cannot be used in the AuthorComparator, maybe we should strip everything to initials for the matcher. Better to accept some false positives than to get false negatives.
I was overthinking it: I thought name variants were taken from the BAI cluster.

@kaplun

kaplun commented Oct 9, 2017

Yeah, name_variations are just automatically derived from full_name, so they can be ignored. The author_comparator will compare only real information.

@michamos
Contributor

michamos commented Oct 9, 2017

Do you really want Massimo and Maria to match? Those are pretty different names.

@ksachs

ksachs commented Oct 9, 2017

If possible I want
a match for 'M.' - 'Massimo'
no match for 'Maria' - 'Massimo'
But if this is not possible, I would rather accept a match for 'Maria' - 'Massimo' than miss 'M.' - 'Massimo'.
Maybe add an example like ['Ellis, J.', 'Ellis, John'] to the code.
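
The preference above can be written down as a small rule. The following is only a sketch under simplifying assumptions (a single first name, "Last, First" input) and is not the AuthorComparator's logic; `split_name`, `first_names_compatible`, `authors_match` and the `initials_only` flag are hypothetical names.

```python
def split_name(full_name):
    """Split 'Last, First' into case-folded (last, first) without a trailing dot."""
    last, _, first = full_name.partition(",")
    return last.strip().casefold(), first.strip().rstrip(".").casefold()


def first_names_compatible(a, b):
    """An initial matches any name starting with it; full names must be equal."""
    if not a or not b:
        return True  # a missing first name cannot contradict anything
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def authors_match(name_a, name_b, initials_only=False):
    last_a, first_a = split_name(name_a)
    last_b, first_b = split_name(name_b)
    if initials_only:
        # Fallback: accept false positives rather than miss 'M.' vs 'Massimo'.
        first_a, first_b = first_a[:1], first_b[:1]
    return last_a == last_b and first_names_compatible(first_a, first_b)


print(authors_match("Gaspari, M.", "Gaspari, Massimo"))     # True
print(authors_match("Gaspari, Maria", "Gaspari, Massimo"))  # False
print(authors_match("Gaspari, Maria", "Gaspari, Massimo", initials_only=True))  # True
print(authors_match("Ellis, J.", "Ellis, John"))            # True
```

The default mode gives the preferred behaviour. `initials_only=True` is the looser fallback that trades false positives for fewer false negatives.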

@chris-asl chris-asl force-pushed the examples-author-comparator branch from 8775173 to 1f038c9 Compare October 9, 2017 13:38
@chris-asl
Author

chris-asl commented Oct 9, 2017

So, by adding the examples by @ksachs, we're getting these results:

[match]: "Cox, Brian", "Cox, Brian"
[match]: "O Brien, Dara", "O Briain, Dara"
[NO match]: "O Brien, Dara", "Christos Aslanoglou"
[match]: "John, Ellis", "John ellis"
[match]: "John, Ellis", "John, Richard Ellis"
[match]: "John, Ellis", "John, R. Ellis"
[match]: "John, Ellis", "Ellis, R. John"
[match]: "John, Ellis", "Ellis, R. J"
[match]: "John, Ellis", "j, r. ellis"
[NO match]: "John, Ellis", "j, r, ellis"
[match]: "John, Ellis", "j r ellis"
[match]: "John, Ellis", "j richard ellis"
[match]: "J Ellis", "john r e"
[match]: "Ellis J", "john r e"
[match]: "Ellis, J.", "Ellis, John"
[match]: "John R. Ellis", "j richard ellis"
[match]: "John R. Ellis", "j r ellis"
[match]: "John R. Ellis", "john rich. e"
[match]: "Παπαδόπουλος, Γ", "Παπαδόπουλος Γιώργος"
[match]: "Παπαδόπουλος, Γ", "Παπαδόπουλος Μ Γιώργος"
[match]: "Παπαδόπουλος, Γ", "Παπαδόπουλος Μιχάλη Γ"
[NO match]: "Jhon Brian", "John Brian"
[NO match]: "John Brian", "John C"
[match]: "Gaspari", "Gaspari, Maria"
[match]: "Gaspari M", "Gaspari, Maria"
[match]: "Gaspari Maria", "Gaspari, Maria"
[match]: "M Gaspari", "Gaspari, Maria"
[match]: "Maria Gaspari", "Gaspari, Maria"
[match]: "Sunje, Dallmeier-Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Suenje, Dallmeier-Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Sünje, Dallmeier-Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Sunje, Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Sünje, Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Suenje, Tiessen", "Dallmeier-Tiessen, Sunje"
[match]: "Sunje, Dallmeier", "Dallmeier-Tiessen, Sunje"
[match]: "Sünje, Dallmeier", "Dallmeier-Tiessen, Sunje"
[match]: "Suenje, Dallmeier", "Dallmeier-Tiessen, Sunje"
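
For anyone who wants to produce a listing in this format with a different matcher, here is a minimal sketch of the reporting loop. It is not the example script from this PR: `report` is a hypothetical helper, and `exact_match` is a placeholder predicate standing in for the configured merger.

```python
def report(pairs, is_match):
    """Format one '[match]' / '[NO match]' line per pair of names."""
    lines = []
    for left, right in pairs:
        label = "match" if is_match(left, right) else "NO match"
        lines.append(f'[{label}]: "{left}", "{right}"')
    return lines


def exact_match(left, right):
    # Placeholder predicate: case-insensitive equality only.
    return left.casefold() == right.casefold()


pairs = [
    ("Cox, Brian", "Cox, Brian"),
    ("O Brien, Dara", "Christos Aslanoglou"),
]
print("\n".join(report(pairs, exact_match)))
```

Any callable that takes two names and returns a boolean can be passed in place of `exact_match`.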

@michamos
Contributor

michamos commented Oct 9, 2017

@chris-asl she meant Massimo and Maria as first names, so "Gaspari, Massimo", "Gaspari, Maria" and "Gaspari, M.".

@kaplun

kaplun commented Oct 9, 2017

That's quite strong matching. What about:
"Sunje, Dallmeier-Tiessen", "Sünje, Dallmeier-Tiessen", "Suenje, Dallmeier-Tiessen", "Sunje, Tiessen", "Sünje, Tiessen", "Suenje, Tiessen", "Sunje, Dallmeier", "Sünje, Dallmeier", "Suenje, Dallmeier",

@ksachs

ksachs commented Oct 9, 2017

Is this the same AuthorComparator I'm using in `invenio-matcher-benchmark`?
I get no match for "Gaspari, M", "Gaspari, Massimo".

[match]: "O Brien, Dara", "O Briain, Dara"
How do you know this is a spelling variant and not a different name?

@chris-asl
Author

chris-asl commented Oct 9, 2017

@kaplun I've updated my comment with your suggestions.

@ksachs: As discussed with @ammirate, the O Brien, Dara vs O Briain, Dara pair was merely a test to see how the matcher behaves. With the current configuration of the matcher, along with my change here, it seems that the matcher "thinks" that's a match.

Signed-off-by: Chris Aslanoglou <chris.aslanoglou@gmail.com>
Co-Authored-By: Antonio Cesarano <cesarano2607@gmail.com>
@chris-asl chris-asl force-pushed the examples-author-comparator branch from 1f038c9 to b302526 Compare October 9, 2017 14:53
MJedr pushed a commit to MJedr/inspire-json-merger that referenced this pull request Sep 13, 2021
* Removes incorrect entry points that causes Flask application to crash
  when invenio-base is installed. (closes inspirehep#27)

Signed-off-by: Javier Martin Montull <javier.martin.montull@cern.ch>