Consistency: Should we exclude creator from software_name? #641
So the rule should be:
If there is an "instrument-like" citation following the
Otherwise we code creator in the preceding publisher.
What do you think @kermitt2?
These are indeed exactly the rules I tried to follow for consistency, except for the cases raised, like GraphPad Prism and Lotus Notes, where the "publisher" name is so commonly attached to the actual software name that only after reviewing many paragraphs did I realize the rule had not been applied. I think it makes sense, however, to apply the rules systematically, so having We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>. and <rs type="creator">Lotus</rs> <rs type="software">Notes</rs>.
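To make the systematic rule concrete, here is a minimal sketch of splitting a leading creator name off a software mention and emitting the two `rs` elements shown above. The `KNOWN_CREATORS` tuple and the helper names `split_creator` and `to_tei` are hypothetical, introduced only for illustration; they are not part of the project's tooling, and a real pass would need a curated creator list.

```python
from xml.sax.saxutils import escape

# Hypothetical list of creator prefixes, for illustration only.
KNOWN_CREATORS = ("Microsoft", "MS", "GraphPad", "Lotus")


def split_creator(mention, creators=KNOWN_CREATORS):
    """Split a leading creator name off a software mention.

    Returns (creator, software); creator is None when no known
    prefix matches, in which case the mention is returned unchanged.
    """
    # Try longer prefixes first so the most specific creator wins.
    for creator in sorted(creators, key=len, reverse=True):
        prefix = creator + " "
        if mention.startswith(prefix) and len(mention) > len(prefix):
            return creator, mention[len(prefix):].strip()
    return None, mention


def to_tei(mention):
    """Render a mention as separate creator and software rs elements."""
    creator, software = split_creator(mention)
    parts = []
    if creator is not None:
        parts.append('<rs type="creator">%s</rs>' % escape(creator))
    parts.append('<rs type="software">%s</rs>' % escape(software))
    return " ".join(parts)


print(to_tei("GraphPad Prism"))
print(to_tei("Lotus Notes"))
```

Because the two elements sit side by side rather than nested, this layout avoids overlapping annotations. Variants such as MS+Excel or Microsoft+Office Excel would need extra normalization that this sketch does not attempt.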
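I have to confess also a bias :) I think I kept those few exceptions like Lotus Notes because I had in mind the problem of disambiguating/matching the software mention against existing software knowledge bases. I know that after extracting all software name mentions, we want to deduplicate them and match them to a software "entity". If you look at the "labels" for the Wikidata entity for Lotus Notes, at https://www.wikidata.org/wiki/Q60198 you see that they all contain the publisher name. So having this bias helps the matching; having Lotus separate and just Notes as the software name makes the matching a bit more complicated, as we need to combine different extracted fields.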
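The "combine different extracted fields" step could look roughly like the sketch below: when creator and software are annotated separately, the matcher first tries the recombined string and only then falls back to the bare software name. The function names and the toy label table are assumptions made for illustration; the labels are not copied from Wikidata, and a real matcher would query the knowledge base rather than a dictionary.

```python
def candidate_labels(software, creator=None):
    """Build lookup strings, most specific first."""
    labels = [software]
    if creator:
        # Recombine the separately annotated fields, e.g. "Lotus" + "Notes".
        labels.insert(0, f"{creator} {software}")
    return labels


def match_entity(software, creator, kb_labels):
    """Return the first entity id whose labels contain a candidate string.

    kb_labels maps an entity id to an iterable of labels; comparison is
    case-insensitive and exact. Returns None when nothing matches.
    """
    for candidate in candidate_labels(software, creator):
        needle = candidate.casefold()
        for entity_id, labels in kb_labels.items():
            if any(needle == label.casefold() for label in labels):
                return entity_id
    return None


# Toy knowledge base with illustrative labels (not real Wikidata data).
toy_kb = {"Q60198": ["Lotus Notes", "IBM Notes"]}

print(match_entity("Notes", "Lotus", toy_kb))
print(match_entity("Notes", None, toy_kb))
```

With the creator available, the recombined "Lotus Notes" finds the entity; with only "Notes", the exact lookup fails, which is the extra complication described above.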
Yeah, I mean the reality is that some software has the publisher in the
name, and even the publisher uses that. I definitely see that. But we
need some sort of consistency here, no?
…On Tue, Oct 29, 2019 at 6:50 PM Patrice Lopez ***@***.***> wrote:
I have to confess also a bias :)
I think I kept those few exceptions like Lotus Notes, because I had in
mind the problem of disambiguation/matching of the software mention in
existing software knowledge bases. I know that after extracting all
software name mention, we want to deduplicate them and match them to a
software "entity".
If you look at the "labels" for the Wikidata entity for *Lotus Notes*, at
https://www.wikidata.org/wiki/Q60198 you see that they all contain the
publisher name. So having this bias helps the matching, not having Lotus
and just Notes make the matching a bit more complicated as we need to
combine different extracted fields.
I am moving some existing issues into standalone posts to increase their visibility. I am also considering whether the additional corrections should be turned into new rules for future annotation work.
The first one is what we've debated for some time. For software_name annotations like Microsoft Excel, GraphPad Prism, Lotus Notes, we've annotated the creator name inside the software_name as a separate entity in post-processing. Apart from their semantic difference and the ambiguities introduced, one big concern brought up by @kermitt2 earlier is to avoid overlapping annotations, since they become knotty in TEI XML.
Currently in our dataset, GraphPad Prism is usually kept together in software_name. In some cases Microsoft is annotated separately as creator while the corresponding software_name is annotated as Excel; but we also have tricky examples like MS+Excel/MS Excel/Microsoft+Office Excel, etc., e.g.:
Sometimes additional creator information accompanies the mention, and @kermitt2 treats it as the primary creator information to be annotated in the current candidate release. The first occurrence of Microsoft before Excel is thus skipped in such cases. For instance:
If we strictly limit each annotation field to one annotated string, then we would want to set a rule here to establish this priority (e.g., annotate the full organizational name and ignore the Microsoft before the software name) for future annotation. The same applies to GraphPad Prism.
Generally speaking, I think it's reasonable to separate GraphPad from Prism and do the same for Microsoft/MS Excel. Perhaps IBM Notes is a hypothetical example, as we only have one instance of Lotus Notes in the candidate TEI XML. Even if it occurs, it seems to me that annotating separate entities here is better?