Consistency: Should we exclude creator from software_name? #641

Open
caifand opened this issue Oct 23, 2019 · 5 comments


caifand commented Oct 23, 2019

I am moving some existing issues into standalone posts to increase their visibility. I am also considering whether these additional corrections should be turned into new rules for future annotation work.

The first issue is one we've debated for some time. For software_name annotations like Microsoft Excel, GraphPad Prism, and Lotus Notes, we've annotated the creator name inside the software_name as a separate entity in post-processing. Apart from the semantic difference and the ambiguities this introduces, one big concern raised earlier by @kermitt2 is avoiding overlapping annotations, since they become knotty in TEI XML.

Currently in our dataset, GraphPad Prism is usually kept together in software_name. In some cases Microsoft is annotated separately as creator while the corresponding software_name is annotated as Excel, but we also have tricky examples like MS+Excel/MS Excel/Microsoft+Office Excel, etc. For example:

<p>All statistical analyses were performed using paired Student's t tests and <rs corresp="#PMC3025493-software-3" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC3025493-software-3">Excel</rs> or <rs type="software">Prism</rs> software packages.

Calculations were made using <rs type="software">MS Excel</rs> and are presented in Appendix 1.

used to summarise the analytic outputs using <rs corresp="#PMC5435264-software-0" type="creator">MS</rs>
          <rs type="software" xml:id="PMC5435264-software-0">Excel</rs>.
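
To make the linking in these examples concrete, here is a minimal Python sketch, using only the standard library, of how a creator annotation can be resolved to its software through the `corresp` / `xml:id` pair. The wrapping `<p>` element and the helper name `link_creators` are assumptions for illustration, not part of the actual pipeline.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# One of the examples above, wrapped in <p> so it parses as a single element.
SNIPPET = (
    '<p>used to summarise the analytic outputs using '
    '<rs corresp="#PMC5435264-software-0" type="creator">MS</rs> '
    '<rs type="software" xml:id="PMC5435264-software-0">Excel</rs>.</p>'
)


def link_creators(xml_text):
    """Return (creator, software) pairs linked by corresp -> xml:id."""
    root = ET.fromstring(xml_text)
    # Index software mentions by their xml:id.
    software = {
        rs.get(XML_ID): rs.text
        for rs in root.iter("rs")
        if rs.get("type") == "software" and rs.get(XML_ID)
    }
    pairs = []
    for rs in root.iter("rs"):
        if rs.get("type") == "creator":
            # corresp holds "#<id>"; strip the leading "#" to get the id.
            target = (rs.get("corresp") or "").lstrip("#")
            pairs.append((rs.text, software.get(target)))
    return pairs


print(link_creators(SNIPPET))
```

Because the creator and the software are sibling `<rs>` elements rather than nested ones, the link is carried entirely by the attributes and no overlapping markup is needed.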

Sometimes additional creator information accompanies the mention, and @kermitt2 treats it as the primary creator information to annotate in the current candidate release. In such cases the first occurrence of Microsoft, before Excel, is skipped. For instance:

Observed heterozygosity was estimated in Microsoft <rs type="software" xml:id="PMC4103605-software-13">Excel</rs> (<rs corresp="#PMC4103605-software-13" type="creator">Microsoft Corporation</rs>, Redmond, Washington, USA).

If we strictly allow only one annotated string per annotation field, then we need a rule that establishes this priority for future annotation (e.g., annotate the full organizational name and ignore the Microsoft before the software name). The same applies to GraphPad Prism.

Generally speaking, I think it's reasonable to separate GraphPad from Prism and to do the same for Microsoft/MS Excel. IBM Notes is perhaps only a hypothetical example, as we have just one instance of Lotus Notes in the candidate TEI XML. Even if it does occur, it seems to me that annotating separate entities is better here?

@caifand caifand changed the title Consistency: Should we include creator into the software_name annotation? Consistency: Should we include creator into software_name? Oct 23, 2019
@caifand caifand changed the title Consistency: Should we include creator into software_name? Consistency: Should we exclude creator from software_name? Oct 23, 2019

jameshowison commented Oct 29, 2019

So the rule should be:

software_name never includes a preceding publisher.

If there is an "instrument-like" citation following the software_name, then we code the creator there and not the preceding publisher.

MS <rs type="software_name">Excel</rs> (<rs type="creator">Microsoft Corporation</rs>, Redmond, WA)
<rs type="software_name">Excel</rs> by <rs type="creator">Microsoft</rs>

Otherwise we code the preceding publisher as creator.

Calculations were made using <rs type="creator">MS</rs> <rs type="software">Excel</rs>.
We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.
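
The rule above could be sketched in Python roughly as follows. This is only an illustration of the priority order; the function name `assign_creator` and its parameters are hypothetical and assume the publisher and citation strings have already been identified by an annotator or an upstream step.

```python
def assign_creator(preceding_publisher, software_name, citation_creator=None):
    """Apply the proposed rule to a single software mention.

    preceding_publisher: publisher string right before the name (e.g. "MS"), or None
    software_name: the bare software name (e.g. "Excel")
    citation_creator: creator named in a following instrument-like citation, or None
    """
    # software_name never includes the preceding publisher.
    annotation = {"software": software_name, "creator": None}
    if citation_creator:
        # A following citation takes priority; the preceding publisher stays unannotated.
        annotation["creator"] = citation_creator
    elif preceding_publisher:
        # No citation follows, so the preceding publisher is coded as creator.
        annotation["creator"] = preceding_publisher
    return annotation


print(assign_creator("MS", "Excel", "Microsoft Corporation"))
print(assign_creator("GraphPad", "Prism"))
```

The first call corresponds to the instrument-like citation case, the second to the plain preceding-publisher case.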

@jameshowison

What do you think @kermitt2?


kermitt2 commented Oct 29, 2019

These are indeed exactly the rules I tried to follow for consistency, except for the cases raised here, like GraphPad Prism and Lotus Notes, where the "publisher" name is so commonly attached to the actual software name that only after reviewing many paragraphs did I realize the rule was not being applied.

I think it makes sense however to apply the rules systematically, so having

We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.

and

<rs type="creator">Lotus</rs> <rs type="software">Notes</rs>

@kermitt2

I also have to confess a bias :)

I think I kept those few exceptions like Lotus Notes because I had in mind the problem of disambiguating/matching the software mentions against existing software knowledge bases. I know that after extracting all software name mentions, we want to deduplicate them and match them to a software "entity".

If you look at the "labels" of the Wikidata entity for Lotus Notes at https://www.wikidata.org/wiki/Q60198, you see that they all contain the publisher name. So this bias helps the matching: with only Notes as the name and Lotus in a separate field, matching becomes a bit more complicated, as we need to combine different extracted fields.
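
Combining the extracted fields for matching might look something like the Python sketch below. The helper names and the `KB_LABELS` list are hypothetical stand-ins for labels fetched from a knowledge base; no real Wikidata query is made here.

```python
def candidate_labels(software, creator=None):
    """Build the strings to try against knowledge-base labels."""
    labels = [software]
    if creator:
        # Try the combined form first, since KB labels often carry the publisher.
        labels.insert(0, f"{creator} {software}")
    return labels


def match(software, creator, kb_labels):
    """Return the first candidate found in kb_labels (case-insensitive), else None."""
    known = {label.lower(): label for label in kb_labels}
    for candidate in candidate_labels(software, creator):
        if candidate.lower() in known:
            return known[candidate.lower()]
    return None


# Hypothetical label set, standing in for labels fetched from a knowledge base.
KB_LABELS = ["Lotus Notes", "IBM Notes"]
print(match("Notes", "Lotus", KB_LABELS))
```

With separate creator and software annotations, the combined string can still be rebuilt at matching time, at the cost of this extra step.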


jameshowison commented Oct 30, 2019 via email
