Remaining issues/lack of annotation consistency #638

Open
kermitt2 opened this issue Aug 30, 2019 · 4 comments

@kermitt2
Member

kermitt2 commented Aug 30, 2019

Here are some remaining issues we observed in the current annotation scheme:

  • So far we have kept the type name creator, but in practice we almost always annotate the "publisher" of the software here; a person name is exceptional. We could therefore rename this annotation to publisher, given that the annotated entities have not always created the software but may simply commercialize it (for instance, IBM is not the creator of SPSS; it acquired SPSS Inc.).

  • Sometimes the name of the software publisher is used to refer to the software itself (PMC3534176):

calculates the area on the binary images that can be produced by <rs type="software">MathWorks</rs>.

MathWorks is the company developing MATLAB (the correct name was introduced at the beginning of the paper, but a strange shift in referring expressions happened in the middle of the paper!). It is hard to decide how to annotate this case.

  • It is a bit difficult to annotate the full software name (with acronym) when it is not continuous, for example:
signals detected using <rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs> version <rs corresp="#PMC2649809-software-35" type="version">5</rs> software (MAS5) for each probe were averaged over 21 caudate nucleus.

We leave "MAS5" (the acronym of "Affymetrix microarray suite") unannotated although it could be valuable for disambiguation. Currently, software names are always treated as one continuous chunk.

As an improvement, we could use non-continuous software name annotation like this:

signals detected using <rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs> version <rs corresp="#PMC2649809-software-35" type="version">5</rs> software (<rs corresp="#PMC2649809-software-35" type="software">MAS5</rs>) for each probe were averaged over 21 caudate nucleus.
  • The "command" or "command line" part of some frameworks is sometimes mentioned directly but is not uniformly annotated, e.g.:
We used the <rs type="software">MATLAB</rs> command <rs type="software">fmin- search</rs> with multiple starting points to compute the maximum likelihood estimate for this value.

Thus, linear regression with robust standard errors using the <rs type="software">STATA</rs> command "cluster (cluster variable)"was used-which relaxes the independence assumption and requires only that the observations should be independent across the clusters (STATA 2013).

We have seen the command encoded as a separate software entity (first example above), sometimes merged with the framework into a single annotation, and sometimes left out so that only the framework is annotated (second example above). This case is infrequent and we have not yet fixed an annotation rule for it.

  • We normally distinguish software publisher and software name when used in combination. For instance:
<rs corresp="#PMC0000000-software-1" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC0000000-software-1">Excel</rs>

However, we have so far not handled the "GraphPad Prism" case, where the name of the software is actually Prism and its publisher is GraphPad, so it should normally be annotated like the "Microsoft Excel" case. It is currently annotated as a single software name:

<rs id="software-1" type="software">GraphPad Prism</rs> <rs corresp="#software-1" type="version">5</rs> software (<rs corresp="#software-1" type="creator">GraphPad Software, Inc</rs>., La Jolla, CA, USA).

Similarly, "Lotus Notes" is always identified as such, and not as "Notes" from Lotus Inc. (it is now called IBM Notes, but that is another story). So inconsistencies remain here for the moment.
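
The xml:id/corresp links used in the examples above can be resolved with a few lines of Python. This is only a minimal sketch: it wraps the non-continuous "MAS5" example in a hypothetical <p> element so that it parses on its own, and it assumes the elements carry no namespace (the packaged corpus file likely uses the TEI namespace, in which case the tag name would need to be qualified). The function name group_mentions is illustrative, not part of any existing tool.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# xml:id as ElementTree exposes it ({namespace}local notation).
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The example sentence, wrapped in a hypothetical <p> so it is well-formed XML.
SNIPPET = (
    '<p>signals detected using '
    '<rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs> '
    'version <rs corresp="#PMC2649809-software-35" type="version">5</rs> '
    'software (<rs corresp="#PMC2649809-software-35" type="software">MAS5</rs>) '
    'for each probe were averaged over 21 caudate nucleus.</p>'
)

def group_mentions(root):
    """Map each software xml:id to {annotation type: [annotated texts]}."""
    groups = defaultdict(lambda: defaultdict(list))
    for rs in root.iter("rs"):
        # The head annotation carries xml:id; attached ones point to it via corresp.
        key = rs.get(XML_ID) or (rs.get("corresp") or "").lstrip("#")
        if key:
            groups[key][rs.get("type")].append("".join(rs.itertext()).strip())
    return {key: dict(by_type) for key, by_type in groups.items()}

groups = group_mentions(ET.fromstring(SNIPPET))
print(groups)
```

With the non-continuous annotation, both "Affymetrix microarray suite" and "MAS5" end up under the same software identifier, which is what makes the acronym usable for disambiguation.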

@caifand
Contributor

caifand commented Sep 8, 2019

Several responses here:

  • But we do have people's names as creators, and that is the part we definitely want to keep, to give credit to the actual software creators. Is there any automatic way to distinguish publishers from creators?
    (I can think of matching organization entity names via arcGIS API but that would cost some time.)
  • I did notice some papers in which the authors do not name the software accurately. But I think such cases are rare, and if they really degrade the dataset quality, how about just dropping them?
  • The way to annotate non-continuous software names is very cool. Is it something that can be done automatically, or will it need more input from the training set?
  • I still think subroutines should be coded separately, given that there are usually separate creators associated with them. This includes both packages in R and commands in MATLAB, i.e., in either case there should be two mentions coded, in my opinion. (in response to one of the points in Additional corrections and rules for consistent annotations #637 as well)

@kermitt2
Member Author

Hi @caifand

About the first point, I went through the creators: out of 1120 creator annotations, only 15 are "person" creators (1.3%). I marked them with an attribute @subtype="person" in the "packaged" format (https://github.com/Impactstory/software-mentions/blob/master/resources/dataset/software/corpus/all.clean.tei.xml).
I could use entity-fishing on the "software publishers" and try to link them with Wikidata (it might be richer than arcGIS and will take just a few seconds).
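
A count like this can be reproduced along the following lines. This is a hedged sketch, not the script actually used: the sample document is a hypothetical miniature (the names "ExampleTool" and "Jane Doe" are invented), and it assumes the annotations are rs elements in the TEI namespace with type="creator" and an optional subtype attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace in ElementTree's {namespace}tag notation (assumed for the corpus).
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature sample standing in for the real corpus file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><rs type="creator">Microsoft</rs> <rs type="software">Excel</rs></p>
<p><rs type="software">SPSS</rs> (<rs type="creator">IBM</rs>)</p>
<p><rs type="software">ExampleTool</rs> by <rs type="creator" subtype="person">Jane Doe</rs></p>
</body></text></TEI>"""

def creator_subtypes(root):
    """Count creator annotations by subtype; a missing subtype is 'unspecified'."""
    return Counter(
        rs.get("subtype", "unspecified")
        for rs in root.iter(TEI + "rs")
        if rs.get("type") == "creator"
    )

counts = creator_subtypes(ET.fromstring(SAMPLE))
total = sum(counts.values())
print(counts, f"{counts['person'] / total:.1%} person creators")
```

For the real file, the root could be obtained with ET.parse(path).getroot() instead of ET.fromstring(SAMPLE).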

@caifand
Contributor

caifand commented Sep 17, 2019

Cool, thanks! By the way, how do you work with tei xml? In python?

@kermitt2
Member Author

Yes, Python has nice libraries for reading and manipulating XML (much easier to use than the Java ones, I think): for instance ElementTree, which is part of the standard library, or lxml, which requires an extra dependency but is more complete.

That said, working with XML in general remains painful by design ;)
But when it comes to representing a complete structured document, XML can't really be avoided imho.
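
As a rough illustration of the two libraries, the sketch below uses lxml when it is installed and falls back to ElementTree otherwise; the calls used here (fromstring, findall with a prefix map, itertext) behave the same way in both. The sample document is hypothetical, adapted from the MATLAB example earlier in this thread, and the TEI namespace is an assumption about the corpus.

```python
try:
    from lxml import etree as ET  # optional third-party dependency
except ImportError:
    import xml.etree.ElementTree as ET  # standard library fallback

# Prefix map for path expressions; the prefix "tei" is an arbitrary local choice.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document.
DOC = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    'We used the <rs type="software">MATLAB</rs> command '
    '<rs type="software">fminsearch</rs>.'
    '</p></body></text></TEI>'
)

root = ET.fromstring(DOC)

# Collect the text of every software annotation, in document order.
names = [
    "".join(rs.itertext())
    for rs in root.findall(".//tei:rs[@type='software']", NS)
]
print(names)
```

Most of the pain mentioned above comes from namespaces: without the prefix map, the path expression silently matches nothing.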
