We talked about creating a model using naive string matching. The primary use is to identify areas likely to be "mention-rich," given our finding from the manual annotation that mentions tend to cluster together in papers. The expectation is that "go" lists of known software, adjusted for well-known ambiguous phrases, can find those mention-rich chunks for further annotation. Expect decent recall, but very low precision!
To that end, commit a02b847 moved the software_lists I was playing around with into data/software_lists/. @kermitt2 is going to use those to implement a matching model, resulting in JSON files with entity_spans with resp="naive_string_match" or something similar.
Might be interesting to compare that effort against our gold standard annotations (after removing the specific strings for software names from that set), and against the trained model.
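As a rough illustration of the kind of matcher we have in mind, here is a minimal Python sketch, not the actual implementation: the whitelist file layout, the ambiguous-phrase stop-list, and the exact shape of the entity spans are all assumptions.

```python
import json
import re

def load_whitelist(paths, ambiguous=("R", "SAS", "Excel")):
    """Collect software names from plain-text lists (one name per line is assumed),
    dropping well-known ambiguous phrases. The stop-list here is purely illustrative."""
    names = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                name = line.strip()
                if name and name not in ambiguous:
                    names.add(name)
    return names

def naive_string_match(text, names):
    """Return entity spans for every exact occurrence of a known software name."""
    spans = []
    for name in names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append({
                "start": m.start(),
                "end": m.end(),
                "text": name,
                "type": "software",
                "resp": "naive_string_match",
            })
    return sorted(spans, key=lambda s: s["start"])

# usage with a tiny in-memory list; in practice the names would come from data/software_lists/
names = {"SPSS", "ImageJ", "Stata"}
print(json.dumps(naive_string_match("We analysed the data with SPSS and ImageJ.", names), indent=2))
```

Segments with a high density of such matches would then be the "mention-rich" chunks to prioritize for annotation.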
Under softcite-dataset/code/corpus/ we have a set of scripts to convert TEI files into JSON with annotations corresponding to basic matching against data/software_lists/. Documentation is available here.
The sequence would be as follows (a sketch of the first step is shown after the list):
we start from a set of PDFs (from the Softcite corpus or any new PDFs); these can be converted into TEI XML via Grobid (which must be installed, for example via its Docker image), the simplest way being to use the grobid python client
we convert the TEI XML into the JSON format via TEI2LossyJSON.py: this produces paragraph-level segments with "ref-spans"
we add software annotations: via corpus2JSON.py if we started from the Softcite corpus PDFs and want to inject the manual annotations, via enrichJSON.py to add annotations with the "naive string matching" method, or alternatively via the software mention service (which must be installed and running)
When annotations are added, we will have sentence-level segments with "ref-spans" and the added "entity-spans" for the software annotations.
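For the first step, a hedged sketch of the batch conversion with the grobid Python client (https://github.com/kermitt2/grobid_client_python); the config path, the input/output directories, and the keyword arguments below are assumptions to be checked against the client's documentation:

```python
# Batch-convert PDFs to TEI XML with the grobid Python client.
# The config file points the client at a running Grobid instance
# (for example the Docker image mentioned above).
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")
client.process(
    "processFulltextDocument",  # full-text conversion service
    "./pdfs",                   # input directory with the PDF files (assumed path)
    output="./data/tei",        # directory where the TEI XML files are written (assumed path)
    n=10,                       # number of concurrent requests
)
```

The subsequent steps then run TEI2LossyJSON.py and corpus2JSON.py / enrichJSON.py on the resulting TEI files; see the scripts' own documentation for their exact arguments.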
About the corresponding added data introduced in #665:
softcite-dataset/data/tei contains the TEI files for the Softcite corpus PDFs (obtained with Grobid).
softcite-dataset/data/json contains the JSON files with the Softcite corpus manual annotations (obtained with corpus2JSON.py).
softcite-dataset/data/json_with_whitelist contains the JSON files with the Softcite corpus manual annotations and the "naive string matching" annotations (identified with "resp": "whitelist"; obtained with enrichJSON.py).
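For orientation, here is a hedged guess at what one sentence-level segment in data/json_with_whitelist could look like, and how the whitelist matches could be pulled out; apart from "ref-spans", "entity-spans" and "resp": "whitelist", the field names and the offset convention are assumptions and should be checked against the actual files:

```python
# Illustrative shape of one segment (schema details other than "ref-spans",
# "entity-spans" and "resp" are guesses). Offsets follow Python slicing convention here.
segment = {
    "text": "All statistical analyses were performed with SPSS version 24.",
    "ref-spans": [],
    "entity-spans": [
        {"start": 45, "end": 49, "text": "SPSS", "type": "software", "resp": "whitelist"},
    ],
}

# keep only the naive-string-matching annotations, e.g. to locate "mention-rich" chunks
whitelist_spans = [e for e in segment["entity-spans"] if e.get("resp") == "whitelist"]
print(whitelist_spans)
```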