We talked about creating a model using naive string matching. The primary use is to identify areas likely to be "mention-rich," given our finding from the manual annotation that mentions tend to cluster together in papers. The expectation is that "go" lists of known software, adjusted for well-known ambiguous phrases, can find those mention-rich chunks for further annotation. Expect decent recall, but very low precision!
To that end, commit a02b847 moved the software_lists I was playing around with into data/software_lists/. @kermitt2 is going to use those to implement a matching model, resulting in JSON files with entity_spans with resp="naive_string_match" or something similar.
Might be interesting to compare that effort against our gold standard annotations (after removing the specific strings for software names from that set), and against the trained model.
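As a rough illustration of the kind of matcher we have in mind, here is a minimal Python sketch, not the actual implementation: the whitelist file layout, the ambiguous-phrase stop-list, and the exact shape of the entity spans are all assumptions.

```python
import json
import re

def load_whitelist(paths, ambiguous=("R", "SAS", "Excel")):
    """Collect software names from plain-text lists (one name per line is assumed),
    dropping well-known ambiguous phrases. The stop-list here is purely illustrative."""
    names = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                name = line.strip()
                if name and name not in ambiguous:
                    names.add(name)
    return names

def naive_string_match(text, names):
    """Return entity spans for every exact occurrence of a known software name."""
    spans = []
    for name in names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append({
                "start": m.start(),
                "end": m.end(),
                "text": name,
                "type": "software",
                "resp": "naive_string_match",
            })
    return sorted(spans, key=lambda s: s["start"])

# usage with a tiny in-memory list; in practice the names would come from data/software_lists/
names = {"SPSS", "ImageJ", "Stata"}
print(json.dumps(naive_string_match("We analysed the data with SPSS and ImageJ.", names), indent=2))
```

Segments with a high density of such matches would then be the "mention-rich" chunks to prioritize for annotation.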
Under softcite-dataset/code/corpus/ we have a set of scripts to convert TEI files into JSON with annotations corresponding to basic matching against data/software_lists/. Documentation is available here.
The sequence would be as follows (a sketch of the first step is shown after the list):
we start from a set of PDFs (from the Softcite corpus or any new PDFs); these can be converted into TEI XML via Grobid (which must be installed, for example via its Docker image), the simplest way being to use the grobid python client
we convert the TEI XML into the JSON format via TEI2LossyJSON.py: this produces paragraph-level segments with "ref-spans"
we add software annotations: via corpus2JSON.py if we started from the Softcite corpus PDFs and want to inject the manual annotations, via enrichJSON.py to add annotations with the "naive string matching" method, or alternatively via the software mention service (which must be installed and running)
When annotations are added, we will have sentence-level segments with "ref-spans" and the added "entity-spans" for the software annotations.
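For the first step, a hedged sketch of the batch conversion with the grobid Python client (https://github.com/kermitt2/grobid_client_python); the config path, the input/output directories, and the keyword arguments below are assumptions to be checked against the client's documentation:

```python
# Batch-convert PDFs to TEI XML with the grobid Python client.
# The config file points the client at a running Grobid instance
# (for example the Docker image mentioned above).
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")
client.process(
    "processFulltextDocument",  # full-text conversion service
    "./pdfs",                   # input directory with the PDF files (assumed path)
    output="./data/tei",        # directory where the TEI XML files are written (assumed path)
    n=10,                       # number of concurrent requests
)
```

The subsequent steps then run TEI2LossyJSON.py and corpus2JSON.py / enrichJSON.py on the resulting TEI files; see the scripts' own documentation for their exact arguments.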
About the corresponding added data introduced in #665:
softcite-dataset/data/tei contains the TEI files for the Softcite corpus PDFs (obtained with Grobid).
softcite-dataset/data/json contains the JSON files with the Softcite corpus manual annotations (obtained with corpus2JSON.py).
softcite-dataset/data/json_with_whitelist contains the JSON files with the Softcite corpus manual annotations and the "naive string matching" annotations (identified with "resp": "whitelist"; obtained with enrichJSON.py).
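For orientation, here is a hedged guess at what one sentence-level segment in data/json_with_whitelist could look like, and how the whitelist matches could be pulled out; apart from "ref-spans", "entity-spans" and "resp": "whitelist", the field names and the offset convention are assumptions and should be checked against the actual files:

```python
# Illustrative shape of one segment (schema details other than "ref-spans",
# "entity-spans" and "resp" are guesses). Offsets follow Python slicing convention here.
segment = {
    "text": "All statistical analyses were performed with SPSS version 24.",
    "ref-spans": [],
    "entity-spans": [
        {"start": 45, "end": 49, "text": "SPSS", "type": "software", "resp": "whitelist"},
    ],
}

# keep only the naive-string-matching annotations, e.g. to locate "mention-rich" chunks
whitelist_spans = [e for e in segment["entity-spans"] if e.get("resp") == "whitelist"]
print(whitelist_spans)
```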