GROBID software-mentions module

Work in progress.

The goal of this GROBID module is to recognize in textual documents and PDF any mentions of softwares.

As the other GROBID models, the module relies only on machine learning and uses linear CRF.

Install, build, run

Building module requires maven and JDK 1.8.

First install and build the latest development version of GROBID as explained by the documentation.

Copy the module software-mentions as sibling sub-project to grobid-core, grobid-trainer, etc.:

cp -r software-mentions grobid/

Copy the provided pre-trained model in the standard grobid-home path:

cd grobid/software-mentions/

mkdir ../grobid-home/models/software

cp resources/model/model.wapiti ../grobid-home/models/software/

Try compiling everything with:

mvn -Dmaven.test.skip=true clean install

Run some test:

mvn compile test

Start the service

mvn -Dmaven.test.skip=true jetty:run-war

Javascript demo/console web app is then accessible at http://localhost:8060. From the console and the RESTfull services tab, you can process chunk of text (select ProcessText) or process a complete PDF document (select Annotate PDF document).

Using curl POST/GET requests with some text:

curl -X POST -d "text=The next step is to install GROBID version 0.5.1." localhost:8060/processSoftwareText

which should return this:

{
    "entities": [{
        "software-name": {
            "rawForm": "GROBID",
            "offsetStart": 28,
            "offsetEnd": 34
        },
        "type": "software",
        "version-number": {
            "rawForm": "version 0.5.1",
            "offsetStart": 35,
            "offsetEnd": 48
        }
    }],
    "runtime": 2
}

curl -GET --data-urlencode "text=The final step is to update GROBID version 0.5.2-SNAPSHOT." localhost:8060/processSoftwareText

Using curl POST/PUT requests with a PDF file:

curl --form input=@./thefile.pdf localhost:8060/annotateSoftwarePDF

Runtimes are expressed in milliseconds.

Training and evaluation

Training only

For training the software model with all the available training data:

> cd PATH-TO-GROBID/grobid/software-mentions/

> mvn generate-resources -Ptrain_software

The training data must be under software-mentions/resources/dataset/software/corpus

Evaluating only

For evaluating under the labeled data under grobid-astro/resources/dataset/software/evaluation, use the command:

>  mvn compile exec:exec -Peval_software

Training and evaluating with automatic corpus split

The following commands will split automatically and randomly the available annotated data (under resources/dataset/software/corpus/) into a training set and an evaluation set, train a model based on the first set and launch an evaluation based on the second set.

>  mvn compile exec:exec -Peval_software_split

In this mode, by default, 90% of the available data is used for training and the remaining for evaluation. This ratio can be changed by editing the corresponding exec profile in the pom.xml file.

Training data import

Importing the softcite dataset

The source of training data is the softcite dataset developed by James Howison Lab at the University of Texas at Austin. The data need to be compiled with actual PDF content prelimiary to training in order to create XML annotated document (MUC conference style). This is done with the following command which takes 3 arguments:

> mvn exec:java -Dexec.mainClass=org.grobid.trainer.AnnotatedCorpusGeneratorCSV -Dexec.args="path/to/the/pdf/repo/ /path/to/csv/softcite-dataset/data/csv_dataset/ resources/dataset/software/corpus/"

The path to the PDF repo is the path where the PDF corresponding to the annotated document will be downloaded (done only the first time). For instance:

> mvn exec:java -Dexec.mainClass=org.grobid.trainer.AnnotatedCorpusGeneratorCSV -Dexec.args="/home/lopez/tools/softcite-dataset/pdf/ /home/lopez/tools/softcite-dataset/data/csv_dataset/ resources/dataset/software/corpus/"

The compiled XML training files will be written in the standard GROBID training path for the softwate recognition model under grobid/software-mentions/resources/dataset/software/corpus/.

Inter-Annotator Agreement measures

The import process includes the computation of standard Inter-Annotator Agreement (IIA) measures for the documents being annotated by at least two annotators. For the moment, the reported IIA is a percentage agreement measure, with standard error and confidence interval.

See this nice tutorial about IIA. We might need more sophisticated IIA measures than just pourcentage agreement for more robustness. We plan, in addition to pourcentage agreement, to also cover various IIA metrics from π, κ, and α families using the dkpro-statistics-agreement library:

Christian M. Meyer, Margot Mieskes, Christian Stab, and Iryna Gurevych: DKPro Agreement: An Open-Source Java Library for Measuring Inter-Rater Agreement, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING), pp. 105–109, August 2014. Dublin, Ireland.

For explanations on these IIA measures, see:

Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555-596.

Analysis of training data consistency

A Python 3.* script is available under script/ to analyse XML training data and spot possible unconsistencies to review. To launch the script:

> python3 script/consistency.py _absolute_path_to_training_directory_

For instance:

> python3 script/consistency.py /home/lopez/grobid/software-mentions/resources/dataset/software/corpus/

See the description of the output directly in the header of the script/consistency.py file.

Generation of training data

For generating training data in XML/TEI, based on the current model, from a list of text or PDF files in a input repository, use the following command:

> java -Xmx4G -jar target/software-mentions/-0.5.1-SNAPSHOT.one-jar.jar -gH ../grobid-home -dIn ~/test_software/ -dOut ~/test_software/out/ -exe createTraining

License

GROBID and the grobid software-mentions module are distributed under Apache 2.0 license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

GROBID software-mentions module

Install, build, run

Start the service

Training and evaluation

Training only

Evaluating only

Training and evaluating with automatic corpus split

Training data import

Importing the softcite dataset

Inter-Annotator Agreement measures

Analysis of training data consistency

Generation of training data

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

GROBID software-mentions module

Install, build, run

Start the service

Training and evaluation

Training only

Evaluating only

Training and evaluating with automatic corpus split

Training data import

Importing the softcite dataset

Inter-Annotator Agreement measures

Analysis of training data consistency

Generation of training data

License