Quickly annotate your existing dataset with linguistic features (POS, NE, DEP) using Stanford CoreNLP

ChristophAlt/StAn

StAn - Quickly annotate your dataset with Stanford CoreNLP

In natural language processing, algorithms often require additional linguistic features (syntactic and semantic), such as part-of-speech, named entity, and dependency tags. This information is not readily available in most datasets. StAn provides a convenient way to quickly annotate an existing dataset with these features, computed by Stanford CoreNLP.

Getting Started

Prerequisites

StAn uses either a local CoreNLP installation or an existing CoreNLP server. To use a local installation, download and unpack the latest version from the Stanford CoreNLP website.
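If you plan to point StAn at an existing server, it can help to confirm that the server responds before starting a long annotation run. The sketch below is not part of StAn. It builds a request for the CoreNLP server's HTTP interface, which takes the text as the POST body and a JSON-encoded `properties` query parameter. The address `http://localhost:9000` and the helper names `build_corenlp_url` and `annotate` are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(
    server_url: str,
    annotators: str = "tokenize,ssplit,pos,ner,depparse",
) -> str:
    """Build the request URL for a CoreNLP server (hypothetical helper, not part of StAn)."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return f"{server_url.rstrip('/')}/?{query}"


def annotate(text: str, server_url: str = "http://localhost:9000") -> dict:
    """Send text to a running CoreNLP server and return the parsed JSON response."""
    request = urllib.request.Request(
        build_corenlp_url(server_url),
        data=text.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running server at the assumed address):
# result = annotate("StAn annotates datasets with Stanford CoreNLP.")
# for token in result["sentences"][0]["tokens"]:
#     print(token["word"], token["pos"], token["ner"])
```

If the call fails with a connection error, the server is most likely not running or is listening on a different port.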

Installing

With pip

TBD

From Source

Clone the repository and run:

pip install [--editable] .

Usage

For example, the following command annotates the SemEval 2010 Task 8 relation extraction dataset with POS, NER, and dependency information and saves it in JSONL format.

stan \
    --input-dir $INPUT_PATH/SemEval2010_task8_all_data/ \
    --output-dir $OUTPUT_PATH/ \
    --corenlp $PATH_TO_CORENLP_JAR_OR_SERVER_URL \
    --input-format semeval2010task8 \
    --output-format jsonl \
    --shuffle \
    --validation-size 0.1 \
    --n-jobs 4
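Because JSONL stores one JSON object per line, the output of the command above can be inspected with a few lines of Python. This is a minimal sketch: the file name `train.jsonl` is an assumption, and the fields of each record are not documented here, so the example only lists whatever keys it finds.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical usage; "train.jsonl" is an assumed file name, so check
# your output directory for the actual names.
# for example in read_jsonl(Path("output") / "train.jsonl"):
#     print(sorted(example.keys()))
```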

Parameters:

  • input-dir: the directory containing the dataset or dataset files. StAn expects a specific structure for common datasets (e.g. SemEval 2010 Task 8). The format of the input is specified by input-format.
  • output-dir: the directory to store the annotated dataset. The format in which to save the dataset is specified by output-format.
  • corenlp: the path to the directory containing the CoreNLP jar file, or a URL pointing to an existing CoreNLP server.
  • input-format: the format of the input dataset, one of "semeval2010task8", "json", or "jsonl".
  • output-format: the format of the output dataset, one of "tacred", "json", or "jsonl".
  • shuffle: whether to shuffle the training dataset before splitting into train and validation (only if validation size > 0).
  • validation-size: if > 0, use a validation-size fraction of the training dataset for validation.
  • n-jobs: the number of threads to use for concurrent requests to CoreNLP.
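The interaction between shuffle and validation-size described above can be sketched as follows. This is an illustration of the documented behaviour, not StAn's actual implementation. The function name, the `seed` argument, and the choice to take the validation examples from the front of the list are all assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(
    examples: Sequence[T],
    validation_size: float = 0.0,
    shuffle: bool = False,
    seed: Optional[int] = None,  # assumption: StAn does not document a seed option
) -> Tuple[List[T], List[T]]:
    """Split examples into (train, validation). Illustrative only."""
    if not 0.0 <= validation_size < 1.0:
        raise ValueError("validation_size must be in [0, 1)")
    items = list(examples)
    # With no validation split requested, shuffling is skipped entirely.
    if validation_size == 0.0:
        return items, []
    if shuffle:
        random.Random(seed).shuffle(items)
    n_validation = int(len(items) * validation_size)
    return items[n_validation:], items[:n_validation]
```

With 20 examples and a validation size of 0.25, this returns 15 training examples and 5 validation examples.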

Running the tests

The automated tests consist of unit tests and a type check.

Unit tests

pytest -v tests/

Type checker and coding style tests

mypy stan --ignore-missing-imports

Built With

Authors

  • Christoph Alt

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.
