CoreNLPRusModels

Stanford Tagger and NN Dependency Parser Models for the Russian Language

  1. Parser models
  2. Tagger models and lemmatization resources

Getting Started with the Pipeline for the Russian Language

  1. Clone CoreNLP from the project repository.

  2. Download the lemmatization resource 'dict.tsv' and the tagger and parser models using the links in the 'CoreNLPRusModels' section above.

  3. Build the project and run the Launcher (edu.stanford.nlp.international.russian.process.Launcher).
    The Launcher requires the following parameters:

  • -tagger - path to the POS-tagging model russian-ud-pos.tagger;
  • -taggerMF - path to the POS-tagging model russian-ud-mf.tagger, which outputs POS tags with inflectional morphological features (according to UD v.2); the parsing model reuses these features;
  • -mf - if this flag is true, inflectional morphology is written to the FEATS field of the CoNLL annotations;
  • -parser - path to the dependency parser model, whose inventory of syntactic relations follows UD v.2. A good starting point is nndep.rus.modelMFWiki100HS400_80.txt.gz, which uses embeddings trained on a Wikipedia dump;
  • -pLemmaDict - path to dict.tsv, preferably placed in the /CoreNLP/src/edu/stanford/nlp/international/russian/process directory;
  • -pText - path to the input file (encoding: UTF-8), e.g. /home/filepath/input_file.txt;
  • -pResults - path to the output '.conll' file, in CoNLL-U format.
  4. Example of running from the console:
java -Xmx8g edu.stanford.nlp.international.russian.process.Launcher -tagger russian-ud-pos.tagger -taggerMF russian-ud-mf.tagger -pLemmaDict src/edu/stanford/nlp/international/russian/process/dict.tsv -parser nndep.rus.modelMFWiki100HS400_80.txt.gz -pText input.txt -pResults output.conll -mf 
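
To inspect the result of a run like the one above, the output file can be read back as tab-separated columns. The following is a minimal sketch, assuming the Launcher writes standard 10-column CoNLL-U (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The class ConlluPreview and its methods are hypothetical helpers written for illustration; they are not part of CoreNLP or of this repository.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of CoreNLP or this repository.
class ConlluPreview {

    // One token line of a CoNLL-U file (only the columns printed below).
    static final class Token {
        final String id, form, lemma, upos, feats, head, deprel;

        Token(String[] cols) {
            id = cols[0];
            form = cols[1];
            lemma = cols[2];
            upos = cols[3];
            feats = cols[5];   // inflectional morphology, or "_" when absent
            head = cols[6];
            deprel = cols[7];
        }
    }

    // Returns null for blank lines (sentence breaks), comment lines,
    // and lines that do not have exactly 10 tab-separated columns.
    static Token parseLine(String line) {
        if (line.trim().isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] cols = line.split("\t", -1);
        if (cols.length != 10) {
            return null;
        }
        return new Token(cols);
    }

    // Collects every well-formed token line, in file order.
    static List<Token> parse(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            Token t = parseLine(line);
            if (t != null) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the path matches the value passed to -pResults.
        String path = args.length > 0 ? args[0] : "output.conll";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        for (Token t : parse(lines)) {
            System.out.println(t.id + "\t" + t.form + "\t" + t.lemma + "\t"
                    + t.upos + "\t" + t.feats + "\t" + t.head + "\t" + t.deprel);
        }
    }
}
```

Saved as ConlluPreview.java, it can be compiled with javac and run as `java ConlluPreview output.conll`. Sentence boundaries are dropped in this sketch; a fuller reader would group tokens by the blank lines that separate sentences.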

Other Requirements

  • Java 1.8
  • allocate at least 5 GB for the JVM: -Xmx5g
  • input file encoding: UTF-8
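
The encoding and memory requirements can be checked before a long run. The sketch below is one possible approach, not something shipped with the pipeline: the class PreflightCheck and its method isValidUtf8 are hypothetical names. It decodes the input file strictly as UTF-8 and prints the maximum heap the current JVM reports, which may differ somewhat from the value given to -Xmx.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of CoreNLP or this repository.
class PreflightCheck {

    // Returns true only if the bytes decode as UTF-8 without any
    // malformed or unmappable sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the path matches the value passed to -pText.
        String path = args.length > 0 ? args[0] : "input.txt";
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        System.out.println("Valid UTF-8: " + isValidUtf8(bytes));

        // Maximum heap as reported by this JVM, in MiB.
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
        System.out.println("Max heap (MiB): " + maxHeapMiB);
    }
}
```

Running it with the same -Xmx value as the Launcher (for example `java -Xmx5g PreflightCheck input.txt`) shows what heap limit that setting yields on the local JVM. A file saved in a legacy Cyrillic encoding such as Windows-1251 will normally fail the strict decoding step.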

If you find the pipeline useful in your research, please consider citing our paper:

@inproceedings{DBLP:conf/kesw/KovriguinaSSP17,
  author    = {Liubov Kovriguina and
               Ivan Shilin and
               Alexander Shipilo and
               Alina Putintseva},
  title     = {Russian Tagging and Dependency Parsing Models for Stanford CoreNLP
               Natural Language Toolkit},
  booktitle = {Knowledge Engineering and Semantic Web - 8th International Conference,
               {KESW} 2017, Szczecin, Poland, November 8-10, 2017, Proceedings},
  pages     = {101--111},
  year      = {2017},
  doi       = {10.1007/978-3-319-69548-8\_8}
}