A part-of-speech tagger with support for domain adaptation and external resources.

SoMeWeTa

Introduction

SoMeWeTa (short for Social Media and Web Tagger) is a part-of-speech tagger that supports domain adaptation and can incorporate external sources of information such as Brown clusters and lexica. It is based on the averaged structured perceptron and uses beam search and an early update strategy. The tagger can also be trained and evaluated on partially annotated data.

SoMeWeTa achieves state-of-the-art results on the German web and social media texts from the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. It is therefore particularly well suited to tagging all kinds of written German discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

The system is described in greater detail in Proisl (2018).

For tokenization and sentence splitting on these kinds of text, we recommend SoMaJo, a tokenizer and sentence splitter with state-of-the-art performance on German web and social media texts:

somajo-tokenizer --split_sentences <file> | somewe-tagger --tag <model> -

In addition to the German web and social media model, we also provide models trained on German, English and French newspaper texts, as well as models for Bhojpuri and spoken Italian. For all languages, SoMeWeTa achieves highly competitive results close to the current state of the art.

Installation

SoMeWeTa can be easily installed using pip:

pip3 install SoMeWeTa

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/SoMeWeTa.git

In the new directory, run the following command:

python3 setup.py install

Optional dependency

If your Python version has insertion-ordered dictionaries (for CPython this means version 3.6 and later, for any other Python implementation this means 3.7 and later), you can drastically reduce the amount of memory needed for loading a tagger model by installing the ijson library:

pip3 install ijson
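Whether the condition above holds on a given interpreter can be checked from Python. The following is a minimal sketch; the helper names `has_ordered_dicts` and `ijson_installed` are made up for illustration and are not part of SoMeWeTa.

```python
import importlib.util
import platform
import sys


def has_ordered_dicts(implementation=None, version=None):
    """Return True if dicts preserve insertion order (hypothetical helper).

    CPython: 3.6 and later; any other implementation: 3.7 and later.
    """
    implementation = implementation or platform.python_implementation()
    version = version or tuple(sys.version_info[:2])
    if implementation == "CPython":
        return version >= (3, 6)
    return version >= (3, 7)


def ijson_installed():
    """Return True if the ijson package can be found on the import path."""
    return importlib.util.find_spec("ijson") is not None


if __name__ == "__main__":
    print("insertion-ordered dicts:", has_ordered_dicts())
    print("ijson installed:", ijson_installed())
```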

Usage

You can use the tagger as a standalone program from the command line. General usage information is available via the -h option:

somewe-tagger -h

Tagging a text

SoMeWeTa requires that the input texts are tokenized and split into sentences. Tokenization and sentence splitting have to be consistent with the corpora the tagger model has been trained on. For German texts, we recommend SoMaJo, a tokenizer and sentence splitter with state-of-the-art performance on German web and social media texts. The expected input format is one token per line with an empty line after each sentence.
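If the text is already tokenized in Python, the expected input format can be produced with a few lines of code. This is only a sketch; the function name `to_tagger_input` is an assumption made for this example.

```python
def to_tagger_input(sentences):
    """Serialize tokenized sentences: one token per line,
    followed by an empty line after each sentence."""
    lines = []
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # empty line marks the end of the sentence
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sentences = [["Das", "ist", "gut", "."], ["Ja", "!"]]
    print(to_tagger_input(sentences), end="")
```

The resulting string can be written to a file and passed to the tagger.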

To tag a file, run the following command:

somewe-tagger --tag <model> <file>

If your machine has multiple cores, you can use the --parallel option to speed up tagging. To tag a file using four cores, use this command:

somewe-tagger --parallel 4 --tag <model> <file>

Using the option -x or --xml, it is possible to tag an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --tag <model> <file>

When called with the --progress option, SoMeWeTa displays tagging progress, average and current tagging speed and remaining time.

Training the tagger

The expected input format for training the tagger is one token-pos pair per line, where token and pos are separated by a tab character, and an empty line after each sentence. To train a model, run the following command:

somewe-tagger --train <model> <file>
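Before training, it can be useful to check that a corpus really follows this format. The sketch below parses token-pos lines into sentences and raises an error on malformed lines; `read_annotated` is a hypothetical helper, not SoMeWeTa's own reader.

```python
def read_annotated(lines):
    """Parse token<TAB>pos lines into a list of sentences.

    Each sentence is a list of (token, pos) tuples; empty lines
    separate sentences. A line without exactly one tab raises ValueError.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos = line.split("\t")
        current.append((token, pos))
    if current:  # input did not end with an empty line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = ["Das\tPDS\n", "ist\tVAFIN\n", "gut\tADJD\n", ".\t$.\n", "\n"]
    print(read_annotated(sample))
```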

SoMeWeTa supports domain adaptation. First train a model on the background corpus, then use this model as prior when training the in-domain model:

somewe-tagger --train <model> --prior <background_model> <file>
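The two training steps can also be scripted. The sketch below only builds the two command lines from the options shown above; all file names are placeholders, and the commands are printed rather than executed.

```python
import shlex


def domain_adaptation_commands(background_corpus, background_model,
                               in_domain_corpus, in_domain_model):
    """Return the two somewe-tagger invocations for domain adaptation."""
    # Step 1: train the background model.
    step1 = ["somewe-tagger", "--train", background_model, background_corpus]
    # Step 2: train the in-domain model, using the background model as prior.
    step2 = ["somewe-tagger", "--train", in_domain_model,
             "--prior", background_model, in_domain_corpus]
    return [step1, step2]


if __name__ == "__main__":
    # File names are hypothetical placeholders.
    for cmd in domain_adaptation_commands("background.txt", "background.model",
                                          "in_domain.txt", "in_domain.model"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tagger is installed.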

SoMeWeTa can make use of additional sources of information. You can use the --brown option to provide a file with Brown clusters (the paths file produced by wcluster) and the --lexicon option to provide a lexicon with additional token-level information. The lexicon should consist of lines with tab-separated token-value pairs, e.g.:

welcome	ADJ
welcome	INTJ
welcome	NOUN
welcome	VERB
work	NOUN
work	VERB
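As the example shows, a token may appear on several lines, once per value. A minimal sketch of reading such a file into a dictionary follows; `read_lexicon` is a hypothetical helper and does not reflect how SoMeWeTa processes the lexicon internally.

```python
from collections import defaultdict


def read_lexicon(lines):
    """Collect tab-separated token-value pairs into {token: set of values}."""
    lexicon = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        token, value = line.split("\t")
        lexicon[token].add(value)
    return dict(lexicon)


if __name__ == "__main__":
    sample = ["welcome\tADJ", "welcome\tINTJ", "welcome\tNOUN",
              "welcome\tVERB", "work\tNOUN", "work\tVERB"]
    for token, values in read_lexicon(sample).items():
        print(token, sorted(values))
```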

It is also possible to train the tagger on partially annotated data. To do this, assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --train <model> --ignore-tag <pseudo-tag> <file>
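Preparing such a file could look like the following sketch. Unannotated tokens are represented as `None` here, and the pseudo-tag `UNANNOTATED` is a hypothetical choice; any string that does not occur in the tagset should work.

```python
PSEUDO_TAG = "UNANNOTATED"  # hypothetical pseudo-tag, not part of any tagset


def to_training_lines(sentence, pseudo_tag=PSEUDO_TAG):
    """Format (token, tag-or-None) pairs as token<TAB>tag lines.

    Tokens without a tag receive the pseudo-tag; the trailing empty
    string marks the end of the sentence.
    """
    lines = [f"{token}\t{tag if tag is not None else pseudo_tag}"
             for token, tag in sentence]
    lines.append("")
    return lines


if __name__ == "__main__":
    sentence = [("lol", None), ("das", "PDS"), ("stimmt", "VVFIN"), (":)", None)]
    print("\n".join(to_training_lines(sentence)))
```

The same pseudo-tag is then passed to the tagger via --ignore-tag.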

Using the option -x or --xml, it is possible to train the tagger on an XML file. It is assumed that each XML tag is on a separate line:

somewe-tagger --xml --train <model> <file>

Evaluating a model

To evaluate a model, you need an annotated input file in the same format as for training. Then you can run the following command:

somewe-tagger --evaluate <model> <file>

You can also evaluate a model on partially annotated data. Simply assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --evaluate <model> --ignore-tag <pseudo-tag> <file>

Using the option -x or --xml, it is possible to evaluate a model on an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --evaluate <model> <file>

Performing cross-validation

You can also perform a 10-fold cross-validation on a training corpus:

somewe-tagger --crossvalidate <file>
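For readers unfamiliar with the procedure: in k-fold cross-validation the corpus is divided into k parts, and each part serves once as test data while the remaining parts are used for training. The sketch below illustrates the general idea with a simple round-robin split; it is not SoMeWeTa's implementation, which may partition the data differently.

```python
def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs for k-fold cross-validation.

    Sentence j is assigned to fold j % k (round-robin assignment,
    chosen here for simplicity).
    """
    for fold in range(k):
        test = [s for j, s in enumerate(sentences) if j % k == fold]
        train = [s for j, s in enumerate(sentences) if j % k != fold]
        yield train, test


if __name__ == "__main__":
    corpus = [f"sentence_{i}" for i in range(25)]  # stand-in for real sentences
    for i, (train, test) in enumerate(k_fold_splits(corpus)):
        print(f"fold {i}: {len(train)} training, {len(test)} test sentences")
```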

To perform a cross-validation on partially annotated data, assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --crossvalidate --ignore-tag <pseudo-tag> <file>

Using the option -x or --xml, it is possible to perform a cross-validation on an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --crossvalidate <file>

Using the module

To incorporate the tagger into your own Python project, you have to import someweta.ASPTagger, create an ASPTagger object, load a pretrained model and call the tag_sentence method:

from someweta import ASPTagger

model = "german_web_social_media_2018-12-21.model"
sentences = [["Ein", "Satz", "ist", "eine", "Liste", "von", "Tokens", "."],
             ["Zeitfliegen", "mögen", "einen", "Pfeil", "."]]

asptagger = ASPTagger()
asptagger.load(model)

for sentence in sentences:
    tagged_sentence = asptagger.tag_sentence(sentence)
    print("\n".join(["\t".join(t) for t in tagged_sentence]), "\n", sep="")
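If the input is a file that is already tokenized in the one-token-per-line format, it has to be converted into lists of tokens before calling tag_sentence. A minimal sketch of such a reader follows; `read_tokenized` is a hypothetical helper, not part of the someweta module.

```python
def read_tokenized(lines):
    """Yield sentences (lists of tokens) from one-token-per-line input,
    where an empty line ends a sentence."""
    sentence = []
    for line in lines:
        token = line.rstrip("\r\n")
        if token:
            sentence.append(token)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # input did not end with an empty line
        yield sentence


if __name__ == "__main__":
    sample = ["Ein\n", "Satz\n", ".\n", "\n", "Noch\n", "einer\n", ".\n"]
    for sentence in read_tokenized(sample):
        # With a loaded model: asptagger.tag_sentence(sentence)
        print(sentence)
```

Because an open file object iterates over its lines, it can be passed to the function directly.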

Here is an example of using SoMaJo and SoMeWeTa in combination to perform tokenization, sentence splitting and part-of-speech tagging:

import somajo
import someweta

filename = "test.txt"
model = "german_web_social_media_2018-12-21.model"

asptagger = someweta.ASPTagger()
asptagger.load(model)

# See https://github.com/tsproisl/SoMaJo#using-the-module
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=False)
sentences = tokenizer.tokenize_text_file(filename, paragraph_separator="empty_lines")
for sentence in sentences:
    tokens = [token.text for token in sentence]
    tagged_sentence = asptagger.tag_sentence(tokens)
    print("\n".join("\t".join(t) for t in tagged_sentence), "\n", sep="")

Model files

| Model | Tagset | Est. accuracy |
|---|---|---|
| German newspaper | STTS (TIGER) | 98.02% |
| German web and social media | STTS_IBK | 92.18% |
| English newspaper | Penn | 97.25% |
| French newspaper | FTB-29 | 97.71% |
| Spoken Italian | UD (KIPoS) | 91.79% |
| Bhojpuri | BIS-33 | 92.58% |

German newspaper texts

This model has been trained on the entire TIGER corpus and uses Brown clusters (extracted from DECOW16AX, GeRedE and a collection of German tweets) and coarse wordclasses extracted from Morphy as additional information.

To estimate the accuracy of this model, we performed a 10-fold cross-validation on the TIGER corpus with the same settings, resulting in a 95% confidence interval of 98.02% ±0.12.
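The README does not state how the interval was computed. One common way to derive a 95% confidence interval from ten fold accuracies is the Student's t interval around the mean, sketched below under that assumption. The fold accuracies in the example are invented for illustration and are not the actual results.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_975_DF9 = 2.262


def confidence_interval_95(fold_accuracies):
    """Return (mean, half_width) of a t-based 95% CI for exactly 10 folds.

    Assumption: this is a generic textbook method, not necessarily the
    one used to produce the figures reported for the models.
    """
    n = len(fold_accuracies)
    if n != 10:
        raise ValueError("the hard-coded t value is only valid for 10 folds")
    mean = statistics.mean(fold_accuracies)
    half_width = T_975_DF9 * statistics.stdev(fold_accuracies) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical fold accuracies, for illustration only.
    folds = [97.9, 98.1, 98.0, 97.8, 98.2, 98.0, 97.9, 98.1, 98.3, 97.9]
    mean, half_width = confidence_interval_95(folds)
    print(f"{mean:.2f}% ±{half_width:.2f}")
```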

Download model (111 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

German web and social media texts

This model uses a variant of the above model as prior and is trained on the entire EmpiriST 2.0 corpus, i.e. both the training and the test data, as well as a little bit of additional training data (cf. the data directory of this repository). It uses the same additional sources of information as the prior model.

A variant of this model that only uses the training part of the EmpiriST corpus achieves a mean accuracy of 92.18% on the two test sets:

| Corpus | all words | known words | unknown words |
|---|---|---|---|
| CMC | 90.39 ±0.30 | 92.42 ±0.29 | 77.57 ±1.40 |
| Web | 93.96 ±0.16 | 95.56 ±0.17 | 83.40 ±0.69 |
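The reported mean of 92.18% is consistent with the unweighted average of the two "all words" accuracies, (90.39 + 93.96) / 2 = 92.175. The short check below uses decimal arithmetic to avoid floating-point rounding surprises; it assumes the mean is taken over the rounded table values, whereas the original figure may have been computed from unrounded ones.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_accuracy(values):
    """Unweighted mean of accuracy strings, rounded half-up to two decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # "all words" accuracies on the CMC and Web test sets
    print(mean_accuracy(["90.39", "93.96"]))  # 92.18
```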

Download model (112 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

English newspaper texts

This model has been trained on all sections of the Wall Street Journal part of the Penn Treebank and uses Brown clusters extracted from ENCOW14 and part-of-speech data extracted from the English DELA dictionary as additional information.

A variant of this model that was trained only on sections 0–18 of the Wall Street Journal achieves the following results on the usual development and test sets:

| Data set | all words | known words | unknown words |
|---|---|---|---|
| dev (19–21) | 97.15 ±0.02 | 97.41 ±0.03 | 89.59 ±0.28 |
| test (22–24) | 97.25 ±0.02 | 97.42 ±0.03 | 91.05 ±0.29 |

Download model (38 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

French newspaper texts

This model has been trained on the French Treebank and uses Brown clusters extracted from FRCOW16 and part-of-speech data extracted from the French DELA dictionary as additional information.

The French Treebank is annotated with two different tagsets: A coarse-grained tagset consisting of 15 tags and a more fine-grained tagset consisting of 29 tags. The model has been trained on the more fine-grained tagset. However, we provide a mapping to the smaller tagset (data/mapping_french_29_to_15.json) that can be used to annotate a text with both tagsets:

somewe-tagger --tag <model> --mapping <mapping> <file>
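Conceptually, the mapping replaces each fine-grained tag with its coarse-grained counterpart. The sketch below illustrates this on tagger-style output. It assumes that the mapping is a flat JSON object from fine-grained to coarse-grained tags; the four entries shown are an illustrative guess and were not copied from data/mapping_french_29_to_15.json.

```python
import json

# Illustrative excerpt only; the structure and entries are assumptions.
MAPPING_JSON = '{"DET": "D", "NC": "N", "V": "V", "PONCT": "PONCT"}'


def add_coarse_tags(tagged_sentence, mapping):
    """Extend (token, fine_tag) pairs to (token, fine_tag, coarse_tag)."""
    return [(token, tag, mapping[tag]) for token, tag in tagged_sentence]


if __name__ == "__main__":
    mapping = json.loads(MAPPING_JSON)
    tagged = [("Le", "DET"), ("chat", "NC"), ("dort", "V"), (".", "PONCT")]
    for row in add_coarse_tags(tagged, mapping):
        print("\t".join(row))
```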

To estimate the accuracy of the model, we performed a 10-fold cross-validation on the French Treebank using the same settings:

| Tagset | Accuracy |
|---|---|
| 29 tags | 97.71 ±0.13 |
| 15 tags (mapped) | 98.22 ±0.11 |

Download model (28 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

Spoken Italian

This model was pretrained on the union of all Italian corpora in the Universal Dependencies project and then adapted to spoken Italian using annotated data from the KIParla corpus. The model uses coarse-grained wordclass information from Morph-it! and Brown clusters extracted from a collection of Italian corpora (OpenSubtitles, Reddit posts, PAISÀ, Wikimedia dumps, OSCAR). The input text must be tokenized according to the UD tokenization guidelines. In particular, the model expects that contracted forms like parlarmi (parlar + mi) or della (di + la) are split into their constituents. A detailed description and analysis of the model is available in Proisl and Lapesa (2020).
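To illustrate what this splitting means for the input, here is a toy sketch based on a lookup table. The table contains only the two examples from the paragraph above and `split_contractions` is a hypothetical helper; real text requires a tokenizer that implements the UD guidelines for Italian.

```python
# Toy lookup table holding only the two examples mentioned in the text.
SPLITS = {
    "parlarmi": ["parlar", "mi"],
    "della": ["di", "la"],
}


def split_contractions(tokens, table=SPLITS):
    """Replace each token found in the table by its constituents
    (exact, case-sensitive match; all other tokens are kept as-is)."""
    result = []
    for token in tokens:
        result.extend(table.get(token, [token]))
    return result


if __name__ == "__main__":
    print(split_contractions(["parlarmi", "della", "casa"]))
```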

A variant of this model that only uses the training part of the KIParla corpus achieves a mean accuracy of 91.79% on the two test sets:

| Corpus | all words | known words | unknown words |
|---|---|---|---|
| formal | 92.67 | 93.39 | 67.92 |
| informal | 90.90 | 91.41 | 75.00 |

Download model (43 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

Bhojpuri

This model has been trained on ca. 105,000 tokens of annotated Bhojpuri text provided by the organizers of the NSURL shared task for Bhojpuri. Additionally, the model uses Brown clusters extracted from text collections of related languages (Hindi and Bihari Wikimedia dumps and a Magahi corpus). The model uses a fine-grained variant of the Bureau of Indian Standards (BIS) annotation scheme with 33 tags. A more detailed description of the model can be found in Proisl et al. (2019).

A variant of this model that only uses the training part of the dataset achieves an accuracy of 92.58% on the test set:

| all words | known words | unknown words |
|---|---|---|
| 92.58 | 94.57 | 75.09 |

Download model (3.7 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

References

  • If you use SoMeWeTa for academic research, please consider citing the following paper:

    Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–670. Miyazaki: European Language Resources Association (ELRA). PDF.

    @InProceedings{Proisl_LREC:2018,
      author    = {Proisl, Thomas},
      title     = {{SoMeWeTa}: {A} Part-of-Speech Tagger for {G}erman Social Media and Web Texts},
      booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)},
      year      = {2018},
      address   = {Miyazaki},
      publisher = {European Language Resources Association {ELRA}},
      pages     = {665--670},
      url       = {http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf},
    }
  • If you use the model for spoken Italian, please also consider citing the following paper:

    Proisl, Thomas, and Gabriella Lapesa. 2020. “KLUMSy@KIPoS: Experiments on Part-of-Speech Tagging of Spoken Italian.” In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR-WS.org. PDF.

    @InProceedings{Proisl_Lapesa_EVALITA:2020,
      author    = {Proisl, Thomas and Lapesa, Gabriella},
      title     = {{KLUMSy@KIPoS}: Experiments on Part-of-Speech Tagging of Spoken {I}talian},
      booktitle = {Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for {I}talian ({EVALITA} 2020)},
      year      = {2020},
      editor    = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},
      address   = {Online},
      publisher = {CEUR-WS.org},
      url       = {http://ceur-ws.org/Vol-2765/paper140.pdf}
    }
  • If you use the Bhojpuri model, please also consider citing the following paper:

    Proisl, Thomas, Peter Uhrig, Philipp Heinrich, Andreas Blombach, Sefora Mammarella, Natalie Dykes, and Besim Kabashi. 2019. “The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri Without Even Knowing the Alphabet.” In Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019), 73–79. Trento: Association for Computational Linguistics. PDF.

    @InProceedings{Proisl_et_al_NSURL:2019,
      author    = {Proisl, Thomas and Uhrig, Peter and Heinrich, Philipp and Blombach, Andreas and Mammarella, Sefora and Dykes, Natalie and Kabashi, Besim},
      title     = {{T}he\_{I}lliterati: Part-of-Speech Tagging for {M}agahi and {B}hojpuri without Even Knowing the Alphabet},
      booktitle = {Proceedings of the First International Workshop on {NLP} Solutions for Under Resourced Languages ({NSURL} 2019)},
      year      = {2019},
      pages     = {73--79},
      address   = {Trento},
      publisher = {Association for Computational Linguistics},
      url       = {https://www.aclweb.org/anthology/2019.nsurl-1.11}
    }