Skip to content

Pre-trained models and language resources for Natural Language Processing in Polish

License

Notifications You must be signed in to change notification settings

sdadas/polish-nlp-resources

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

94 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Polish NLP resources

This repository contains pre-trained models and language resources for Natural Language Processing in Polish created during my research. Some of the models are also available on Huggingface Hub.

If you'd like to use any of those resources in your research please cite:

@Misc{polish-nlp-resources,
  author =       {S{\l}awomir Dadas},
  title =        {A repository of Polish {NLP} resources},
  howpublished = {Github},
  year =         {2019},
  url =          {https://github.com/sdadas/polish-nlp-resources/}
}

Contents

Word embeddings

The following section includes pre-trained word embeddings for Polish. Each model was trained on a corpus consisting of Polish Wikipedia dump, Polish books and articles, 1.5 billion tokens at total.

Word2Vec

Word2Vec trained with Gensim. 100 dimensions, negative sampling, contains lemmatized words with 3 or more ocurrences in the corpus and additionally a set of pre-defined punctuation symbols, all numbers from 0 to 10'000, Polish forenames and lastnames. The archive contains embedding in gensim binary format. Example of usage:

from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load("word2vec_100_3_polish.bin")
    print(word2vec.similar_by_word("bierut"))
    
# [('cyrankiewicz', 0.818274736404419), ('gomułka', 0.7967918515205383), ('raczkiewicz', 0.7757788896560669), ('jaruzelski', 0.7737460732460022), ('pużak', 0.7667238712310791)]

Download (GitHub)

FastText

FastText trained with Gensim. Vocabulary and dimensionality is identical to Word2Vec model. The archive contains embedding in gensim binary format. Example of usage:

from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load("fasttext_100_3_polish.bin")
    print(word2vec.similar_by_word("bierut"))
    
# [('bieruty', 0.9290274381637573), ('gierut', 0.8921363353729248), ('bieruta', 0.8906412124633789), ('bierutow', 0.8795544505119324), ('bierutowsko', 0.839280366897583)]

Download (OneDrive)

GloVe

Global Vectors for Word Representation (GloVe) trained using the reference implementation from Stanford NLP. 100 dimensions, contains lemmatized words with 3 or more ocurrences in the corpus. Example of usage:

from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load_word2vec_format("glove_100_3_polish.txt")
    print(word2vec.similar_by_word("bierut"))
    
# [('cyrankiewicz', 0.8335597515106201), ('gomułka', 0.7793121337890625), ('bieruta', 0.7118682861328125), ('jaruzelski', 0.6743760108947754), ('minc', 0.6692837476730347)]

Download (GitHub)

High dimensional word vectors

Pre-trained vectors using the same vocabulary as above but with higher dimensionality. These vectors are more suitable for representing larger chunks of text such as sentences or documents using simple word aggregation methods (averaging, max pooling etc.) as more semantic information is preserved this way.

GloVe - 300d: Part 1 (GitHub), 500d: Part 1 (GitHub) Part 2 (GitHub), 800d: Part 1 (GitHub) Part 2 (GitHub) Part 3 (GitHub)

Word2Vec - 300d (OneDrive), 500d (OneDrive), 800d (OneDrive)

FastText - 300d (OneDrive), 500d (OneDrive), 800d (OneDrive)

Compressed Word2Vec

This is a compressed version of the Word2Vec embedding model described above. For compression, we used the method described in Compressing Word Embeddings via Deep Compositional Code Learning by Shu and Nakayama. Compressed embeddings are suited for deployment on storage-poor devices such as mobile phones. The model weights 38MB, only 4.4% size of the original Word2Vec embeddings. Although the authors of the article claimed that compressing with their method doesn't hurt model performance, we noticed a slight but acceptable drop of accuracy when using compressed version of embeddings. Sample decoder class with usage:

import gzip
from typing import Dict, Callable
import numpy as np

class CompressedEmbedding(object):

    def __init__(self, vocab_path: str, embedding_path: str, to_lowercase: bool=True):
        self.vocab_path: str = vocab_path
        self.embedding_path: str = embedding_path
        self.to_lower: bool = to_lowercase
        self.vocab: Dict[str, int] = self.__load_vocab(vocab_path)
        embedding = np.load(embedding_path)
        self.codes: np.ndarray = embedding[embedding.files[0]]
        self.codebook: np.ndarray = embedding[embedding.files[1]]
        self.m = self.codes.shape[1]
        self.k = int(self.codebook.shape[0] / self.m)
        self.dim: int = self.codebook.shape[1]

    def __load_vocab(self, vocab_path: str) -> Dict[str, int]:
        open_func: Callable = gzip.open if vocab_path.endswith(".gz") else open
        with open_func(vocab_path, "rt", encoding="utf-8") as input_file:
            return {line.strip():idx for idx, line in enumerate(input_file)}

    def vocab_vector(self, word: str):
        if word == "<pad>": return np.zeros(self.dim)
        val: str = word.lower() if self.to_lower else word
        index: int = self.vocab.get(val, self.vocab["<unk>"])
        codes = self.codes[index]
        code_indices = np.array([idx * self.k + offset for idx, offset in enumerate(np.nditer(codes))])
        return np.sum(self.codebook[code_indices], axis=0)

if __name__ == '__main__':
    word2vec = CompressedEmbedding("word2vec_100_3.vocab.gz", "word2vec_100_3.compressed.npz")
    print(word2vec.vocab_vector("bierut"))

Download (GitHub)

Wikipedia2Vec

Wikipedia2Vec is a toolkit for learning joint representations of words and Wikipedia entities. We share Polish embeddings learned using a modified version of the library in which we added lemmatization and fixed some issues regarding parsing wiki dumps for languages other than English. Embedding models are available in sizes from 100 to 800 dimensions. A simple example:

from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("wiki2vec-plwiki-100.bin")
print(wiki2vec.most_similar(wiki2vec.get_entity("Bolesław Bierut")))
# (<Entity Bolesław Bierut>, 1.0), (<Word bierut>, 0.75790733), (<Word gomułka>, 0.7276504),
# (<Entity Krajowa Rada Narodowa>, 0.7081445), (<Entity Władysław Gomułka>, 0.7043667) [...]

Download embeddings: 100d, 300d, 500d, 800d.

Language models

ELMo

Embeddings from Language Models (ELMo) is a contextual embedding presented in Deep contextualized word representations by Peters et al. Sample usage with PyTorch below, for a more detailed instructions for integrating ELMo with your model please refer to the official repositories github.com/allenai/bilm-tf (Tensorflow) and github.com/allenai/allennlp (PyTorch).

from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder("options.json", "weights.hdf5")
print(elmo.embed_sentence(["Zażółcić", "gęślą", "jaźń"]))

Download (GitHub)

RoBERTa

Language model for Polish based on popular transformer architecture. We provide weights for improved BERT language model introduced in RoBERTa: A Robustly Optimized BERT Pretraining Approach. We provide two RoBERTa models for Polish - base and large model. A summary of pre-training parameters for each model is shown in the table below. We release two version of the each model: one in the Fairseq format and the other in the HuggingFace Transformers format. More information about the models can be found in a separate repository.

Model L / H / A* Batch size Update steps Corpus size Fairseq Transformers
RoBERTa (base) 12 / 768 / 12 8k 125k ~20GB v0.9.0 v3.4
RoBERTa‑v2 (base) 12 / 768 / 12 8k 400k ~20GB v0.10.1 v4.4
RoBERTa (large) 24 / 1024 / 16 30k 50k ~135GB v0.9.0 v3.4
RoBERTa‑v2 (large) 24 / 1024 / 16 2k 400k ~200GB v0.10.2 v4.14
DistilRoBERTa 6 / 768 / 12 1k 10ep. ~20GB n/a v4.13

* L - the number of encoder blocks, H - hidden size, A - the number of attention heads

Example in Fairseq:

import os
from fairseq.models.roberta import RobertaModel, RobertaHubInterface
from fairseq import hub_utils

model_path = "roberta_large_fairseq"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_path,
    data_name_or_path=model_path,
    bpe="sentencepiece",
    sentencepiece_vocab=os.path.join(model_path, "sentencepiece.bpe.model"),
    load_checkpoint_heads=True,
    archive_map=RobertaModel.hub_models(),
    cpu=True
)
roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])
roberta.eval()
roberta.fill_mask('Druga wojna światowa zakończyła się w <mask> roku.', topk=1)
roberta.fill_mask('Ludzie najbardziej boją się <mask>.', topk=1)
#[('Druga wojna światowa zakończyła się w 1945 roku.', 0.9345270991325378, ' 1945')]
#[('Ludzie najbardziej boją się śmierci.', 0.14140743017196655, ' śmierci')]

It is recommended to use the above models, but it is still possible to download our old model, trained on smaller batch size (2K) and smaller corpus (15GB).

BART

BART is a transformer-based sequence to sequence model trained with a denoising objective. Can be used for fine-tuning on prediction tasks, just like regular BERT, as well as various text generation tasks such as machine translation, summarization, paraphrasing etc. We provide a Polish version of BART base model, trained on a large corpus of texts extracted from Common Crawl (200+ GB). More information on the BART architecture can be found in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Example in HugginFace Transformers:

import os
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

model_dir = "bart_base_transformers"
tok = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BartForConditionalGeneration.from_pretrained(model_dir)
sent = "Druga<mask>światowa zakończyła się w<mask>roku kapitulacją hitlerowskich<mask>"
batch = tok(sent, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
print(tok.batch_decode(generated_ids, skip_special_tokens=True))
# ['Druga wojna światowa zakończyła się w 1945 roku kapitulacją hitlerowskich Niemiec.']

Download for Fairseq v0.10 or HuggingFace Transformers v4.0.

GPT-2

GPT-2 is a unidirectional transformer-based language model trained with an auto-regressive objective, originally introduced in the Language Models are Unsupervised Multitask Learners paper. The original English GPT-2 was released in four sizes differing by the number of parameters: Small (112M), Medium (345M), Large (774M), XL (1.5B).

Models for Huggingface Transformers

We provide Polish GPT-2 models for Huggingface Transformers. The models have been trained using Megatron-LM library and then converted to the Huggingface format. The released checkpoints support longer contexts than the original GPT-2 by OpenAI. Small and medium models support up to 2048 tokens, twice as many as GPT-2 models and the same as GPT-3. Large and XL models support up to 1536 tokens. Example in Transformers:

from transformers import pipeline

generator = pipeline("text-generation",  model="sdadas/polish-gpt2-medium")
results = generator("Policja skontrolowała trzeźwość kierowców",
  max_new_tokens=1024,  do_sample=True, repetition_penalty = 1.2,
  num_return_sequences=1, num_beams=1,  temperature=0.95,top_k=50, top_p=0.95
)
print(results[0].get("generated_text"))
# Policja skontrolowała trzeźwość kierowców. Teraz policjanci przypominają kierowcom o zachowaniu 
# bezpiecznej odległości i środkach ostrożności związanych z pandemią. - Kierujący po spożyciu 
# alkoholu są bardziej wyczuleni na innych uczestników ruchu drogowego oraz mają większą skłonność 
# do brawury i ryzykownego zachowania zwłaszcza wobec pieszych. Dodatkowo nie zawsze pamiętają oni 
# zasady obowiązujących u nas przepisów prawa regulujących kwestie dotyczące odpowiedzialności [...]

Small, Medium, Large, and XL models are available on the Huggingface Hub

Models for Fairseq

We provide Polish versions of the medium and large GPT-2 models trained using Fairseq library. Example in Fairseq:

import os
from fairseq import hub_utils
from fairseq.models.transformer_lm import TransformerLanguageModel

model_dir = "gpt2_medium_fairseq"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_dir,
    checkpoint_file="model.pt",
    data_name_or_path=model_dir,
    bpe="hf_byte_bpe",
    bpe_merges=os.path.join(model_dir, "merges.txt"),
    bpe_vocab=os.path.join(model_dir, "vocab.json"),
    load_checkpoint_heads=True,
    archive_map=TransformerLanguageModel.hub_models()
)
model = hub_utils.GeneratorHubInterface(loaded["args"], loaded["task"], loaded["models"])
model.eval()
result = model.sample(
    ["Policja skontrolowała trzeźwość kierowców"],
    beam=5, sampling=True, sampling_topk=50, sampling_topp=0.95,
    temperature=0.95, max_len_a=1, max_len_b=100, no_repeat_ngram_size=3
)
print(result[0])
# Policja skontrolowała trzeźwość kierowców pojazdów. Wszystko działo się na drodze gminnej, między Radwanowem 
# a Boguchowem. - Około godziny 12.30 do naszego komisariatu zgłosił się kierowca, którego zaniepokoiło 
# zachowanie kierującego w chwili wjazdu na tą drogę. Prawdopodobnie nie miał zapiętych pasów - informuje st. asp. 
# Anna Węgrzyniak z policji w Brzezinach. Okazało się, że kierujący był pod wpływem alkoholu. [...]

Download medium or large model for Fairseq v0.10.

Longformer

One of the main constraints of standard Transformer architectures is the limitation on the number of input tokens. There are several known models that allow processing of long documents, one of the popular ones being Longformer, introduced in the paper Longformer: The Long-Document Transformer. We provide base and large versions of Polish Longformer model. The models were initialized with Polish RoBERTa (v2) weights and then fine-tuned on a corpus of long documents, ranging from 1024 to 4096 tokens. Example in Huggingface Transformers:

from transformers import pipeline
fill_mask = pipeline('fill-mask', model='sdadas/polish-longformer-base-4096')
fill_mask('Stolica oraz największe miasto Francji to <mask>.')

Base and large models are available on the Huggingface Hub

Text encoders

The purpose of text encoders is to produce a fixed-length vector representation for chunks of text, such as sentences or paragraphs. These models are used in semantic search, question answering, document clustering, dataset augmentation, plagiarism detection, and other tasks which involve measuring semantic similarity or relatedness between text passages.

Paraphrase mining and semantic textual similarity

We share two models based on the Sentence-Transformers library, trained using distillation method described in the paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. A corpus of 100 million parallel Polish-English sentence pairs from the OPUS project was used to train the models. You can download them from the Hugginface Hub using the links below.

Student model Teacher model Download
polish-roberta-base-v2 paraphrase-distilroberta-base-v2 st-polish-paraphrase-from-distilroberta
polish-roberta-base-v2 paraphrase-mpnet-base-v2 st-polish-paraphrase-from-mpnet

A simple example in Sentence-Transformers library:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ["Bardzo lubię jeść słodycze.", "Uwielbiam zajadać się słodkościami."]
model = SentenceTransformer("sdadas/st-polish-paraphrase-from-mpnet")
results = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(results[0], results[1]))
# tensor([[0.9794]], device='cuda:0')

MMLW

MMLW (muszę mieć lepszą wiadomość) is a set of text encoders trained using multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs, which included both sentence and paragraph aligned translations. The encoders are available in Sentence-Transformers format. We used a two-step process to train the models. In the first step, the encoders were initialized with Polish RoBERTa and multilingual E5 checkpoints, and then distilled utilising English BGE as a teacher model. The resulting models from the distillation step can be used as general-purpose embeddings with applications in various tasks such as text similarity, document clustering, or fuzzy deduplication. The second step involved fine-tuning the obtained models on Polish MS MARCO dataset with contrastrive loss. The second stage models are adapted specifically for information retrieval tasks.

We provide a total of ten text encoders, five distilled and five fine-tuned for information retrieval. In the table below, we present the details of the released models.

Base models Stage 1: Distilled models Stage 2: Retrieval models
Student model Teacher model PL-MTEB
Score
Download PIRB
NDCG@10
Download
Encoders based on Polish RoBERTa
polish-roberta-base-v2 bge-base-en 61.05 mmlw-roberta-base 56.38 mmlw-retrieval-roberta-base
polish-roberta-large-v2 bge-large-en 63.23 mmlw-roberta-large 58.46 mmlw-retrieval-roberta-large
Encoders based on Multilingual E5
multilingual-e5-small bge-small-en 55.84 mmlw-e5-small 52.34 mmlw-retrieval-e5-small
multilingual-e5-base bge-base-en 59.71 mmlw-e5-base 56.09 mmlw-retrieval-e5-base
multilingual-e5-large bge-large-en 61.17 mmlw-e5-large 58.30 mmlw-retrieval-e5-large

Please note that the developed models require the use of specific prefixes and suffixes when encoding texts. For RoBERTa-based encoders, each query should be preceded by the prefix "zapytanie: ", and no prefix is needed for passages. For E5-based models, queries should be prefixed with "query: " and passages with "passage: ". An example of how to use the models:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "zapytanie: "      # "zapytanie: " for roberta, "query: " for e5
answer_prefix = ""                # empty for roberta, "passage: " for e5
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-base")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.

Machine translation models

This section includes pre-trained machine translation models.

Convolutional models for Fairseq

We provide Polish-English and English-Polish convolutional neural machine translation models trained using Fairseq sequence modeling toolkit. Both models were trained on a parallel corpus of more than 40 million sentence pairs taken from Opus collection. Example of usage (fairseq, sacremoses and subword-nmt python packages are required to run this example):

from fairseq.models import BaseFairseqModel

model_path = "/polish-english/"
model = BaseFairseqModel.from_pretrained(
    model_name_or_path=model_path,
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=model_path,
    tokenizer="moses",
    bpe="subword_nmt",
    bpe_codes="code",
    cpu=True
)
print(model.translate(sentence="Zespół astronomów odkrył w konstelacji Panny niezwykłą planetę.", beam=5))
# A team of astronomers discovered an extraordinary planet in the constellation of Virgo.

Polish-English convolutional model: Download (GitHub)
English-Polish convolutional model: Download (GitHub)

T5-based models

We share MT5 and Flan-T5 models fine-tuned for Polish-English and English-Polish translation. The models were trained on 70 million sentence pairs from OPUS. You can download them from the Hugginface Hub using the links below. An example of how to use the models:

from transformers import pipeline
generator = pipeline("translation", model="sdadas/flan-t5-base-translator-en-pl")
sentence = "A team of astronomers discovered an extraordinary planet in the constellation of Virgo."
print(generator(sentence, max_length=512))
# [{'translation_text': 'Zespół astronomów odkrył niezwykłą planetę w gwiazdozbiorze Panny.'}]

The following models are available on the Huggingface Hub: mt5-base-translator-en-pl, mt5-base-translator-pl-en, flan-t5-base-translator-en-pl

Fine-tuned models

ByT5-text-correction

A small multilingual utility model intended for simple text correction. It is designed to improve the quality of texts from the web, often lacking punctuation or proper word capitalization. The model was trained to perform three types of corrections: restoring punctuation in sentences, restoring word capitalization, and restoring diacritical marks for languages that include them.

The following languages are supported: Belarusian (be), Danish (da), German (de), Greek (el), English (en), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Swedish (sv), Ukrainian (uk). The model takes as input a sentence preceded by a language code prefix. For example:

from transformers import pipeline
generator = pipeline("text2text-generation", model="sdadas/byt5-text-correction")
sentences = [
    "<pl> ciekaw jestem na co licza onuce stawiajace na sykulskiego w nadziei na zwrot ku rosji",
    "<de> die frage die sich die europäer stellen müssen lautet ist es in unserem interesse die krise auf taiwan zu beschleunigen",
    "<ru> при своём рождении 26 августа 1910 года тереза получила имя агнес бояджиу"
]
generator(sentences, max_length=512)
# Ciekaw jestem na co liczą onuce stawiające na Sykulskiego w nadziei na zwrot ku Rosji.
# Die Frage, die sich die Europäer stellen müssen, lautet: Ist es in unserem Interesse, die Krise auf Taiwan zu beschleunigen?
# При своём рождении 26 августа 1910 года Тереза получила имя Агнес Бояджиу.

The model is available on the Huggingface Hub: byt5-text-correction

Text ranking models

We provide a set of text ranking models than can be used in the reranking phase of retrieval augmented generation (RAG) pipelines. Our goal was to build efficient models that combine high accuracy with relatively low computational complexity. We employed Polish RoBERTa language models, fine-tuning them for text ranking task on a large dataset consisting of 1.4 million queries and 10 million documents. The models were trained using two knowledge distillation methods: a standard technique based on mean squared loss (MSE) and the RankNet algorithm that enforces sorting lists of documents according to their relevance to the query. The RankNet metod has proven to be more effective. Below is a summary of the released models:

Model Parameters Training method PIRB
NDCG@10
polish-reranker-base-ranknet 124M RankNet 60.32
polish-reranker-large-ranknet 435M RankNet 62.65
polish-reranker-base-mse 124M MSE 57.50
polish-reranker-large-mse 435M MSE 60.27

The models can be used with sentence-transformers library:

from sentence_transformers import CrossEncoder
import torch.nn

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = CrossEncoder(
    "sdadas/polish-reranker-large-ranknet",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())

Dictionaries and lexicons

Polish, English and foreign person names

This lexicon contains 346 thousand forenames and lastnames labeled as Polish, English or Foreign (other) crawled from multiple Internet sources. Possible labels are: P-N (Polish forename), P-L (Polish lastname), E-N (English forename), E-L (English lastname), F (foreign / other). For each word, there is an additional flag indicating whether this name is also used as a common word in Polish (C for common, U for uncommon).

Download (GitHub)

Named entities extracted from SJP.PL

This dictionary consists mostly of the names of settlements, geographical regions, countries, continents and words derived from them (relational adjectives and inhabitant names). Besides that, it also contains names of popular brands, companies and common abbreviations of institutions' names. This resource was created in a semi-automatic way, by extracting the words and their forms from SJP.PL using a set of heuristic rules and then manually filtering out words that weren't named entities.

Download (GitHub)

Links to external resources

Repositories of linguistic tools and resources

Publicly available large Polish text corpora (> 1GB)

Models supporting Polish language

Sentence analysis (tokenization, lemmatization, POS tagging etc.)

  • SpaCy - A popular library for NLP in Python which includes Polish models for sentence analysis.
  • Stanza - A collection of neural NLP models for many languages from StndordNLP.
  • Trankit - A light-weight transformer-based python toolkit for multilingual natural language processing by the University of Oregon.
  • KRNNT and KFTT - Neural morphosyntactic taggers for Polish.
  • Morfeusz - A classic Polish morphosyntactic tagger.
  • Language Tool - Java-based open source proofreading software for many languages with sentence analysis tools included.
  • Stempel - Algorythmic stemmer for Polish.
  • PoLemma - plT5-based lemmatizer of named entities and multi-word expressions for Polish, available in small, base and large sizes.

Machine translation

Language models

Sentence encoders

Optical character recognition (OCR)

  • Easy OCR - Optical character recognition toolkit with pre-trained models for over 40 languages, including Polish.
  • Tesseract - Popular OCR software developed since 1980s, supporting over 100 languages. For integration with Python, wrappers such as PyTesseract or OCRMyPDF can be used.

Speech processing (speech recognition, text-to-speech, voice cloning etc.)

  • Quartznet - Nvidia NeMo (2021) - Nvidia NeMo is a toolkit for building conversational AI models. Apart from the framework itself, Nvidia also published many models trained using their code, which includes a speech recognition model for Polish based on Quartznet architecture.
  • XLS-R (2021) - XLS-R is a multilingual version of Wav2Vec 2.0 model by Meta AI, which is a large-scale pre-trained model for speech processing. The model is trained in a self-supervised way, so it needs to be fine-tuned for solving specific tasks such as ASR. Several fine-tuned checkpoints for Polish speech recognition exist on the HuggingFace Hub e.g. wav2vec2-large-xlsr-53-polish
  • M-CTC-T (2021) - A speech recognition model from Meta AI, supporting 60 languages including Polish. For more information see Pseudo-Labeling for Massively Multilingual Speech Recognition.
  • Whisper (2022) - Whisper is a model released by OpenAI for ASR and other speech-related tasks, supporting 82 languages. The model is available in five sizes: tiny (39M params), base (74M), small (244M), medium (769M), and large (1.5B). More information can be found in the paper Robust Speech Recognition via Large-Scale Weak Supervision.
  • MMS (2023) - Massively Multilingual Speech (MMS) is a large-scale multilingual speech foundation model published by Meta AI. Along with the pre-trained models, they also released checkpoints fine-tuned for specific tasks such as speech recognition, text to speech, and language identification. For more information see Scaling Speech Technology to 1,000+ Languages.
  • SeamlessM4T (2023) - Multilingual and multitask model trained on text and speech data. It covers almost 100 languages including Polish and can perform automatic speech recognition (ASR) as well as multimodal translation tasks across languages: speech-to-text, text-to-speech, speech-to-speech, text-to-text. See SeamlessM4T — Massively Multilingual & Multimodal Machine Translation for more details.
  • SONAR (2023) - Multilingual embeddings for speech and text with a set of additional models fine-tuned for specific tasks such as text translation, speech-to-text translation, or cross-lingual semantic similarity. See SONAR: Sentence-Level Multimodal and Language-Agnostic Representations for more details.
  • XTTS (2023) - A text-to-speech model that allows voice cloning by using just a 3-second audio sample of the target voice. Supports 13 languages including Polish.

Multimodal models