Awesome NLP Resources for Hungarian

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Tools

Notations:

👌 Easy to install and use
🚀 Commercial-friendly license
💯 Pretrained models are available or not needed
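
As a rough illustration of how these markers can be read programmatically, the sketch below splits a list entry into its name, flags and description. The helper `parse_entry` and the flag labels are hypothetical names chosen for this example only. It assumes an entry has the shape "name, markers, description", as in the lists below; an entry without markers is returned whole as the name.

```python
# Hypothetical helper: the function name and flag labels are illustrative only.
MARKERS = {
    "👌": "easy_to_use",
    "🚀": "commercial_friendly",
    "💯": "pretrained_or_not_needed",
}

def parse_entry(line):
    """Split 'name <markers> description' into (name, flags, description)."""
    positions = [i for i, ch in enumerate(line) if ch in MARKERS]
    if not positions:
        # No markers found: treat the whole line as the name.
        return line.strip(), set(), ""
    first, last = positions[0], positions[-1]
    flags = {MARKERS[ch] for ch in line[first:last + 1] if ch in MARKERS}
    return line[:first].strip(), flags, line[last + 1:].strip()

name, flags, description = parse_entry(
    "huntoken 👌🚀💯 Hungarian word and sentence splitter"
)
print(name, sorted(flags), description)
```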

Word tokenization, sentence splitting

huntoken 👌🚀💯 Hungarian word and sentence splitter
quntoken 👌🚀💯 New Hungarian tokenizer based on quex, huntoken

Morphology

emMorph (Humor) 💯 Hungarian morphological analyzer based on Humor
emMorphPy 👌💯 A wrapper, a lemmatizer and REST API implemented in Python for emMorph (Humor) Hungarian morphological analyzer
hunmorph 🚀💯 is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages.
hunmorph-foma 🚀💯 Hungarian morphological analyzer and generator based on hunmorph.
hunspell 👌🚀💯 is an open-source spell-checker, stemmer and morphological analyzer
lara-hungarian-nlp 👌🚀💯 LARA is a lightweight Python NLP library for ChatBots in Hungarian.
Lemmagen 👌🚀💯 project aims at providing a standardized open source multilingual platform for lemmatisation. (Python package for v3 | C# project for v3)
Simplemma 👌🚀💯 is a simple multilingual lemmatizer for Python

PoS / Morphological taggers

hunpos 👌🚀💯 Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants.
PurePos 👌🚀 Open source morphological tagger based on HunPos
purepos.py 👌🚀 Python wrapper for PurePos

Taggers / Chunkers

HunTag 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
HunTag3 👌🚀 Improved version of the original HunTag
SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English
DBpedia Spotlight 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. (Docker image)
emBERT 👌🚀💯 is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package.

Pipelines with Hungarian NLP components

magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian
magyarlanc_spark 👌💯 Spark wrapper for magyarlanc
eszterland 👌💯 Clojurized access to magyarlanc
HuSpaCy 👌🚀💯 Industrial-strength Hungarian Natural Language Processing
huNLP 👌💯 An experimental unified Java and REST API for magyarlanc and szegedNER
hunlp-GATE 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources
Trendminer Hungarian Processing Pipeline 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)
Google Syntaxnet 🚀💯 Neural Models of Syntax
UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications.
emtsv 👌💯 is a text processing system with inter-module communication via tsv + REST API
Stanza 👌🚀💯 is a Python NLP Library for Many Human Languages
spaCy StanfordNLP 👌🚀💯 wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline
trankit 👌🚀💯 A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Syntactic parsers

hunpars 🚀💯 A rule based Hungarian syntactical analyzer
HunParse 🚀💯 An NLTK-based parser using KR-style morphological annotation
Anagramma Parser A parser based on psycholinguistic principles
benepar 👌🚀💯 A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.

Semantic analysis

SentimentAnalysisHUN 👌🚀💯 is an open-source sentiment analysis tool for Hungarian language, written in Python.
hun-date-parser 👌🚀💯 A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text.
SZTAKI HunSum-1 models 👌🚀💯 mT5-small-HunSum-1, mT5-base-HunSum-1, Bert2Bert-HunSum-1
poltextLAB's models Emotion classification models using 6-label and 9-label codebooks.

Other

emLam 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling
pywnxml 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)
Hun-appointment-chatbot 👌🚀💯 A simple Hungarian chatbot for booking an appointment using the Rasa framework.
neural-punctuator 👌🚀💯 Automatic punctuation restoration with BERT models for English and Hungarian
hunaccent 👌🚀💯 Small Footprint Diacritic Restoration for Hungarian
Diacritics_restoration 🚀💯 Lightweight Diacritics Restoration with Dilated Convolutional Neural Networks
NYTK MT 👌🚀💯 NYTK Machine translation models
syntax-augmentation-nmt 🚀💯 Syntax-based data augmentation for Hungarian-English machine translation
anonymizer_hu 🚀💯 The Hungarian anonymization tool for CURLICAT

Language models

Word embeddings

fastText Wikipedia pre-trained word vectors for 90 languages, trained on Wikipedia using fastText.
fastText Common Crawl & Wikipedia pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model.
FastText_multilingual Multilingual word vectors in 78 languages
polyglot vectors polyglot embeddings trained on Wikipedia
wordvectors Pre-trained word2vec and fasttext word vectors on wikipedia of 30+ languages
hunembed0.0 A word2vec word embedding trained on the concatenation of the Hungarian Webcorpus and the Hungarian National Corpus in 600 dimensions with a cut-off of 10 words.
Szeged word vectors Word embeddings (word2vec & fasttext) for Hungarian trained on 4.3 billion tokens
questions-words-hu Hungarian analogical questions following Mikolov et al.
ConceptNet Numberbatch ConceptNet Numberbatch multi- and cross-lingual semantic word embeddings
Multi-sense word embeddings
BytePair Embeddings pretrained Subword Embeddings, downloadable in many formats
HuSpaCy 300d 300d Floret embeddings trained on the Hungarian Webcorpus 2.0
HuSpaCy 100d 100d Floret embeddings trained on the Hungarian Webcorpus 2.0
ELMo Representations Deep contextualized word representation trained for many languages
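
Analogy sets such as questions-words-hu are typically used to evaluate word vectors with the vector-offset method: for a question "a is to b as c is to ?", pick the word closest to b - a + c. The sketch below shows that idea on toy data. The 2-dimensional vectors are invented for illustration and come from no real model, and `solve_analogy` is a hypothetical helper name.

```python
import math

# Toy 2-d vectors, invented purely for illustration (not from any real model).
vectors = {
    "király": [0.9, 0.8],    # king
    "királynő": [0.9, 0.2],  # queen
    "férfi": [0.1, 0.8],     # man
    "nő": [0.1, 0.2],        # woman
    "alma": [0.5, 0.5],      # apple (distractor)
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "férfi is to király as nő is to ?"
answer = solve_analogy("férfi", "király", "nő", vectors)
print(answer)
```

With real embeddings, the same function can be run over every question in an analogy file and the share of correct answers reported as accuracy.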

Transformer models

huBERT Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia
HIL* Transformer models Pretrained transformer models provided by HILANCO
PULI-BERT-Large is a Hungarian BERT large model based on MegatronBERT

Large Language models

PULI-GPTrio is a Hungarian-English-Chinese trilingual GPT-NeoX model
SambaLingo-Hungarian-Base is a pretrained bilingual Hungarian and English model that adapts Llama-2-7b to Hungarian by training on 59 billion tokens from the Hungarian split of the Cultura-X dataset
SambaLingo-Hungarian-Chat is a human aligned chat model trained in Hungarian and English
PULI-GPT-2 is a Hungarian GPT-2 model
PULI-GPT-3SX is a Hungarian GPT-NeoX model (6.7 billion parameters)

LLM Benchmarks

HuLU evaluate is a library for evaluating and training language models on Hungarian tasks within the HuLU benchmark.
(M)MTEB is a dataset and leaderboard comparing 100+ text embedding models across 1000+ languages including Hungarian.

Datasets

Corpora

Raw corpora

Hungarian Webcorpus With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.
Hungarian Webcorpus 2.0 The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)
emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.
Leipzig corpora contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.
web2corpus Automatically created multilingual web corpus
CC-100 Monolingual Datasets from Web Crawl Data

Annotated corpora

CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec
OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
HunEmPoli corpus was built using pre-agenda speeches of the Hungarian National Assembly (2014-2018) and consists of 764008 tokens/36475 sentences. It carries aspect-level emotion annotation with 39840 identified emotions, and also marks the keywords that evoked each emotion.
The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.
Hungarian sentiment corpus (HuSent) is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage [http://divany.hu/]
Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.
Universal Dependencies
Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
KorKor Pilotcorpus is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations
NerKor is a gold standard named entity annotated corpus containing 1 million tokens.
NerKor 1.41e A 1M+-token Hungarian named entity dataset with ~30 entity types derived from NYTK-NerKor
hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
PrevCons is a database of 21K hapaxes of verbs with verbal prefixes
Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.
HuLU Hungarian Language Understanding Benchmark Kit
- HuCOLA Hungarian Corpus of Linguistic Acceptability
- HuCoPA Hungarian Choice of Plausible Alternatives Corpus
- HuCommitmentBank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator.
- HuSST Hungarian version of the Sentiment Treebank
- HuWNLI Anaphora resolution datasets for Hungarian as an inference task
- HuWS is the Hungarian set of the Winograd schemas
- HuRTE is the Hungarian version of the Recognizing Textual Entailment datasets
HuRC Hungarian Corpus for Reading Comprehension with Commonsense Reasoning
ELTE Poetry Corpus is a database of complete poems of 50 Hungarian canonical poets together with the sound devices of the poems and the grammatical features of words in XML format
ELTE Novel Corpus is a database of 400 Hungarian novels (with the annotation of structural units and the grammatical features of words in TEI XML format)
ELTE Drama Corpus is a database of 58 dramas (with the annotation of structural units and the grammatical features of words in TEI XML format)
HunSum-1 is a dataset containing over 1.1M unique news articles with lead and other metadata
HAPP is the Hungarian translation of the Definite Pronoun Resolution Dataset
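
Several of the resources above (Universal Dependencies, the UDPipe-annotated CoNLL 2017 data) are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The following is a minimal reader sketch using only the standard library. The lowercase field names are labels chosen here, and the sample sentence with its annotation is hand-written for illustration, not taken from any corpus.

```python
# Column order follows the CoNLL-U format; the lowercase labels are our own.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Yield sentences as lists of token dicts from CoNLL-U text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # skip sentence-level comments
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence

# Hand-written sample ("A kutya ugat." = "The dog barks."), illustrative only.
sample = "\n".join([
    "# text = A kutya ugat.",
    "1\tA\ta\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tkutya\tkutya\tNOUN\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_",
    "3\tugat\tugat\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

sentences = list(parse_conllu(sample))
lemmas = [tok["lemma"] for tok in sentences[0]]
print(lemmas)
```

This sketch does not treat multiword-token ranges or empty nodes specially; a dedicated CoNLL-U library is a better fit for real corpora.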

Parallel corpora

Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.
SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
HunOr A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.
CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl
CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian
Hungarian-Russian Prisoner of War Database
TED talks transcripts parallel corpus Sentence-aligned TED talks, including Hungarian.
TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database
Duolingo STAPLE is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian
PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian
OpenSubtitles Corpus contains movie subtitles and alignments for 62 languages, including Hungarian
[OPUS Corpus](https://opus.nlpl.eu) is a growing collection of translated texts from the web
MASSIVE dataset is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation.
PWS is a parallel collection of the Winograd schemas in seven languages (including Hungarian)
[HunSimpleNews](https://huggingface.co/datasets/ELTE-DH/HunSimpleNews) is the first Hungarian text simplification corpus that includes the standard and simplified versions of whole documents.
HunSum-1 is a Hungarian-language dataset containing over 1.1M unique news articles with lead and other metadata. The dataset contains articles from 9 major Hungarian news websites.
HunSum-2-abstractive and HunSum-2-extractive are Hungarian-language datasets containing over 1.8M unique news articles with lead and other metadata. The dataset contains articles from 27 major Hungarian news websites.
parallelbible The Parallel Bible Corpus is based on the historical text material of the Old Hungarian Corpus, as its database contains all of the Old and Middle Hungarian Bible translations which are available in this corpus. The King James Bible and three Finnish translations are included in the database as well.

Linguistic resources

morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
huwn Hungarian Wordnet
Hungarian Sentiment Lexicon The dictionaries were manually created on the basis of Wordnet-Affect lexicons.
poltextLAB's sentiment lexicons Highly accurate sentiment lexicons for analysing news data
4lang Concept dictionary using Eilenberg machines
Named Entity lists for Hungarian
Mazsola ISZ lists 500K verb frames extracted from the Mazsola database
Manocska merges verb frames from existing databases
PrevLex List of phrasal verbs
panmorph Tagsets and description of Hungarian morphological analysers.
hun_ner_checklist CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition

Linked Open Data

Wikipedia dumps
Wikidata dumps
DBPedia dumps
huwn.rdf Hungarian WordNet in RDF format for the Linked Open Data cloud
Conceptnet An open, multilingual knowledge graph (with partial Hungarian support)

Geo data

OpenStreetMap (OSM) Within Hungary use the name keys, elsewhere the *name:hu keys
Natural-earth-vector (name_hu imported from Wikidata labels)
Who's On First is a gazetteer of places (with Hungarian administrative places)

Speech related data

Other

alpaca_hu_2k is the Hungarian translation of a subset of the Stanford Alpaca prompts.
