From 07612fa26cfc3b87617235e34b0707a75df7f9e3 Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Fri, 20 Jan 2023 18:03:48 +0100
Subject: [PATCH] prepare release 0.9.1

---
 HISTORY.rst           | 11 +++++++++++
 README.rst            | 30 ++++++++++++++++--------------
 simplemma/__init__.py |  2 +-
 3 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/HISTORY.rst b/HISTORY.rst
index 2e16550..35ccc91 100644
--- a/HISTORY.rst
+++ b/HISTORY.rst
@@ -3,6 +3,17 @@ History
 =======
 
 
+0.9.1
+-----
+
+* smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
+* unsupervised approach to affixes activated by default for some languages
+* reviewed rules for English and German (less greedy)
+* added rules for Dutch, Finnish, Polish and Russian
+* improved Russian and Ukrainian language data (#3)
+* improved tokenizer
+
+
 0.9.0
 -----
 
diff --git a/README.rst b/README.rst
index 9d365de..9fe9916 100644
--- a/README.rst
+++ b/README.rst
@@ -97,7 +97,9 @@ Chaining several languages can improve coverage, they are used in sequence:
 'spaghetto'
 
 
-For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the ``greedy`` parameter to ``True``. This option also triggers a stronger reduction through a further iteration of the search algorithm, e.g. "angekündigten" → "angekündigt" (standard) → "ankündigen" (greedy). In some cases it may be closer to stemming than to lemmatization.
+For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the ``greedy`` parameter to ``True``.
+
+This option also triggers a stronger reduction through a further iteration of the search algorithm, e.g. "angekündigten" → "angekündigt" (standard) → "ankündigen" (greedy). In some cases it may be closer to stemming than to lemmatization.
 
 .. code-block:: python
 
@@ -112,7 +114,7 @@ For certain languages a greedier decomposition is activated by default as it can
 'angekündigt' # 1 step: reduction to past participle
 
 
-Additional function: ``is_known()`` checks if a given word is present in the language data:
+The additional function ``is_known()`` checks if a given word is present in the language data:
 
 .. code-block:: python
 
@@ -198,7 +200,7 @@ The following languages are available using their `BCP 47 language tag `_
-``de``  German               682         0.95  on UD DE-GSD, see also `German-NLP list `_
-``el``  Greek                182         0.88  on UD EL-GDT
-``en``  English              136         0.94  on UD EN-GUM, alternative: `LemmInflect `_
+``de``  German               675         0.95  on UD DE-GSD, see also `German-NLP list `_
+``el``  Greek                181         0.88  on UD EL-GDT
+``en``  English              131         0.94  on UD EN-GUM, alternative: `LemmInflect `_
 ``enm`` Middle English       38
 ``es``  Spanish              665         0.95  on UD ES-GSD
 ``et``  Estonian             119               low coverage
-``fa``  Persian              9                 experimental
-``fi``  Finnish              3,546             evaluation and alternatives: see `this benchmark `_
+``fa``  Persian              12                experimental
+``fi``  Finnish              3,199             see `this benchmark `_
 ``fr``  French               217         0.94  on UD FR-GSD
 ``ga``  Irish                372
 ``gd``  Gaelic               48
@@ -236,21 +238,21 @@ Code    Language             Forms (10³) Acc.  Comments
 ``mk``  Macedonian           56
 ``ms``  Malay                14
 ``nb``  Norwegian (Bokmål)   617
-``nl``  Dutch                254         0.92  on UD-NL-Alpino
+``nl``  Dutch                250         0.92  on UD-NL-Alpino
 ``nn``  Norwegian (Nynorsk)  56
-``pl``  Polish               3,427       0.91  on UD-PL-PDB
+``pl``  Polish               3,211       0.91  on UD-PL-PDB
 ``pt``  Portuguese           924         0.92  on UD-PT-GSD
 ``ro``  Romanian             311
-``ru``  Russian              607               alternative: `pymorphy2 `_
-``se``  Northern Sámi        113               experimental
+``ru``  Russian              595               alternative: `pymorphy2 `_
+``se``  Northern Sámi        113
 ``sk``  Slovak               818         0.92  on UD SK-SNK
 ``sl``  Slovene              136
 ``sq``  Albanian             35
 ``sv``  Swedish              658               alternative: `lemmy `_
 ``sw``  Swahili              10                experimental
-``tl``  Tagalog              33                experimental
+``tl``  Tagalog              32                experimental
 ``tr``  Turkish              1,232       0.89  on UD-TR-Boun
-``uk``  Ukrainian            190               alternative: `pymorphy2 `_
+``uk``  Ukrainian            370               alternative: `pymorphy2 `_
 ======= ==================== =========== ===== ========================================================================
 
 
diff --git a/simplemma/__init__.py b/simplemma/__init__.py
index f7f8232..f6c2b72 100644
--- a/simplemma/__init__.py
+++ b/simplemma/__init__.py
@@ -4,7 +4,7 @@
 __author__ = "Adrien Barbaresi"
 __email__ = "barbaresi@bbaw.de"
 __license__ = "MIT"
-__version__ = "0.9.0"
+__version__ = "0.9.1"
 
 
 from .langdetect import in_target_language, lang_detector