prepare release 0.9.1
adbar committed Jan 20, 2023
1 parent 1e3d0e1 commit 07612fa
Showing 3 changed files with 28 additions and 15 deletions.
11 changes: 11 additions & 0 deletions HISTORY.rst
@@ -3,6 +3,17 @@ History
=======


+0.9.1
+-----
+
+* smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
+* unsupervised approach to affixes activated by default for some languages
+* reviewed rules for English and German (less greedy)
+* added rules for Dutch, Finnish, Polish and Russian
+* improved Russian and Ukrainian language data (#3)
+* improved tokenizer
+
+
0.9.0
-----

30 changes: 16 additions & 14 deletions README.rst
@@ -97,7 +97,9 @@ Chaining several languages can improve coverage, they are used in sequence:
'spaghetto'
-For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the ``greedy`` parameter to ``True``. This option also triggers a stronger reduction through a further iteration of the search algorithm, e.g. "angekündigten" → "angekündigt" (standard) → "ankündigen" (greedy). In some cases it may be closer to stemming than to lemmatization.
+For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the ``greedy`` parameter to ``True``.
+
+This option also triggers a stronger reduction through a further iteration of the search algorithm, e.g. "angekündigten" → "angekündigt" (standard) → "ankündigen" (greedy). In some cases it may be closer to stemming than to lemmatization.
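
The two-step reduction described in this paragraph can be illustrated with a toy model. This is a minimal sketch and not simplemma's implementation: the ``TOY_DATA`` table and the ``toy_lemmatize`` helper are hypothetical, and the sketch assumes that the greedy mode simply runs one further lookup on the result of the first one.

```python
# Hypothetical form -> lemma table for illustration only;
# real language data is far larger and is not structured this simply.
TOY_DATA = {
    "angekündigten": "angekündigt",  # inflected form -> past participle
    "angekündigt": "ankündigen",     # past participle -> infinitive
}


def toy_lemmatize(token: str, greedy: bool = False) -> str:
    """Look the token up once, and once more when greedy is set."""
    # Standard mode: a single lookup, falling back to the token itself.
    candidate = TOY_DATA.get(token, token)
    if greedy:
        # Greedy mode (assumed here): one further iteration on the result.
        candidate = TOY_DATA.get(candidate, candidate)
    return candidate


print(toy_lemmatize("angekündigten"))               # standard: one step
print(toy_lemmatize("angekündigten", greedy=True))  # greedy: one further step
```

Because the second lookup may strip more than inflection, the greedy result can drift towards a stem rather than a lemma, which matches the caveat in the paragraph.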


.. code-block:: python
@@ -112,7 +114,7 @@ For certain languages a greedier decomposition is activated by default as it can
'angekündigt' # 1 step: reduction to past participle
-Additional function: ``is_known()`` checks if a given word is present in the language data:
+The additional function ``is_known()`` checks if a given word is present in the language data:
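
As a rough illustration of what such a membership check amounts to, here is a minimal sketch under stated assumptions. The ``TOY_LANGDATA`` tables and the ``toy_is_known`` helper are hypothetical and do not reproduce the library's internals; the sketch assumes that only listed word forms (the keys) count as known and that several language codes may be checked in sequence.

```python
from typing import Dict, Iterable

# Hypothetical per-language tables (form -> lemma); not simplemma's real data.
TOY_LANGDATA: Dict[str, Dict[str, str]] = {
    "de": {"angekündigten": "angekündigt"},
    "it": {"spaghetti": "spaghetto"},
}


def toy_is_known(token: str, langs: Iterable[str]) -> bool:
    """Return True if the token is a listed form in any of the given languages."""
    # Unknown language codes are treated as empty tables.
    return any(token in TOY_LANGDATA.get(code, {}) for code in langs)


print(toy_is_known("spaghetti", ["de", "it"]))  # found in the Italian table
print(toy_is_known("xyzzy", ["de", "it"]))      # listed in neither table
```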

.. code-block:: python
@@ -198,7 +200,7 @@ The following languages are available using their `BCP 47 language tag <https://


======= ==================== =========== ===== ========================================================================
-Available languages (2022-09-05)
+Available languages (2022-01-20)
-----------------------------------------------------------------------------------------------------------------------
Code Language Forms (10³) Acc. Comments
======= ==================== =========== ===== ========================================================================
@@ -208,14 +210,14 @@ Code Language Forms (10³) Acc. Comments
``cs`` Czech 187 0.89 on UD CS-PDT
``cy`` Welsh 360
``da`` Danish 554 0.92 on UD DA-DDT, alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
-``de`` German 682 0.95 on UD DE-GSD, see also `German-NLP list <https://github.com/adbar/German-NLP#Lemmatization>`_
-``el`` Greek 182 0.88 on UD EL-GDT
-``en`` English 136 0.94 on UD EN-GUM, alternative: `LemmInflect <https://github.com/bjascob/LemmInflect>`_
+``de`` German 675 0.95 on UD DE-GSD, see also `German-NLP list <https://github.com/adbar/German-NLP#Lemmatization>`_
+``el`` Greek 181 0.88 on UD EL-GDT
+``en`` English 131 0.94 on UD EN-GUM, alternative: `LemmInflect <https://github.com/bjascob/LemmInflect>`_
``enm`` Middle English 38
``es`` Spanish 665 0.95 on UD ES-GSD
``et`` Estonian 119 low coverage
-``fa`` Persian 9 experimental
-``fi`` Finnish 3,546 evaluation and alternatives: see `this benchmark <https://github.com/aajanki/finnish-pos-accuracy>`_
+``fa`` Persian 12 experimental
+``fi`` Finnish 3,199 see `this benchmark <https://github.com/aajanki/finnish-pos-accuracy>`_
``fr`` French 217 0.94 on UD FR-GSD
``ga`` Irish 372
``gd`` Gaelic 48
@@ -236,21 +238,21 @@ Code Language Forms (10³) Acc. Comments
``mk`` Macedonian 56
``ms`` Malay 14
``nb`` Norwegian (Bokmål) 617
-``nl`` Dutch 254 0.92 on UD-NL-Alpino
+``nl`` Dutch 250 0.92 on UD-NL-Alpino
``nn`` Norwegian (Nynorsk) 56
-``pl`` Polish 3,427 0.91 on UD-PL-PDB
+``pl`` Polish 3,211 0.91 on UD-PL-PDB
``pt`` Portuguese 924 0.92 on UD-PT-GSD
``ro`` Romanian 311
-``ru`` Russian 607 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
-``se`` Northern Sámi 113 experimental
+``ru`` Russian 595 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
+``se`` Northern Sámi 113
``sk`` Slovak 818 0.92 on UD SK-SNK
``sl`` Slovene 136
``sq`` Albanian 35
``sv`` Swedish 658 alternative: `lemmy <https://github.com/sorenlind/lemmy>`_
``sw`` Swahili 10 experimental
-``tl`` Tagalog 33 experimental
+``tl`` Tagalog 32 experimental
``tr`` Turkish 1,232 0.89 on UD-TR-Boun
-``uk`` Ukrainian 190 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
+``uk`` Ukrainian 370 alternative: `pymorphy2 <https://github.com/kmike/pymorphy2/>`_
======= ==================== =========== ===== ========================================================================


2 changes: 1 addition & 1 deletion simplemma/__init__.py
@@ -4,7 +4,7 @@
__author__ = "Adrien Barbaresi"
__email__ = "barbaresi@bbaw.de"
__license__ = "MIT"
-__version__ = "0.9.0"
+__version__ = "0.9.1"


from .langdetect import in_target_language, lang_detector