Releases: adbar/simplemma

simplemma-1.1.2

19 Nov 16:48
be7435e
  • Fix cyclic import by @juanjoDiaz (#148)
  • Fix language detector proportion_in_each_language results by @juanjoDiaz (#150)
  • Init: use explicit re-exports (#151)
  • Fix data written by dictionary pickler by @Dunedan (#156)
  • Add demo rules for Latvian and Estonian (#154, #157)
  • Remove deprecated langdetect submodule (#160)
  • Test: remove dummy pickled data (#161)
  • Language data: upgrade pickle to v5 (#162)
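
As an illustration of the pickle upgrade, the following is a minimal sketch of writing and reading a dictionary with pickle protocol 5 (available from Python 3.8). It assumes the language data is an LZMA-compressed pickle of a bytes-to-bytes mapping; the function names, file name, and sample entries are hypothetical and this is not the project's actual pickler.

```python
import lzma
import pickle
import tempfile
from pathlib import Path


def write_dictionary(data, path):
    # Hypothetical helper: serialize with pickle protocol 5, compressed with LZMA.
    with lzma.open(path, "wb") as handle:
        pickle.dump(data, handle, protocol=5)


def read_dictionary(path):
    # Hypothetical helper: the protocol version is detected automatically on load.
    with lzma.open(path, "rb") as handle:
        return pickle.load(handle)


# Made-up sample entries (inflected form -> lemma), stored as bytes.
sample = {b"houses": b"house", b"mice": b"mouse"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.plzma"
    write_dictionary(sample, target)
    restored = read_dictionary(target)
```

Files written with protocol 5 cannot be read by Python versions older than 3.8, which is consistent with the removal of Python 3.6 and 3.7 support in 1.1.0.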

simplemma-1.1.1

08 Aug 12:21
4c775a3
  • Fix ModuleNotFoundError and test optional dependencies (#142)
  • Simplify code and add missing type annotations (#144)

simplemma-1.1.0

06 Aug 18:05
67f6e00
  • Add a memory-efficient dictionary factory backed by MARISA-tries by @Dunedan in #133
  • Drop support for Python 3.6 & 3.7 by @Dunedan in #134
  • Update setup files (#138)

simplemma-1.0.0

31 May 10:21
6860df6

Extensive refactoring by @juanjoDiaz:

  • Series of modular classes
  • Different lemmatization strategies available
  • Customization of dictionary loading and handling (DictionaryFactory)
  • LanguageDetector class with extended options
  • See readme and detailed documentation

Breaking changes:

  • The extensive argument has been renamed to greedy
  • The langdetect submodule is now language_detector
    from simplemma.langdetect import ... → from simplemma.language_detector import ...
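
Code that has to run against releases on both sides of this rename could resolve the module name at import time. This is only a sketch: the helper name is hypothetical, and it returns None when neither module can be imported (for example when simplemma is not installed).

```python
import importlib


def load_detector_module():
    # Hypothetical helper: try the new module name first, then the old one.
    for name in ("simplemma.language_detector", "simplemma.langdetect"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


detector = load_detector_module()
```

Projects that only target 1.0.0 or later can skip the fallback and import from simplemma.language_detector directly.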

Fixes and improvements:

  • is_known() function now restored to its state in v0.9.0 (full dictionary)
  • More languages and better rules (with @juanjoDiaz)
  • Use binary strings in dictionaries to save memory
  • Dictionary sort before compression by @1over137
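
To give a rough idea of why binary strings save memory, the sketch below compares the in-memory size of a few words stored as str and as UTF-8 encoded bytes. The sample words are made up, and the numbers reported by sys.getsizeof are specific to CPython and vary between versions.

```python
import sys

# Made-up sample words covering ASCII, Latin-1 and Cyrillic characters.
words = ["houses", "maßnahmen", "путешествие"]

# Map each word to (size as str, size as UTF-8 bytes), both in bytes of memory.
sizes = {
    word: (sys.getsizeof(word), sys.getsizeof(word.encode("utf-8")))
    for word in words
}

for word, (as_text, as_bytes) in sizes.items():
    print(f"{word}: str={as_text}, bytes={as_bytes}")
```

The saving per entry is small, but it adds up over dictionaries with a large number of keys and values.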

Documentation:

  • Classes and general doc pages by @juanjoDiaz
  • Section on classes in the readme by @osma

simplemma-0.9.1

20 Jan 17:07

What's Changed

  • smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
  • unsupervised approach to affixes activated by default for some languages
  • reviewed rules for English and German (less greedy)
  • added rules for Dutch, Finnish, Polish and Russian
  • improved Russian and Ukrainian language data (#3)
  • improved tokenizer

Full Changelog: v0.9.0...v0.9.1

simplemma-0.9.0

18 Oct 11:46
  • smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
  • added support for Asturian (ast, #20)
  • bug fixes (#18, #26)

simplemma-0.8.2

05 Sep 14:11
  • languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
  • fix for slow language detection introduced in 0.7.0

Full Changelog: v0.8.1...v0.8.2

simplemma-0.8.1

01 Sep 12:14
  • better rules for English and German
  • inconsistencies fixed for cy, de, en, ga, sv (#16)
  • docs: added language detection and citation info

Full Changelog: v0.8.0...v0.8.1

simplemma-0.8.0

02 Aug 15:37
  • code fully type checked, optional pre-compilation with mypyc
  • fixes: logging error (#11), input type (#12)
  • code style: black

Full Changelog: v0.7.0...v0.8.0

simplemma-0.7.0

16 Jun 09:52
  • breaking change: language data is now pre-loaded internally and language codes are passed directly in the lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")
  • faster lemmatization and result cache
  • sentence-aware text_lemmatizer()
  • optional iterators for tokenization and lemmatization

Full Changelog: v0.6.0...v0.7.0