Releases: tsproisl/SoMeWeTa

v1.8.1

26 Oct 18:54

Prefer the 'fork' method for creating the worker processes for parallel tagging, if it is supported by the operating system. This is much faster than the 'spawn' method that is the default on some non-Linux systems (issue #14).
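
The preference described above can be sketched with Python's standard `multiprocessing` module. This is a minimal illustration, not SoMeWeTa's actual code: the names `preferred_context` and `word_count` are hypothetical, and `word_count` only stands in for real per-sentence tagging work.

```python
import multiprocessing


def preferred_context():
    """Return a 'fork' context where the OS supports it, else the platform default."""
    if "fork" in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context("fork")
    return multiprocessing.get_context()


def word_count(sentence):
    # Hypothetical stand-in for per-sentence work such as tagging.
    return len(sentence.split())


if __name__ == "__main__":
    ctx = preferred_context()
    # Workers are created with the start method chosen above.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(word_count, ["Das ist ein Test .", "Noch einer ."]))
```

With 'fork', worker processes inherit the parent's memory, so an already loaded model does not have to be pickled and sent to each worker; that is one plausible reason for the speed difference mentioned in the note.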

v1.8.0

03 Aug 07:26
  • Add option --use-nfkc to the command line interface and option use_nfkc to the constructor of ASPTagger (issue #11). If this option is used, the internal representation of the input data uses Unicode normalization form NFKC. This can be useful for social media input that misuses mathematical symbols for their typographic effects (e.g. “𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘” instead of “Impfausweis”).
  • Add option --sentence-tag to specify an XML tag in the input data that marks sentence boundaries (issue #12). This is particularly useful in combination with the --sentence-tag option of SoMaJo.
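
The effect of NFKC normalization on the example above can be reproduced with the standard library's `unicodedata` module. This sketch only shows the normalization step itself; how SoMeWeTa applies it internally is not shown in these notes.

```python
import unicodedata

# Mathematical Fraktur letters used for typographic effect.
styled = "𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘"

# NFKC replaces compatibility characters with their plain equivalents.
normalized = unicodedata.normalize("NFKC", styled)

print(normalized)
print(len(styled) == len(normalized))
```

Each styled letter is a single code point with a compatibility mapping to an ordinary Latin letter, so the normalized string has the same number of characters as the input.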

v1.7.3

18 Mar 13:30
  • Use less memory when loading a model if the ijson library is present and the Python version is at least 3.7 (at least 3.6 for CPython) (issue #9).
  • Restructured code for parallel tagging (issue #8).

v1.7.2

05 Mar 15:09
  • Bugfix: Do not choke on chunks of XML that do not contain actual word tokens (usually at the end of a file).
  • Update regular expressions for emojis, emoticons, numbers and URLs.

v1.7.1

07 Nov 10:32
  • Fixed an XML-related bug in the STTS_IBK_postprocessor script.
  • Fixed a minor bug in the emoticon regex.

v1.7.0

07 Nov 10:10
  • Added Reddit links and Reddit-specific emoticons.
  • Added a helper script for tagging multiple files (somewe-tagger-multifile).
  • Added a postprocessing script for some deterministic tagging decisions in STTS_IBK, e.g. URLs and emoticons (STTS_IBK_postprocessor).
  • Moved the command-line interface to cli.py.

v1.6.2

17 Oct 10:16
  • Sanity-check input: Warn if there are extremely long sentences (≥ 500 words) in the input as this might indicate missing sentence boundaries.
  • Fix a numpy DeprecationWarning (issue #5).
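
A sanity check of the kind described above could look roughly like the following. This is a hedged sketch, not the tagger's implementation: the function name `check_sentence_lengths`, the constant `MAX_SENTENCE_LENGTH` and the warning text are assumptions; only the threshold of 500 words comes from the release note.

```python
import warnings

MAX_SENTENCE_LENGTH = 500  # threshold named in the release note


def check_sentence_lengths(sentences, limit=MAX_SENTENCE_LENGTH):
    """Warn about every sentence with at least `limit` tokens.

    `sentences` is an iterable of token lists. Returns the indices of
    the sentences that triggered a warning.
    """
    too_long = []
    for index, sentence in enumerate(sentences):
        if len(sentence) >= limit:
            warnings.warn(
                f"Sentence {index} has {len(sentence)} tokens; "
                "sentence boundaries may be missing."
            )
            too_long.append(index)
    return too_long
```

Calling `check_sentence_lengths(corpus)` before tagging would flag input that was probably not split into sentences, without stopping the run.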

v1.6.1

02 Oct 14:41
  • New option -v/--version to output version information.
  • Explicitly specify input encoding as UTF-8.
  • Fixed a bug in progress display.

v1.6.0

02 Jul 11:43
  • New method tag_xml_sentence for simplified processing of SoMaJo's output for XML files.
  • Updated regular expressions for emojis (taken from SoMaJo).
  • Fixed a bug where SoMeWeTa could not be installed when numpy was not already installed.

v1.5.1

19 Jun 11:49

Fix issue #3 (FutureWarning about possible nested sets in regular expressions).
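
For context, Python 3.7 and later warn when a character class starts with an unescaped `[`, because that syntax may be given a different meaning in the future. The sketch below shows the usual remedy, escaping the bracket. The pattern is hypothetical; the release note does not say which expression in SoMeWeTa was affected.

```python
import re
import warnings

# Hypothetical pattern: a character class whose first member is a literal "[".
unescaped = r"[[(]-?:"
# Same pattern with the bracket escaped, which is unambiguous.
escaped = r"[\[(]-?:"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Expected to emit "FutureWarning: Possible nested set ..." on Python 3.7+.
    unescaped_re = re.compile(unescaped)
    # The escaped spelling should compile without that warning.
    escaped_re = re.compile(escaped)

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(len(future_warnings))

# Both spellings currently accept the same strings.
print(bool(unescaped_re.fullmatch("[-:")), bool(escaped_re.fullmatch("[-:")))
```

Escaping leaves the current matching behaviour unchanged and keeps the pattern valid should nested sets be introduced later.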