Releases: tsproisl/SoMeWeTa
v1.8.1
Prefer the 'fork' method for creating the worker processes for parallel tagging, if it is supported by the operating system. This is much faster than the 'spawn' method that is the default on some non-Linux systems (issue #14).
v1.8.0
- Add option --use-nfkc to the command line interface and option use_nfkc to the constructor of ASPTagger (issue #11). If this option is used, the internal representation of the input data uses Unicode normalization form NFKC. This can be useful for social media input that misuses mathematical symbols for their typographic effects (e.g. “𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘” instead of “Impfausweis”).
- Add option --sentence-tag to specify an XML tag in the input data that marks sentence boundaries (issue #12). This is particularly useful in combination with the --sentence-tag option of SoMaJo.
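The effect of NFKC normalization mentioned for --use-nfkc can be reproduced with Python's standard `unicodedata` module. The sketch below only shows the normalization step on the example string from the release note. The helper name `to_nfkc` is hypothetical, and how SoMeWeTa applies the normalization internally is not shown here.

```python
import unicodedata


def to_nfkc(text):
    """Return text in Unicode normalization form NFKC."""
    return unicodedata.normalize("NFKC", text)


# Example from the release note: mathematical bold Fraktur letters
# used for typographic effect.
fancy = "𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘"
print(to_nfkc(fancy))
```

NFKC also folds other compatibility characters, for example ligatures such as "ﬁ" and full-width forms, which is presumably why the behaviour is opt-in rather than the default.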
v1.7.3
- Use less memory when loading a model if the ijson library is present and the Python version is at least 3.7 (at least 3.6 for CPython) (issue #9).
- Restructured code for parallel tagging (issue #8).
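The optional use of ijson noted above follows a common pattern: try the streaming parser and fall back to the standard `json` module when it is not installed. The sketch below illustrates that pattern on a top-level JSON array. The function name `iter_top_level_items` and the array layout are assumptions for illustration. The actual model format and loading code are not described in these notes.

```python
import io
import json

try:
    import ijson  # optional third-party streaming JSON parser
except ImportError:
    ijson = None


def iter_top_level_items(binary_stream):
    """Yield the elements of a top-level JSON array one at a time."""
    if ijson is not None:
        # Streaming parse: elements are built one by one, so the whole
        # document never has to be held in memory at once.
        yield from ijson.items(binary_stream, "item")
    else:
        # Fallback: parse the complete document, then iterate over it.
        yield from json.load(binary_stream)


if __name__ == "__main__":
    data = io.BytesIO(b'[1, "a", {"k": 2}]')
    for element in iter_top_level_items(data):
        print(element)
```

Note that ijson returns non-integer numbers as `decimal.Decimal` by default, so code that expects floats may need a conversion step.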
v1.7.2
- Bugfix: Do not choke on chunks of XML that do not contain actual word tokens (usually at the end of a file).
- Update regular expressions for emojis, emoticons, numbers and URLs.
v1.7.1
- Fixed an XML-related bug in the STTS_IBK_postprocessor script.
- Fixed a minor bug in the emoticon regex.
v1.7.0
- Added Reddit links and Reddit-specific emoticons
- Helper script for tagging multiple files (somewe-tagger-multifile)
- Postprocessing script for some deterministic tagging decisions in STTS_IBK, such as URLs and emoticons (STTS_IBK_postprocessor)
- Moved command-line interface to cli.py
v1.6.2
- Sanity-check input: Warn if there are extremely long sentences (≥ 500 words) in the input as this might indicate missing sentence boundaries.
- Fix a numpy DeprecationWarning (issue #5).
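The sentence-length sanity check described above could look roughly like the following. This is a sketch under assumptions, not the actual implementation: the function name `find_long_sentences`, the use of `logging`, and the representation of sentences as lists of tokens are all hypothetical. Only the threshold of 500 words comes from the release note.

```python
import logging

MAX_SENTENCE_LENGTH = 500  # threshold stated in the release note


def find_long_sentences(sentences, threshold=MAX_SENTENCE_LENGTH):
    """Warn about sentences with at least `threshold` tokens.

    Returns the indices of the offending sentences.
    """
    long_indices = []
    for index, sentence in enumerate(sentences):
        if len(sentence) >= threshold:
            logging.warning(
                "Sentence %d has %d tokens; sentence boundaries may be missing.",
                index, len(sentence),
            )
            long_indices.append(index)
    return long_indices


if __name__ == "__main__":
    corpus = [["Das", "ist", "kurz", "."], ["wort"] * 600]
    print(find_long_sentences(corpus))
```

The check only warns and does not reject the input, matching the wording of the note: an extremely long sentence is a hint that sentence splitting went wrong, not proof of it.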
v1.6.1
- New option -v/--version to output version information.
- Explicitly specify input encoding as UTF-8.
- Fixed a bug in progress display.
v1.6.0
- New method tag_xml_sentence for simplified processing of SoMaJo's output for XML files.
- Updated regular expressions for emojis (taken from SoMaJo).
- Fixed a bug where SoMeWeTa could not be installed when numpy was not already there.
v1.5.1
Fix issue #3 (FutureWarning about possible nested sets in regular expressions).