Releases · tsproisl/SoMeWeTa

26 Oct 18:54

tsproisl

v1.8.1

3b8ac72

v1.8.1 Latest

Prefer the 'fork' method for creating the worker processes for parallel tagging, if it is supported by the operating system. This is much faster than the 'spawn' method that is the default on some non-Linux systems (issue #14).
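The idea of preferring 'fork' where it is available can be sketched with the standard multiprocessing module. This is a minimal illustration, not SoMeWeTa's actual code; the names pick_context and square are hypothetical.

```python
import multiprocessing as mp


def pick_context():
    """Return a 'fork' context if the OS supports it, else the platform default."""
    if "fork" in mp.get_all_start_methods():
        return mp.get_context("fork")
    # Passing None yields the default context (e.g. 'spawn' on Windows).
    return mp.get_context(None)


def square(x):
    # Stand-in for the per-sentence work a tagger would do.
    return x * x


if __name__ == "__main__":
    ctx = pick_context()
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, range(5)))
```

With 'fork', worker processes inherit the parent's memory, so a large loaded model does not have to be pickled and sent to each worker, which is the usual reason it starts faster than 'spawn'.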

03 Aug 07:26

tsproisl

v1.8.0

3826427

v1.8.0

Add option --use-nfkc to the command line interface and option use_nfkc to the constructor of ASPTagger (issue #11). If this option is used, the internal representation of the input data uses Unicode normalization form NFKC. This can be useful for social media input that misuses mathematical symbols for their typographic effects (e.g. “𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘” instead of “Impfausweis”).
Add option --sentence-tag to specify an XML tag in the input data that marks sentence boundaries (issue #12). This is particularly useful in combination with the --sentence-tag option of SoMaJo.
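The effect of NFKC normalization on the kind of input described for --use-nfkc can be reproduced with Python's unicodedata module. This is a minimal sketch that does not call SoMeWeTa; the helper to_bold_fraktur is hypothetical and only serves to build styled input from the Mathematical Bold Fraktur code points.

```python
import unicodedata

BOLD_FRAKTUR_CAPITAL_A = 0x1D56C  # MATHEMATICAL BOLD FRAKTUR CAPITAL A
BOLD_FRAKTUR_SMALL_A = 0x1D586    # MATHEMATICAL BOLD FRAKTUR SMALL A


def to_bold_fraktur(text):
    """Map ASCII letters to Mathematical Bold Fraktur symbols (illustration only)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_FRAKTUR_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_FRAKTUR_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def normalize_nfkc(text):
    """Apply Unicode normalization form NFKC."""
    return unicodedata.normalize("NFKC", text)


styled = to_bold_fraktur("Impfausweis")
print(normalize_nfkc(styled))  # prints "Impfausweis"
```

NFKC replaces compatibility characters such as the mathematical alphanumeric symbols with their plain equivalents, so a lexicon-based tagger sees the ordinary word.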

18 Mar 13:30

tsproisl

v1.7.3

7ecfd8d

v1.7.3

Use less memory when loading a model if the ijson library is present and the Python version is at least 3.7 (at least 3.6 for CPython) (issue #9).
Restructured code for parallel tagging (issue #8).
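The optional use of ijson for lower memory consumption can be sketched as a guarded import with a fallback to the standard json module. This is an assumption-laden illustration: the function iter_records and the file layout (a top-level JSON array) are hypothetical and not SoMeWeTa's model format, and the Python version check mentioned above is not modelled.

```python
import json

try:
    import ijson  # optional dependency: streaming JSON parser
except ImportError:
    ijson = None


def iter_records(path):
    """Yield the elements of a top-level JSON array stored at `path`."""
    if ijson is not None:
        # Stream one array element at a time instead of parsing everything at once.
        with open(path, "rb") as f:
            yield from ijson.items(f, "item")
    else:
        # Fallback: parse the whole document into memory first.
        with open(path, encoding="utf-8") as f:
            yield from json.load(f)
```

With the streaming branch, only one element needs to be held as a Python object at a time, whereas json.load builds the complete structure before returning.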

05 Mar 15:09

tsproisl

v1.7.2

0578103

v1.7.2

Bugfix: Do not choke on chunks of XML that do not contain actual word tokens (usually at the end of a file).
Update regular expressions for emojis, emoticons, numbers and URLs.

07 Nov 10:32

tsproisl

v1.7.1

774363b

v1.7.1

Fixed an XML-related bug in the STTS_IBK_postprocessor script.
Fixed a minor bug in the emoticon regex.

07 Nov 10:10

tsproisl

v1.7.0

8e014ef

v1.7.0

Added support for Reddit links and Reddit-specific emoticons.
Added a helper script for tagging multiple files (somewe-tagger-multifile).
Added a postprocessing script for some deterministic tagging decisions in STTS_IBK, e.g. URLs and emoticons (STTS_IBK_postprocessor).
Moved the command-line interface to cli.py.

17 Oct 10:16

tsproisl

v1.6.2

12b9a71

v1.6.2

Sanity-check input: Warn if there are extremely long sentences (≥ 500 words) in the input as this might indicate missing sentence boundaries.
Fix a numpy DeprecationWarning (issue #5).
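A sanity check of the kind described above can be sketched in a few lines. This is a hypothetical illustration, not SoMeWeTa's implementation; the function name find_long_sentences and the representation of sentences as lists of tokens are assumptions, while the threshold of 500 words comes from the release note.

```python
import warnings

MAX_SENTENCE_LENGTH = 500  # threshold named in the release note


def find_long_sentences(sentences, threshold=MAX_SENTENCE_LENGTH):
    """Warn about sentences with at least `threshold` tokens and return their indices."""
    long_indices = [i for i, tokens in enumerate(sentences) if len(tokens) >= threshold]
    for i in long_indices:
        warnings.warn(
            f"Sentence {i} has {len(sentences[i])} tokens; "
            "sentence boundaries might be missing in the input."
        )
    return long_indices
```

For example, find_long_sentences([["a"] * 3, ["b"] * 500]) warns about the second sentence and returns [1].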

02 Oct 14:41

tsproisl

v1.6.1

3d04da2

v1.6.1

New option -v/--version to output version information.
Explicitly specify input encoding as UTF-8.
Fixed a bug in progress display.

02 Jul 11:43

tsproisl

v1.6.0

f28bf7c

v1.6.0

New method tag_xml_sentence for simplified processing of SoMaJo's output for XML files.
Updated regular expressions for emojis (taken from SoMaJo).
Fixed a bug where SoMeWeTa could not be installed when numpy was not already there.

19 Jun 11:49

tsproisl

v1.5.1

dd039bc

v1.5.1

Fix issue #3 (FutureWarning about possible nested sets in regular expressions).
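The warning itself is easy to reproduce: since Python 3.7, the re module emits a FutureWarning when a character set starts with an unescaped "[", because that syntax is reserved for possible nested sets. The sketch below shows the warning and the usual fix of escaping the bracket; the patterns are illustrative and are not the ones used in SoMeWeTa, and the helper compiles_without_warning is hypothetical.

```python
import re
import warnings


def compiles_without_warning(pattern):
    """Return True if compiling `pattern` raises no FutureWarning."""
    re.purge()  # clear the pattern cache so the pattern is actually re-parsed
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return not any(issubclass(w.category, FutureWarning) for w in caught)


print(compiles_without_warning(r"[[a]"))   # unescaped "[" at the start of a set
print(compiles_without_warning(r"[\[a]"))  # escaped bracket, same matches
```

Both patterns match the same characters ("[" or "a"); only the escaped form is unambiguous.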
