As most of us have by now written a module that tokenizes text in some way, I think it's time to talk about creating 'the' tokenizer kiara module to rule them all.
To start, here are links with currently existing modules that tokenize in some way:
I guess the main question we have to answer is: do we want several of those modules to exist in parallel, or should we try to merge all of them into a single module that gives the user an input option to select which tokenizing method to use? That option would probably default to 'tokenize-by-word', since my guess is that this would work for most scenarios, at least as long as we are talking about western languages. Personally, I would strongly favor the single module. Not sure how to name those options in the end, but these could be the initial methods we provide (a rough dispatch sketch follows the list):
tokenize-by-word
tokenize-by-sentence
tokenize-by-japanese-characters
segment-text-and-tokenize-by-segment (for asian languages, incl. Japanese)
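To make the method-selection idea concrete, here is a minimal sketch of what the internal dispatch could look like. This is not kiara API, just plain Python: it assumes NLTK (with the 'punkt' data downloaded) for the word and sentence methods, treats the Japanese-characters method as a naive per-character split, and leaves the segmentation method as a stub since we'd still have to pick a segmentation library:

```python
from nltk.tokenize import sent_tokenize, word_tokenize


def tokenize(text: str, tokenize_method: str = "tokenize-by-word") -> list[str]:
    """Tokenize a single text with the selected method."""
    if tokenize_method == "tokenize-by-word":
        return word_tokenize(text)
    elif tokenize_method == "tokenize-by-sentence":
        return sent_tokenize(text)
    elif tokenize_method == "tokenize-by-japanese-characters":
        # naive placeholder: one token per character, skipping whitespace
        return [char for char in text if not char.isspace()]
    elif tokenize_method == "segment-text-and-tokenize-by-segment":
        raise NotImplementedError("needs a segmentation library, t.b.d.")
    else:
        raise ValueError(f"unknown tokenize method: '{tokenize_method}'")
```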
The input schema for this module would initially be:
table: a table that has 'corpus' qualities (whatever that means, we still have to determine and implement this part in kiara)
tokenize_method: a string that specifies the method to use (default: tokenize-by-word)
There might also be some config options for how exactly to tokenize that may depend on the method used, but I think we can ignore that for now.
The output schema would be an Array (single-column table) with the tokenized texts (which can later be merged with the original input table if necessary).
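As a sketch of that table-in/array-out flow, reusing the tokenize function from the sketch above: this assumes the table value is backed by a pyarrow.Table and that the text lives in a column named 'content' (both are assumptions, since the 'corpus' qualities aren't pinned down yet):

```python
import pyarrow as pa


def tokenize_table(table: pa.Table, tokenize_method: str = "tokenize-by-word") -> pa.Array:
    """Apply the selected tokenizer to every text in the (hypothetical) 'content' column."""
    texts = table.column("content").to_pylist()
    tokenized = [tokenize(text, tokenize_method) for text in texts]
    # one token list per input row; this array could later be appended to
    # the original table, e.g. via table.append_column(...)
    return pa.array(tokenized, type=pa.list_(pa.string()))
```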
One thing to consider is that once a module is officially released, we can't change its interface in a way that is not backward compatible. This would still work in the case of a single module, though, because we would only ever add new tokenize methods, never remove any. For example, Lena mentioned there are ways to tokenize using neural networks; we could add those over time.
One advantage of having a single module would be usability: imagine a UI that lets you specify your corpus table, and select different methods to tokenize, and you can quickly change between them, and see which one fits best for your case.
Another thing to consider is single texts. We would probably want our module to be configurable so it can also be used on a single text (string) instead of a table column with multiple texts. So we'd have a config option in the module to control that (sketched below).
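A hedged sketch of that config option, again reusing the tokenize function from above; the flag name 'single_text' is made up for illustration:

```python
def run_tokenizer(data, tokenize_method="tokenize-by-word", single_text=False):
    """Dispatch on the (hypothetical) 'single_text' config flag."""
    if single_text:
        # 'data' is one string -> one token list
        return tokenize(data, tokenize_method)
    # 'data' is an iterable of strings (e.g. a table column) -> list of token lists
    return [tokenize(text, tokenize_method) for text in data]
```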