As most of us have by now written a module that tokenizes text in some way, I think it's time to talk about creating 'the' tokenizer kiara module to rule them all.
To start, here are links with currently existing modules that tokenize in some way:
I guess the main question we have to answer is: do we want several of those modules to exist in parallel, or should we try to merge all of them into a single module that gives the user an input option to select which tokenizing method to use? That option would probably default to 'tokenize-by-word', since my guess is that this would work for most scenarios, at least as long as we are talking about western languages. Personally, I would strongly favor the single module. Not sure how to name those options in the end, but these could be the initial methods we provide (a rough dispatch sketch follows the list):
tokenize-by-word
tokenize-by-sentence
tokenize-by-japanese-characters
segment-text-and-tokenize-by-segment (for asian languages, incl. Japanese)
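To make the method-selection idea concrete, here is a minimal sketch of what the internal dispatch could look like. This is not kiara API, just plain Python: it assumes NLTK (with the 'punkt' data downloaded) for the word and sentence methods, treats the Japanese-characters method as a naive per-character split, and leaves the segmentation method as a stub since we'd still have to pick a segmentation library:

```python
from nltk.tokenize import sent_tokenize, word_tokenize


def tokenize(text: str, tokenize_method: str = "tokenize-by-word") -> list[str]:
    """Tokenize a single text with the selected method."""
    if tokenize_method == "tokenize-by-word":
        return word_tokenize(text)
    elif tokenize_method == "tokenize-by-sentence":
        return sent_tokenize(text)
    elif tokenize_method == "tokenize-by-japanese-characters":
        # naive placeholder: one token per character, skipping whitespace
        return [char for char in text if not char.isspace()]
    elif tokenize_method == "segment-text-and-tokenize-by-segment":
        raise NotImplementedError("needs a segmentation library, t.b.d.")
    else:
        raise ValueError(f"unknown tokenize method: '{tokenize_method}'")
```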
The input schema for this module would initially be:
table: a table that has 'corpus' qualities (whatever that means, we still have to determine and implement this part in kiara)
tokenize_method: a string that specifies the method to use (default: tokenize-by-word)
There might also be some config options for how exactly to tokenize that may depend on the method used, but I think we can ignore that for now.
The output schema would be an Array (single-column table) with the tokenized texts (which can later be merged with the original input table if necessary).
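As a sketch of that table-in/array-out flow, reusing the tokenize function from the sketch above: this assumes the table value is backed by a pyarrow.Table and that the text lives in a column named 'content' (both are assumptions, since the 'corpus' qualities aren't pinned down yet):

```python
import pyarrow as pa


def tokenize_table(table: pa.Table, tokenize_method: str = "tokenize-by-word") -> pa.Array:
    """Apply the selected tokenizer to every text in the (hypothetical) 'content' column."""
    texts = table.column("content").to_pylist()
    tokenized = [tokenize(text, tokenize_method) for text in texts]
    # one token list per input row; this array could later be appended to
    # the original table, e.g. via table.append_column(...)
    return pa.array(tokenized, type=pa.list_(pa.string()))
```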
One thing to consider is that once a module is officially released, we can't change its interface in a way that is not backward compatible. This would still work in the case of a single module, though, because we would only ever add new tokenize methods, never remove any. For example, Lena mentioned there are ways to tokenize using neural networks; we could add those over time.
One advantage of having a single module would be usability: imagine a UI that lets you specify your corpus table, and select different methods to tokenize, and you can quickly change between them, and see which one fits best for your case.
Another thing to consider is single texts. We would probably want our module to be configurable so it can also be used on a single text (string) instead of a table column with multiple texts. So we'd have a config option in the module to control that (sketched below).
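A hedged sketch of that config option, again reusing the tokenize function from above; the flag name 'single_text' is made up for illustration:

```python
def run_tokenizer(data, tokenize_method="tokenize-by-word", single_text=False):
    """Dispatch on the (hypothetical) 'single_text' config flag."""
    if single_text:
        # 'data' is one string -> one token list
        return tokenize(data, tokenize_method)
    # 'data' is an iterable of strings (e.g. a table column) -> list of token lists
    return [tokenize(text, tokenize_method) for text in data]
```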