-
Just had a thought and adding it here so as not to lose it: I think it would make sense if this module returned, in addition to the pre-processed list of token lists, also a list of changes (or maybe stats of the changes, like "deleted_tokens: 5, changed_tokens: 10", etc.), or both. This would be especially useful to quickly give the user an idea of how much 'impact' certain settings had, without having to manually parse the whole result data.
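A rough sketch of the shape this could take, just to make the idea concrete (the class, field, and function names are all hypothetical, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class PreprocessStats:
    """Hypothetical container summarizing what a preprocessing run changed."""
    deleted_tokens: int = 0
    changed_tokens: int = 0

def lowercase_with_stats(token_lists, stats):
    """Example step that records how many tokens it altered."""
    result = []
    for tokens in token_lists:
        lowered = [t.lower() for t in tokens]
        # Count tokens whose value actually changed.
        stats.changed_tokens += sum(1 for a, b in zip(tokens, lowered) if a != b)
        result.append(lowered)
    return result
```

The module could then return the processed token lists together with the stats object (or a list of such objects, one per method).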
-
I like this idea and I agree. Instead of trying to parse this meta-output, should we just display it as is? It seems perhaps overly complicated to know in advance what kinds of changes a module will declare and to intercept them.
On Oct 28, 2021, at 10:46 AM, Mariella CC wrote:

> I think this is exactly the spirit of the wireframe that lets users see the changes on the initial list.
-
Just responding a bit to the monolithic vs. modular question... yeah, this is a trade-off we are facing everywhere. The solution might be for modules themselves to pull in code from elsewhere, i.e. for the one big module to actually be a consolidation of a range of other modules. But this is maybe too complex and overthinking things.
-
**Remove punctuation**
Include the removal of punctuation and/or special characters as a standalone option. This would help to remove 'noise', e.g., in social media texts or OCR output.
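A minimal sketch of what such a standalone option could do, assuming tokens are plain strings (the regex and the function name are just illustrative):

```python
import re

# Matches tokens made up entirely of non-word characters (punctuation, symbols).
PUNCT_ONLY = re.compile(r"^\W+$", re.UNICODE)

def remove_punctuation(token_lists):
    """Drop tokens that contain no word characters at all."""
    return [[t for t in tokens if not PUNCT_ONLY.match(t)]
            for tokens in token_lists]
```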
-
**Term frequency**
Eliminate both extremely common terms and extremely rare terms, since such terms make word-topic assignment much more difficult.
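A hedged sketch of how this could work using simple corpus-wide counts; the threshold values and parameter names here are made up and would need to be configurable:

```python
from collections import Counter

def filter_by_frequency(token_lists, min_count=5, max_doc_ratio=0.9):
    """Drop terms that are extremely rare (fewer than min_count occurrences
    overall) or extremely common (present in more than max_doc_ratio of the
    token lists)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    doc_counts = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(token_lists)

    def keep(t):
        return counts[t] >= min_count and doc_counts[t] / n_docs <= max_doc_ratio

    return [[t for t in tokens if keep(t)] for tokens in token_lists]
```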
-
After tokenizing, in a lot of cases there is a need to preprocess, fix, and filter the resulting lists of tokens. What we have encountered so far includes, for example, lowercasing and stopword removal.
This list is not exhaustive and can be extended with other methods that fix up the incoming tokens. For example, we could check for typos and either let users fix them manually or try to auto-fix them.
What all of those methods have in common is that they take a list of token lists as input and produce a list of token lists as output. In addition, each module might or might not have additional, custom configuration that a user may need to specify.
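In other words, every method would fit a signature roughly like this (a sketch; the names and the type alias are hypothetical):

```python
from typing import List

TokenLists = List[List[str]]

def remove_stopwords(token_lists: TokenLists, stopwords: set) -> TokenLists:
    """Same shape in, same shape out; 'stopwords' is this method's custom config."""
    return [[t for t in tokens if t not in stopwords] for tokens in token_lists]
```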
The main question I have is whether we should have all of those methods in separate modules, or include them all in one big pre-process-tokens module.
The advantage of having them separate is more modularity, and thus more flexibility in re-using them in different contexts. The advantage of a single big module would be better UX, because it's easier to find that module and just activate the methods (via inputs) that are desired in any specific scenario.
In this case, my opinion is that we don't lose a lot if we go the 'one-big-module' route. Each method could be enabled/disabled separately, which means the big module can be used instead of a 'single-task', specific preprocess module; all that needs to be done is disable all the other methods.
The other advantage of doing it like this is that we have some control over the order in which the pre-processing happens. For example, lowercasing should probably always happen before stopword removal (though we can discuss whether that is true or not). This removes the need for whoever assembles the pipeline to worry about how to order those specific steps. Plus, of course, the overall resulting pipeline structure will be less complex, with only one module instead of multiple ones.
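A sketch of how the one-big-module could look, with each method toggled by an input and applied in a fixed internal order (all names and defaults here are illustrative, not a settled design):

```python
def preprocess_tokens(token_lists, lowercase=True,
                      remove_stopwords=False, stopwords=None):
    """One module, several optional methods, applied in a fixed internal order."""
    result = token_lists
    if lowercase:
        # Lowercasing runs first, so later steps (e.g. stopword matching)
        # don't have to worry about case.
        result = [[t.lower() for t in tokens] for tokens in result]
    if remove_stopwords:
        sw = set(stopwords or [])
        result = [[t for t in tokens if t not in sw] for tokens in result]
    return result
```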
Maintaining backwards compatibility will also be possible, because all we'll be doing in the future is adding new methods to pre-process the token lists, never removing existing ones.