-
Just had a thought and adding it here so as not to lose it: I think it would make sense if this module returned, in addition to the pre-processed list of token lists, also a list of changes (or maybe stats of the changes, like "deleted_tokens: 5, changed_tokens: 10", etc.), or both. This would be especially useful to quickly give the user an idea of how much 'impact' certain settings had, without having to manually parse the whole result data.
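A rough sketch of the shape this could take, just to make the idea concrete (the class, field, and function names are all hypothetical, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class PreprocessStats:
    """Hypothetical container summarizing what a preprocessing run changed."""
    deleted_tokens: int = 0
    changed_tokens: int = 0

def lowercase_with_stats(token_lists, stats):
    """Example step that records how many tokens it altered."""
    result = []
    for tokens in token_lists:
        lowered = [t.lower() for t in tokens]
        # Count tokens whose value actually changed.
        stats.changed_tokens += sum(1 for a, b in zip(tokens, lowered) if a != b)
        result.append(lowered)
    return result
```

The module could then return the processed token lists together with the stats object (or a list of such objects, one per method).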
-
I like this idea and I agree. Instead of trying to parse this meta-output, should we just display it as is? It seems perhaps overly complicated to know in advance what kinds of changes a module will declare and to intercept them.
On Oct 28, 2021, at 10:46 AM, Mariella CC wrote:

> I think this is exactly the spirit of the wireframe that lets users see the changes on the initial list.
-
Just responding a bit to the monolithic vs. modular question... yeah, this is a trade-off we are facing everywhere. The solution might be for modules themselves to pull in code from elsewhere, i.e. for the one big module to actually be a consolidation of a range of other modules. But this is maybe too complex and overthinking things.
-
**Remove punctuation**
Include the removal of punctuation and/or special characters as a standalone option. This would help to remove 'noise', e.g., in social media texts or OCR output.
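A minimal sketch of what such a standalone option could do, assuming tokens are plain strings (the regex and the function name are just illustrative):

```python
import re

# Matches tokens made up entirely of non-word characters (punctuation, symbols).
PUNCT_ONLY = re.compile(r"^\W+$", re.UNICODE)

def remove_punctuation(token_lists):
    """Drop tokens that contain no word characters at all."""
    return [[t for t in tokens if not PUNCT_ONLY.match(t)]
            for tokens in token_lists]
```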
-
**Term frequency**
Eliminate both extremely common terms and extremely rare terms, since such terms make word-topic assignment much more difficult.
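A hedged sketch of how this could work using simple corpus-wide counts; the threshold values and parameter names here are made up and would need to be configurable:

```python
from collections import Counter

def filter_by_frequency(token_lists, min_count=5, max_doc_ratio=0.9):
    """Drop terms that are extremely rare (fewer than min_count occurrences
    overall) or extremely common (present in more than max_doc_ratio of the
    token lists)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    doc_counts = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(token_lists)

    def keep(t):
        return counts[t] >= min_count and doc_counts[t] / n_docs <= max_doc_ratio

    return [[t for t in tokens if keep(t)] for tokens in token_lists]
```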
-
After tokenizing, in a lot of cases there is a need to preprocess, fix, and filter the resulting lists of tokens. What we have encountered so far includes, for example, lowercasing and stopword removal.
This list is not exhaustive and can be extended with other methods that fix up the incoming tokens. For example, we could check for typos and either let users fix them manually or try to auto-fix them.
What all of those methods have in common is that they take a list of token lists as input and produce a list of token lists as output. In addition, each module might or might not have additional, custom configuration that a user may need to specify.
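In other words, every method would fit a signature roughly like this (a sketch; the names and the type alias are hypothetical):

```python
from typing import List

TokenLists = List[List[str]]

def remove_stopwords(token_lists: TokenLists, stopwords: set) -> TokenLists:
    """Same shape in, same shape out; 'stopwords' is this method's custom config."""
    return [[t for t in tokens if t not in stopwords] for tokens in token_lists]
```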
The main question I have is whether we should have all of those methods in separate modules, or include them all in one big pre-process-tokens module.
The advantage of having them separate is more modularity, and thus more flexibility in re-using them in different contexts. The advantage of a single big module would be better UX, because it's easier to find that module and just activate the methods (via inputs) that are desired in any specific scenario.
In this case, my opinion is that we don't lose a lot if we go the 'one-big-module' route. Each method could be enabled/disabled separately, which means the big module can be used instead of a 'single-task', specific preprocess module; all that needs to be done is disable all the other methods.
The other advantage of doing it like this is that we have some control over the order in which the pre-processing happens. For example, lowercasing should probably always happen before stopword removal (though we can discuss whether that is true or not). This removes the need for whoever assembles the pipeline to worry about how to order those specific steps. Plus, of course, the overall resulting pipeline structure will be less complex, with only one module instead of multiple ones.
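A sketch of how the one-big-module could look, with each method toggled by an input and applied in a fixed internal order (all names and defaults here are illustrative, not a settled design):

```python
def preprocess_tokens(token_lists, lowercase=True,
                      remove_stopwords=False, stopwords=None):
    """One module, several optional methods, applied in a fixed internal order."""
    result = token_lists
    if lowercase:
        # Lowercasing runs first, so later steps (e.g. stopword matching)
        # don't have to worry about case.
        result = [[t.lower() for t in tokens] for tokens in result]
    if remove_stopwords:
        sw = set(stopwords or [])
        result = [[t for t in tokens if t not in sw] for tokens in result]
    return result
```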
Maintaining backwards compatibility will also be possible, because all we'll be doing in the future is adding new methods to pre-process the token lists, never removing existing ones.