Separate tokenizer from hasher #162
Sorry for the delay. I'll try to take a look tomorrow.
I'm super behind, but I'll get to this soon.
I finally took a deeper look at this. It looks really nice! If you're interested in continuing, I'll happily merge this once it's done and documented.
Thank you for reviewing! My hands are full at the moment, so I'll write the documentation for this next month.
@piroor no worries, I can relate. I'll probably roll out a release soon, and catch this PR on the next one.
I will keep an eye on this and will look into the code later when I get some cycles to spare.
Thanks @ibnesayeed
01528ca to 67caa82
Sorry for the long delay. I added descriptions for the newly introduced (separated) classes and modules. I've also added more changes to make the tokenizer and filters customizable. Usage of the new options is documented in docs/bayes.md.
@piroor I'll take a look tomorrow.
This looks pretty good overall. I need to dig in a bit more once we handle #172 in the next day or so. I'll try to target this for a Thanks for your patience!
Finally, I got a chance to look at it today. It generally looks good to me, except for a few places where passing a method would have been easier but a module is required instead. For example, the
filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]
classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer.tokenize, token_filters: filters
This signature will make it easier to write an inline custom tokenizer or filter, while more complex ones can be wrapped in a module when necessary.
@ibnesayeed the code you suggested won't work as you expected, because
filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]
the But I agree that the option should accept a lambda. So I think I should rename both fixed method name
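One likely reading of the objection above is that writing `CatFilter.filter` invokes the method immediately rather than passing it along. The sketch below illustrates that general Ruby behaviour and two ways to get a callable object instead. The `CatFilter` module here is a hypothetical stand-in written for this example, not the code from the PR or the library's API.

```ruby
# Hypothetical filter module, for illustration only.
module CatFilter
  module_function

  # Remove any token that mentions "cat".
  def filter(tokens)
    tokens.reject { |token| token.match?(/cat/i) }
  end
end

# `CatFilter.filter` with no arguments calls the method right away,
# so it raises ArgumentError instead of producing something storable.
begin
  CatFilter.filter
rescue ArgumentError => e
  call_error = e.class
end

# Both of these are objects that respond to `call` and can go in an array.
as_method = CatFilter.method(:filter)
as_lambda = ->(tokens) { tokens.reject(&:empty?) }

filters = [as_method, as_lambda]

# Apply each filter in order to the running token list.
result = filters.reduce(['cat', '', 'dog']) { |tokens, f| f.call(tokens) }
```

Accepting anything that responds to `call` lets lambdas, `Method` objects, and modules with a `call` method be used interchangeably, which may be why renaming the fixed method names was proposed.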
After the commit 958d3a0, now
3eeae4e to 81824f5
This looks good to me
The code LGTM! (I have not tested it though).
Thanks!!
Thanks for the contribution!
filters = [
  CatFilter,
  white_filter
@piroor, are we missing a comma at the end here? Also, the last element of the array has an unnecessary comma.
I'm trying to separate the tokenizing operations from the hasher, as a first step toward #131. I introduced these new modules and classes:
Tokenizer::Whitespace
Tokenizer::Token
TokenFilter::Stopword
TokenFilter::Stemmer
For testability and flexibility, they are kept separate for now. As the next step, I'm planning to introduce a mechanism for switching the tokenizer and related modules.
How about this approach?
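To make the proposed separation concrete, here is a minimal standalone sketch of a hasher that delegates tokenizing and filtering to swappable callables. All names (`WhitespaceTokenizer`, `StopwordFilter`, `word_hash`, and the keyword arguments) are assumptions made for this illustration and are not taken from the PR's code.

```ruby
# Hypothetical names throughout; this is not the classifier-reborn API.
module WhitespaceTokenizer
  module_function

  # Split on whitespace and downcase each token.
  def call(text)
    text.split.map(&:downcase)
  end
end

module StopwordFilter
  STOPWORDS = %w[a an the is of].freeze

  module_function

  # Drop tokens that appear in the stopword list.
  def call(tokens)
    tokens.reject { |token| STOPWORDS.include?(token) }
  end
end

# The hasher only counts tokens; it does not know how they were produced.
def word_hash(text, tokenizer: WhitespaceTokenizer, token_filters: [StopwordFilter])
  # Tokenize first, then run the tokens through each filter in order.
  tokens = token_filters.reduce(tokenizer.call(text)) do |current, filter|
    filter.call(current)
  end
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

counts = word_hash('The cat is a cat')

# Swapping in a different tokenizer needs no change to the hasher.
bigram = ->(text) { text.each_char.each_cons(2).map(&:join) }
bigram_counts = word_hash('abc', tokenizer: bigram, token_filters: [])
```

Keeping each stage behind a single `call` entry point means each one can be tested in isolation, which matches the testability goal stated above.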