Separate tokenizer from hasher #162

piroor · 2017-06-30T12:21:39Z

I'm separating the tokenizing operations from the hasher as a first step toward #131. I introduced these new modules and classes:

Tokenizer::Whitespace
Tokenizer::Token
TokenFilter::Stopword
TokenFilter::Stemmer

For testability and flexibility, they are kept separate for now. As the next step, I plan to introduce a mechanism for switching the tokenizer and related modules.

How about this approach?
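For readers following along, here is a minimal sketch of the separation described above. It is not the code from this PR: the module names SketchTokenizer and SketchTokenFilter, the word_hash helper, and the tiny stopword list are all hypothetical, and tokens are assumed to be plain strings.

```ruby
# Hypothetical stand-ins; these are NOT the classes added by this PR.
module SketchTokenizer
  module Whitespace
    module_function

    # Downcase the text and split it on runs of whitespace.
    def tokenize(text)
      text.downcase.split
    end
  end
end

module SketchTokenFilter
  module Stopword
    # Tiny illustrative list; an assumption, not the library's stopwords.
    STOPWORDS = %w[a an the is of].freeze

    module_function

    # Drop tokens that appear in the stopword list.
    def filter(tokens)
      tokens.reject { |t| STOPWORDS.include?(t) }
    end
  end
end

# The hasher only counts tokens; it no longer knows how they are produced.
def word_hash(text, tokenizer: SketchTokenizer::Whitespace,
              filters: [SketchTokenFilter::Stopword])
  tokens = tokenizer.tokenize(text)
  tokens = filters.reduce(tokens) { |acc, f| f.filter(acc) }
  tokens.each_with_object(Hash.new(0)) { |t, h| h[t.to_sym] += 1 }
end

result = word_hash("The cat is a cat")
```

Because the tokenizer and each filter are separate objects, each one can be tested on its own and swapped without touching the counting step.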

Ch4s3 · 2017-07-31T04:10:43Z

Sorry for the delay. I'll try to take a look tomorrow.

Ch4s3 · 2017-11-09T17:52:36Z

I'm super behind, but I'll get to this soon.

Ch4s3 · 2017-11-20T23:53:54Z

I finally took a deeper look at this. It looks really nice! If you're interested in continuing, I'll happily merge this once it's done and documented.

piroor · 2017-11-21T00:47:28Z

Thank you for reviewing! My hands are full at the moment, so I'll write the documentation for this next month.

Ch4s3 · 2017-11-21T02:56:53Z

@piroor no worries, I can relate. I'll probably roll out a release soon, and catch this PR on the next one.

ibnesayeed · 2017-12-15T01:37:46Z

I will keep an eye on this and will look into the code later when I get some cycles to spare.

Ch4s3 · 2017-12-15T04:13:14Z

Thanks @ibnesayeed


piroor · 2018-01-16T07:18:39Z

Sorry for the long delay. I added descriptions for the newly introduced (separated) classes and modules.

I've also made further changes so that the tokenizer and filters are customizable. Usage of the new options is documented in docs/bayes.md.

Ch4s3 · 2018-01-16T21:58:16Z

@piroor I'll take a look tomorrow.

Ch4s3 · 2018-03-02T02:44:04Z

This looks pretty good overall. I need to dig in a bit more once we handle #172 in the next day or so. I'll try to target this for a 2.3 release in the next week.

Thanks for your patience!

ibnesayeed · 2018-03-04T21:28:33Z

I finally got a chance to look at this today. It generally looks good to me, except for a few places where passing a method would have been easier but a module is required instead. For example, the :tokenizer and :token_filters options could just accept the corresponding methods rather than a module that implements those methods under very specific names. Having default implementations in modules is still fine, as long as we pass methods rather than modules, like below:

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]
classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer.tokenize, token_filters: filters

This signature will make it easier to write an inline custom tokenizer or filter, while more complex ones can be wrapped in a module when necessary.

piroor · 2018-03-05T03:48:14Z

@ibnesayeed the code you suggested won't work as you expect, because in this snippet:

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]

the filters array does not hold the methods themselves; it holds the values returned by calling those methods.

But I agree that the options should accept a lambda. So I think I should rename both fixed method names, tokenize and filter, to call; then each option can accept either a module or a lambda.
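The difference between calling a method and passing it around can be illustrated with a short sketch. The CatFilter body below is a hypothetical implementation written for this example, not the one in docs/bayes.md, and tokens are assumed to be plain strings.

```ruby
# Hypothetical filter module; the body is an assumption for illustration.
module CatFilter
  module_function

  # Named `call` so the module answers the same message as a lambda.
  def call(tokens)
    tokens.reject { |t| t =~ /cat/i }
  end
end

# Writing `CatFilter.call` with no arguments invokes the method right away.
# It does not produce a reference to it, and here it raises ArgumentError.
invoked = begin
  CatFilter.call
rescue ArgumentError => e
  e.class
end

# Three objects that can be stored in an array and respond to `call`:
as_module = CatFilter                  # the module itself
as_method = CatFilter.method(:call)    # a Method object
as_lambda = ->(tokens) { tokens.reject { |t| t.length < 3 } }

tokens = %w[cat dog ox]
results = [as_module, as_method, as_lambda].map { |f| f.call(tokens) }
```

Since all three respond to call, code that consumes a filter does not need to know whether it was given a module, a Method object, or a lambda.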

piroor · 2018-03-05T06:17:05Z

As of commit 958d3a0, the :tokenizer and :token_filters options accept a lambda.
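A rough sketch of what inline lambdas for these two roles might look like follows. The run_pipeline helper is hypothetical and only stands in for whatever the classifier does internally; tokens are assumed to be plain strings here, which may not match the token objects the library actually expects.

```ruby
# Hypothetical pipeline; the real classifier's internals may differ.
def run_pipeline(text, tokenizer:, token_filters: [])
  token_filters.reduce(tokenizer.call(text)) do |tokens, filter|
    filter.call(tokens)
  end
end

# Inline bigram tokenizer: every pair of adjacent characters.
bigram_tokenizer = ->(text) { text.chars.each_cons(2).map(&:join) }

# Inline filter: drop bigrams that contain whitespace.
no_space = ->(tokens) { tokens.reject { |t| t =~ /\s/ } }

bigrams = run_pipeline("ab cd", tokenizer: bigram_tokenizer,
                                token_filters: [no_space])
```

Simple cases like these fit in one line each, while a more complex tokenizer can still live in a module that defines call.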

Ch4s3 · 2018-03-05T16:03:03Z

This looks good to me

ibnesayeed · 2018-03-05T16:11:27Z

The code LGTM! (I have not tested it though).

piroor · 2018-03-05T17:06:09Z

Thanks!!

Ch4s3 · 2018-03-05T18:23:59Z

Thanks for the contribution!

ibnesayeed · 2018-03-10T23:31:55Z

docs/bayes.md

+
+filters = [
+  CatFilter,
+  white_filter


@piroor, are we missing a comma at the end here? Also, the last element of the array has an unnecessary comma.

piroor mentioned this pull request Dec 18, 2017

OSS Gate Meetup: Tokyo: 2017-12-18 oss-gate/workshop#706

Closed

piroor added 19 commits January 16, 2018 14:22

Separate whitespace tokenizer from hasher

96d2f1a

Separate stopword filter from hasher

25b112f

Run tests in deep directories

010605c

Separate stemmer from hasher

b66ac69

Separate tests for stopword and tokenizer from hasher's one

3188c52

Reintroduce method to get hash from clean words

4f60a6b

Fix usage of Stopword filter

6b433ee

Add tests for Tokenizer::Token

10d3e3a

Add test for TokenFilter::Stemmer

19d83d9

Remove needless conversion

91523e8

Unite stemmer and stopword filter to whitespace tokenizer

d84caa9

Fix indent

9df9bfa

Insert seaparator blank lines between meaningful blocks

164620c

Revert "Insert seaparator blank lines between meaningful blocks"

b30c9c5

This reverts commit 07cf360. Rollback.

Revert "Fix indent"

c6c88a5

This reverts commit 07e6807. Rollback.

Revert "Unite stemmer and stopword filter to whitespace tokenizer"

56fe374

This reverts commit f256337. They should be used separatedly.

Fix indent

a27c4f3

Use meaningful variable name

d7b2519

Describe new modules and classes

67caa82

piroor force-pushed the separate-tokenizer-from-hasher branch from 01528ca to 67caa82 Compare January 16, 2018 05:23

piroor added 2 commits January 16, 2018 14:58

Give tokenizer and token filters from outside of hasher

ce7bca0

Uniform coding style

d0bdd5b

piroor added 6 commits January 16, 2018 15:46

Remove needless parameter

6880ba5

Use langauge option only for stopwords filter

a9b9639

Add test for TokenFilter::Symbol

35c304e

Remove needless "s"

829a176

Add how to use custom token filters

751b15b

Reject cat token based on regexp

932a0a1

piroor added 4 commits January 16, 2018 16:32

Add tests to custom tokenizer and token filters

14af4d0

Fix usage of custom tokenizer

3c59f44

Add note for custom tokenizer

d856224

Describe spec of custom tokenizer at first

b82e68d

Ch4s3 mentioned this pull request Mar 2, 2018

Fix the corner cases #173

Merged

piroor changed the title ~~Separate tokenizer from hasher (WIP)~~ Separate tokenizer from hasher Mar 2, 2018

Accept lambda as custom token filter and tokenizer

958d3a0

piroor added 2 commits March 5, 2018 15:23

Fix mismatched descriptions about method

7bceef7

Add more tests for custom tokenizer and filters

81824f5

piroor force-pushed the separate-tokenizer-from-hasher branch from 3eeae4e to 81824f5 Compare March 5, 2018 06:24

Ch4s3 merged commit 605b261 into jekyll:master Mar 5, 2018

This was referenced Mar 5, 2018

In some languages like Chinese, a word of length not bigger than 2 is very common, so I suppose this is a very strong(sometimes wrong in other languages) assumption. #176

Open

Minimum length of not-stopword terms should be changable (for each language) #161

Closed

ibnesayeed reviewed Mar 10, 2018
