Separate tokenizer from hasher (#162)
* Separate whitespace tokenizer from hasher

* Separate stopword filter from hasher

* Run tests in deep directories

* Separate stemmer from hasher

* Separate tests for stopword and tokenizer from hasher's tests

* Reintroduce method to get hash from clean words

* Fix usage of Stopword filter

* Add tests for Tokenizer::Token

* Add test for TokenFilter::Stemmer

* Remove needless conversion

* Unite stemmer and stopword filter to whitespace tokenizer

* Fix indent

* Insert separator blank lines between meaningful blocks

* Revert "Insert separator blank lines between meaningful blocks"

This reverts commit 07cf360.
Rollback.

* Revert "Fix indent"

This reverts commit 07e6807.
Rollback.

* Revert "Unite stemmer and stopword filter to whitespace tokenizer"

This reverts commit f256337.
They should be used separately.

* Fix indent

* Use meaningful variable name

* Describe new modules and classes

* Give tokenizer and token filters from outside of hasher

* Uniform coding style

* Apply enable_stemmer option correctly

* Fix invalid URI

* Don't give needless parameters

* Load required modules

* Define default token filters for hasher

* Fix path to modules

* Add description how to use custom tokenizer

* Define token filter to remove symbol only tokens

* Fix path to required module

* Remove needless parameter

* Use language option only for stopwords filter

* Add test for TokenFilter::Symbol

* Remove needless "s"

* Add how to use custom token filters

* Reject cat token based on regexp

* Add tests to custom tokenizer and token filters

* Fix usage of custom tokenizer

* Add note for custom tokenizer

* Describe spec of custom tokenizer at first

* Accept lambda as custom token filter and tokenizer

* Fix mismatched descriptions about method

* Add more tests for custom tokenizer and filters
piroor authored and Ch4s3 committed Mar 5, 2018
1 parent 2db156b commit 605b261
Showing 18 changed files with 474 additions and 90 deletions.
4 changes: 2 additions & 2 deletions Rakefile
@@ -21,15 +21,15 @@ task default: [:test]
 desc 'Run all unit tests'
 Rake::TestTask.new(:test) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_test.rb'
+  t.pattern = 'test/**/*_test.rb'
   t.verbose = true
 end
 
 # Run benchmarks
 desc 'Run all benchmarks'
 Rake::TestTask.new(:bench) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_benchmark.rb'
+  t.pattern = 'test/**/*_benchmark.rb'
   t.verbose = true
 end
71 changes: 71 additions & 0 deletions docs/bayes.md
@@ -135,6 +135,77 @@ classifier.train("Cat", "I can has cat")
classifier.train("Dog", "I don't always bark at night")
```

## Custom Tokenizer

By default the classifier tokenizes the given input as white-space separated terms.
If you want to use a different tokenizer, supply it via the `:tokenizer` option.
The tokenizer must be an object with a method named `call`, or a lambda.
The `call` method must accept a string and return tokens as instances of `ClassifierReborn::Tokenizer::Token`.

```ruby
require 'classifier-reborn'

module BigramTokenizer
  module_function

  def call(str)
    str.each_char
       .each_cons(2)
       .map do |chars|
         # Mark bigrams as un-stemmable non-stopwords so the default
         # stopword filter's length check does not drop them.
         ClassifierReborn::Tokenizer::Token.new(chars.join,
                                                stemmable: false,
                                                maybe_stopword: false)
       end
  end
end

classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer
```

or

```ruby
require 'classifier-reborn'

bigram_tokenizer = lambda do |str|
  str.each_char
     .each_cons(2)
     .map do |chars|
       ClassifierReborn::Tokenizer::Token.new(chars.join,
                                              stemmable: false,
                                              maybe_stopword: false)
     end
end

classifier = ClassifierReborn::Bayes.new tokenizer: bigram_tokenizer
```
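
Either form plugs in the same way. Note the `stemmable: false, maybe_stopword: false` flags above: without them the default stopword filter rejects every two-character bigram via its length check. A minimal usage sketch (training data borrowed from the examples above; the result is illustrative):

```ruby
classifier.train "Cat", "I can has cat"
classifier.train "Dog", "I don't always bark at night"
classifier.classify "cat" # => "Cat" (likely, given this training data)
```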

## Custom Token Filters

By default the classifier rejects stopwords from the token stream; this behavior is implemented as a filter over tokens.
If you want to apply more token filters, supply them via the `:token_filters` option.
Each filter must be an object with a method named `call`, or a lambda.

```ruby
require 'classifier-reborn'

module CatFilter
  module_function

  def call(tokens)
    tokens.reject do |token|
      /cat/i === token
    end
  end
end

white_filter = lambda do |tokens|
  tokens.reject do |token|
    /white/i === token
  end
end

filters = [
  CatFilter,
  white_filter,
  # If you want to reject stopwords too, you need to include the stopword
  # filter in the list of token filters manually.
  ClassifierReborn::TokenFilter::Stopword
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```
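
This commit also introduces a `ClassifierReborn::TokenFilter::Symbol` filter (see `lib/classifier-reborn/extensions/token_filter/symbol.rb` below) that rejects symbol-only tokens; a minimal sketch of enabling it alongside the stopword filter:

```ruby
filters = [
  ClassifierReborn::TokenFilter::Stopword,
  ClassifierReborn::TokenFilter::Symbol
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```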

## Custom Stopwords

The library ships with stopword files in various languages.
22 changes: 18 additions & 4 deletions lib/classifier-reborn/bayes.rb
@@ -4,6 +4,9 @@

 require 'set'
 
+require_relative 'extensions/tokenizer/whitespace'
+require_relative 'extensions/token_filter/stopword'
+require_relative 'extensions/token_filter/stemmer'
 require_relative 'category_namer'
 require_relative 'backends/bayes_memory_backend'
 require_relative 'backends/bayes_redis_backend'
@@ -50,6 +53,14 @@ def initialize(*args)
       @threshold = options[:threshold]
       @enable_stemmer = options[:enable_stemmer]
       @backend = options[:backend]
+      @tokenizer = options[:tokenizer] || Tokenizer::Whitespace
+      @token_filters = options[:token_filters] || [TokenFilter::Stopword]
+      if @enable_stemmer && !@token_filters.include?(TokenFilter::Stemmer)
+        @token_filters << TokenFilter::Stemmer
+      end
+      if @token_filters.include?(TokenFilter::Stopword)
+        TokenFilter::Stopword.language = @language
+      end
 
       populate_initial_categories

@@ -65,7 +76,8 @@ def initialize(*args)
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
     def train(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)

@@ -95,7 +107,8 @@ def train(category, text)
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
     def untrain(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
       word_hash.each do |word, count|
Expand All @@ -120,7 +133,8 @@ def untrain(category, text)
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
     def classifications(text)
       score = {}
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       if word_hash.empty?
         category_keys.each do |category|
           score[category.to_s] = Float::INFINITY
@@ -266,7 +280,7 @@ def custom_stopwords(stopwords)
           return # Do not overwrite the default
         end
       end
-      Hasher::STOPWORDS[@language] = Set.new stopwords
+      TokenFilter::Stopword::STOPWORDS[@language] = Set.new stopwords
     end
   end
 end
62 changes: 18 additions & 44 deletions lib/classifier-reborn/extensions/hasher.rb
@@ -5,63 +5,37 @@
 
 require 'set'
 
+require_relative 'tokenizer/whitespace'
+require_relative 'token_filter/stopword'
+require_relative 'token_filter/stemmer'
+
 module ClassifierReborn
   module Hasher
-    STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
-
     module_function
 
     # Return a Hash of strings => ints. Each word in the string is stemmed,
     # interned, and indexes to its frequency in the document.
-    def word_hash(str, language = 'en', enable_stemmer = true)
-      cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
-      symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
-      cleaned_word_hash.merge(symbol_hash)
-    end
-
-    # Return a word hash without extra punctuation or short symbols, just stemmed words
-    def clean_word_hash(str, language = 'en', enable_stemmer = true)
-      word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer)
-    end
-
-    def word_hash_for_words(words, language = 'en', enable_stemmer = true)
-      d = Hash.new(0)
-      words.each do |word|
-        next unless word.length > 2 && !STOPWORDS[language].include?(word)
-        if enable_stemmer
-          d[word.stem.intern] += 1
-        else
-          d[word.intern] += 1
-        end
-      end
-      d
-    end
-
-    # Add custom path to a new stopword file created by user
-    def add_custom_stopword_path(path)
-      STOPWORDS_PATH.unshift(path)
-    end
-
-    def word_hash_for_symbols(words)
+    def word_hash(str, enable_stemmer = true,
+                  tokenizer: Tokenizer::Whitespace,
+                  token_filters: [TokenFilter::Stopword])
+      if token_filters.include?(TokenFilter::Stemmer)
+        unless enable_stemmer
+          token_filters.reject! do |token_filter|
+            token_filter == TokenFilter::Stemmer
+          end
+        end
+      else
+        token_filters << TokenFilter::Stemmer if enable_stemmer
+      end
+      words = tokenizer.call(str)
+      token_filters.each do |token_filter|
+        words = token_filter.call(words)
+      end
       d = Hash.new(0)
       words.each do |word|
         d[word.intern] += 1
       end
       d
     end
-
-    # Create a lazily-loaded hash of stopword data
-    STOPWORDS = Hash.new do |hash, language|
-      hash[language] = []
-
-      STOPWORDS_PATH.each do |path|
-        if File.exist?(File.join(path, language))
-          hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
-          break
-        end
-      end
-
-      hash[language]
-    end
   end
 end
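
With the new signature, hashing funnels through the tokenizer and filter chain before counting; a rough sketch of a call (illustrative input; exact keys depend on the shipped stopword list and the stemmer):

```ruby
require 'classifier-reborn'

ClassifierReborn::Hasher.word_hash('Hello hello world!')
# => { :hello => 2, :world => 1, :"!" => 1 } (roughly)
```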
23 changes: 23 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stemmer.rb
@@ -0,0 +1,23 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter converts given tokens to their stemmed versions.
    module Stemmer
      module_function

      def call(tokens)
        tokens.collect do |token|
          if token.stemmable?
            token.stem
          else
            token
          end
        end
      end
    end
  end
end
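
A quick sketch of the stemmer filter in isolation (assumes the `fast-stemmer`-backed `String#stem` the gem already depends on):

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('running cats')
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => ["run", "cat"] (as Token instances)
```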
47 changes: 47 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stopword.rb
@@ -0,0 +1,47 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes stopwords in the configured language from given tokens.
    module Stopword
      STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
      @language = 'en'

      module_function

      def call(tokens)
        tokens.reject do |token|
          token.maybe_stopword? &&
            (token.length <= 2 || STOPWORDS[@language].include?(token))
        end
      end

      # Add custom path to a new stopword file created by user
      def add_custom_stopword_path(path)
        STOPWORDS_PATH.unshift(path)
      end

      # Create a lazily-loaded hash of stopword data
      STOPWORDS = Hash.new do |hash, language|
        hash[language] = []

        STOPWORDS_PATH.each do |path|
          if File.exist?(File.join(path, language))
            hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
            break
          end
        end

        hash[language]
      end

      # Changes the language of stopwords
      def language=(language)
        @language = language
      end
    end
  end
end
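
A sketch of the filter with the default English list (output depends on the shipped stopword data):

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('this is a sentence')
ClassifierReborn::TokenFilter::Stopword.call(tokens)
# => ["sentence"]: "this" is a stopword, "is" and "a" fail the length check
```

The `language=` writer and `add_custom_stopword_path` above let callers switch lists or prepend their own stopword directories.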
19 changes: 19 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/symbol.rb
@@ -0,0 +1,19 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes symbol-only terms from given tokens.
    module Symbol
      module_function

      def call(tokens)
        tokens.reject do |token|
          /[^\s\p{WORD}]/ === token
        end
      end
    end
  end
end
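
Applied to the whitespace tokenizer's output, this drops the symbol-only tokens it emits; an illustrative trace:

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('good-bye!')
ClassifierReborn::TokenFilter::Symbol.call(tokens)
# => ["goodbye"] (the "-" and "!" tokens are rejected)
```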
35 changes: 35 additions & 0 deletions lib/classifier-reborn/extensions/tokenizer/token.rb
@@ -0,0 +1,35 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module Tokenizer
    class Token < String
      # The class can be created with one token string and extra attributes. E.g.,
      #   t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
      #
      # Attributes available are:
      #   stemmable:      Whether the token can be stemmed. This must be false for un-stemmable terms, otherwise it should be true.
      #   maybe_stopword: Whether the token may be a stopword. This must be false for terms which can never be stopwords, otherwise it should be true.
      def initialize(string, stemmable: true, maybe_stopword: true)
        super(string)
        @stemmable = stemmable
        @maybe_stopword = maybe_stopword
      end

      def stemmable?
        @stemmable
      end

      def maybe_stopword?
        @maybe_stopword
      end

      def stem
        stemmed = super
        self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
      end
    end
  end
end
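
A sketch of token behavior (illustrative values):

```ruby
token = ClassifierReborn::Tokenizer::Token.new('running', stemmable: true, maybe_stopword: true)
token.stemmable?  # => true
token.stem        # => "run", returned as a new Token with the same attributes
```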
27 changes: 27 additions & 0 deletions lib/classifier-reborn/extensions/tokenizer/whitespace.rb
@@ -0,0 +1,27 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

require_relative 'token'

module ClassifierReborn
  module Tokenizer
    # This tokenizes given input as white-space separated terms.
    # It mainly aims to tokenize sentences written with a space between words, like English, French, and others.
    module Whitespace
      module_function

      def call(str)
        tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
          Token.new(word, stemmable: true, maybe_stopword: true)
        end
        symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
          Token.new(word, stemmable: false, maybe_stopword: false)
        end
        tokens += symbol_tokens
        tokens
      end
    end
  end
end
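
Tracing `call` on a small input (illustrative): word characters are kept and downcased, and each symbol becomes its own non-stemmable, non-stopword token appended after the words:

```ruby
ClassifierReborn::Tokenizer::Whitespace.call('Hello, world!')
# => ["hello", "world", ",", "!"] (as Token instances)
```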
