Separate tokenizer from hasher (#162)
* Separate whitespace tokenizer from hasher

* Separate stopword filter from hasher

* Run tests in deep directories

* Separate stemmer from hasher

* Separate tests for stopword and tokenizer from hasher's tests

* Reintroduce method to get hash from clean words

* Fix usage of Stopword filter

* Add tests for Tokenizer::Token

* Add test for TokenFilter::Stemmer

* Remove needless conversion

* Unite stemmer and stopword filter to whitespace tokenizer

* Fix indent

* Insert separator blank lines between meaningful blocks

* Revert "Insert separator blank lines between meaningful blocks"

This reverts commit 07cf360.
Rollback.

* Revert "Fix indent"

This reverts commit 07e6807.
Rollback.

* Revert "Unite stemmer and stopword filter to whitespace tokenizer"

This reverts commit f256337.
They should be used separately.

* Fix indent

* Use meaningful variable name

* Describe new modules and classes

* Give tokenizer and token filters from outside of hasher

* Uniform coding style

* Apply enable_stemmer option correctly

* Fix invalid URI

* Don't give needless parameters

* Load required modules

* Define default token filters for hasher

* Fix path to modules

* Add description how to use custom tokenizer

* Define token filter to remove symbol only tokens

* Fix path to required module

* Remove needless parameter

* Use language option only for stopwords filter

* Add test for TokenFilter::Symbol

* Remove needless "s"

* Add how to use custom token filters

* Reject cat token based on regexp

* Add tests to custom tokenizer and token filters

* Fix usage of custom tokenizer

* Add note for custom tokenizer

* Describe spec of custom tokenizer at first

* Accept lambda as custom token filter and tokenizer

* Fix mismatched descriptions about method

* Add more tests for custom tokenizer and filters
piroor authored and Ch4s3 committed Mar 5, 2018
1 parent 2db156b commit 605b261
Showing 18 changed files with 474 additions and 90 deletions.
4 changes: 2 additions & 2 deletions Rakefile
@@ -21,15 +21,15 @@ task default: [:test]
 desc 'Run all unit tests'
 Rake::TestTask.new(:test) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_test.rb'
+  t.pattern = 'test/**/*_test.rb'
   t.verbose = true
 end
 
 # Run benchmarks
 desc 'Run all benchmarks'
 Rake::TestTask.new(:bench) do |t|
   t.libs << 'lib'
-  t.pattern = 'test/*/*_benchmark.rb'
+  t.pattern = 'test/**/*_benchmark.rb'
   t.verbose = true
 end
71 changes: 71 additions & 0 deletions docs/bayes.md
@@ -135,6 +135,77 @@ classifier.train("Cat", "I can has cat")
classifier.train("Dog", "I don't always bark at night")
```

## Custom Tokenizer

By default the classifier tokenizes the given input as white-space separated terms.
If you want to use a different tokenizer, supply it via the `:tokenizer` option.
The tokenizer must be an object with a method named `call`, or a lambda.
The `call` method must accept a string and return tokens as instances of `ClassifierReborn::Tokenizer::Token`.

```ruby
require 'classifier-reborn'

module BigramTokenizer
  module_function

  def call(str)
    str.each_char
       .each_cons(2)
       .map do |chars|
         # Mark bigrams as un-stemmable non-stopwords so the default
         # stopword filter's length check does not drop them.
         ClassifierReborn::Tokenizer::Token.new(chars.join,
                                                stemmable: false,
                                                maybe_stopword: false)
       end
  end
end

classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer
```

or

```ruby
require 'classifier-reborn'

bigram_tokenizer = lambda do |str|
  str.each_char
     .each_cons(2)
     .map do |chars|
       ClassifierReborn::Tokenizer::Token.new(chars.join,
                                              stemmable: false,
                                              maybe_stopword: false)
     end
end

classifier = ClassifierReborn::Bayes.new tokenizer: bigram_tokenizer
```
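
Either form plugs in the same way. Note the `stemmable: false, maybe_stopword: false` flags above: without them the default stopword filter rejects every two-character bigram via its length check. A minimal usage sketch (training data borrowed from the examples above; the result is illustrative):

```ruby
classifier.train "Cat", "I can has cat"
classifier.train "Dog", "I don't always bark at night"
classifier.classify "cat" # => "Cat" (likely, given this training data)
```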

## Custom Token Filters

By default the classifier rejects stopwords from the token stream; this behavior is implemented as a filter over tokens.
If you want to apply more token filters, supply them via the `:token_filters` option.
Each filter must be an object with a method named `call`, or a lambda.

```ruby
require 'classifier-reborn'

module CatFilter
  module_function

  def call(tokens)
    tokens.reject do |token|
      /cat/i === token
    end
  end
end

white_filter = lambda do |tokens|
  tokens.reject do |token|
    /white/i === token
  end
end

filters = [
  CatFilter,
  white_filter,
  # If you want to reject stopwords too, you need to include the stopword
  # filter in the list of token filters manually.
  ClassifierReborn::TokenFilter::Stopword
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```
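
This commit also introduces a `ClassifierReborn::TokenFilter::Symbol` filter (see `lib/classifier-reborn/extensions/token_filter/symbol.rb` below) that rejects symbol-only tokens; a minimal sketch of enabling it alongside the stopword filter:

```ruby
filters = [
  ClassifierReborn::TokenFilter::Stopword,
  ClassifierReborn::TokenFilter::Symbol
]
classifier = ClassifierReborn::Bayes.new token_filters: filters
```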

## Custom Stopwords

The library ships with stopword files in various languages.
22 changes: 18 additions & 4 deletions lib/classifier-reborn/bayes.rb
@@ -4,6 +4,9 @@

 require 'set'
 
+require_relative 'extensions/tokenizer/whitespace'
+require_relative 'extensions/token_filter/stopword'
+require_relative 'extensions/token_filter/stemmer'
 require_relative 'category_namer'
 require_relative 'backends/bayes_memory_backend'
 require_relative 'backends/bayes_redis_backend'
@@ -50,6 +53,14 @@ def initialize(*args)
       @threshold = options[:threshold]
       @enable_stemmer = options[:enable_stemmer]
       @backend = options[:backend]
+      @tokenizer = options[:tokenizer] || Tokenizer::Whitespace
+      @token_filters = options[:token_filters] || [TokenFilter::Stopword]
+      if @enable_stemmer && !@token_filters.include?(TokenFilter::Stemmer)
+        @token_filters << TokenFilter::Stemmer
+      end
+      if @token_filters.include?(TokenFilter::Stopword)
+        TokenFilter::Stopword.language = @language
+      end
 
       populate_initial_categories

@@ -65,7 +76,8 @@ def initialize(*args)
     #     b.train "that", "That text"
     #     b.train "The other", "The other text"
     def train(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)

@@ -95,7 +107,8 @@ def train(category, text)
     #     b.train :this, "This text"
     #     b.untrain :this, "This text"
     def untrain(category, text)
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       return if word_hash.empty?
       category = CategoryNamer.prepare_name(category)
       word_hash.each do |word, count|
Expand All @@ -120,7 +133,8 @@ def untrain(category, text)
     # The largest of these scores (the one closest to 0) is the one picked out by #classify
     def classifications(text)
       score = {}
-      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      word_hash = Hasher.word_hash(text, @enable_stemmer,
+                                   tokenizer: @tokenizer, token_filters: @token_filters)
       if word_hash.empty?
         category_keys.each do |category|
           score[category.to_s] = Float::INFINITY
@@ -266,7 +280,7 @@ def custom_stopwords(stopwords)
           return # Do not overwrite the default
         end
       end
-      Hasher::STOPWORDS[@language] = Set.new stopwords
+      TokenFilter::Stopword::STOPWORDS[@language] = Set.new stopwords
     end
   end
 end
62 changes: 18 additions & 44 deletions lib/classifier-reborn/extensions/hasher.rb
@@ -5,63 +5,37 @@
 
 require 'set'
 
+require_relative 'tokenizer/whitespace'
+require_relative 'token_filter/stopword'
+require_relative 'token_filter/stemmer'
+
 module ClassifierReborn
   module Hasher
-    STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
-
     module_function
 
     # Return a Hash of strings => ints. Each word in the string is stemmed,
     # interned, and indexes to its frequency in the document.
-    def word_hash(str, language = 'en', enable_stemmer = true)
-      cleaned_word_hash = clean_word_hash(str, language, enable_stemmer)
-      symbol_hash = word_hash_for_symbols(str.scan(/[^\s\p{WORD}]/))
-      cleaned_word_hash.merge(symbol_hash)
-    end
-
-    # Return a word hash without extra punctuation or short symbols, just stemmed words
-    def clean_word_hash(str, language = 'en', enable_stemmer = true)
-      word_hash_for_words(str.gsub(/[^\p{WORD}\s]/, '').downcase.split, language, enable_stemmer)
-    end
-
-    def word_hash_for_words(words, language = 'en', enable_stemmer = true)
-      d = Hash.new(0)
-      words.each do |word|
-        next unless word.length > 2 && !STOPWORDS[language].include?(word)
-        if enable_stemmer
-          d[word.stem.intern] += 1
-        else
-          d[word.intern] += 1
-        end
-      end
-      d
-    end
-
-    # Add custom path to a new stopword file created by user
-    def add_custom_stopword_path(path)
-      STOPWORDS_PATH.unshift(path)
-    end
-
-    def word_hash_for_symbols(words)
+    def word_hash(str, enable_stemmer = true,
+                  tokenizer: Tokenizer::Whitespace,
+                  token_filters: [TokenFilter::Stopword])
+      if token_filters.include?(TokenFilter::Stemmer)
+        unless enable_stemmer
+          token_filters.reject! do |token_filter|
+            token_filter == TokenFilter::Stemmer
+          end
+        end
+      else
+        token_filters << TokenFilter::Stemmer if enable_stemmer
+      end
+      words = tokenizer.call(str)
+      token_filters.each do |token_filter|
+        words = token_filter.call(words)
+      end
       d = Hash.new(0)
       words.each do |word|
         d[word.intern] += 1
       end
       d
     end
-
-    # Create a lazily-loaded hash of stopword data
-    STOPWORDS = Hash.new do |hash, language|
-      hash[language] = []
-
-      STOPWORDS_PATH.each do |path|
-        if File.exist?(File.join(path, language))
-          hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
-          break
-        end
-      end
-
-      hash[language]
-    end
   end
 end
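
With the new signature, hashing funnels through the tokenizer and filter chain before counting; a rough sketch of a call (illustrative input; exact keys depend on the shipped stopword list and the stemmer):

```ruby
require 'classifier-reborn'

ClassifierReborn::Hasher.word_hash('Hello hello world!')
# => { :hello => 2, :world => 1, :"!" => 1 } (roughly)
```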
23 changes: 23 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stemmer.rb
@@ -0,0 +1,23 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter converts given tokens to their stemmed versions.
    module Stemmer
      module_function

      def call(tokens)
        tokens.collect do |token|
          if token.stemmable?
            token.stem
          else
            token
          end
        end
      end
    end
  end
end
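
A quick sketch of the stemmer filter in isolation (assumes the `fast-stemmer`-backed `String#stem` the gem already depends on):

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('running cats')
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => ["run", "cat"] (as Token instances)
```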
47 changes: 47 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/stopword.rb
@@ -0,0 +1,47 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes stopwords in the configured language from given tokens.
    module Stopword
      STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
      @language = 'en'

      module_function

      def call(tokens)
        tokens.reject do |token|
          token.maybe_stopword? &&
            (token.length <= 2 || STOPWORDS[@language].include?(token))
        end
      end

      # Add custom path to a new stopword file created by user
      def add_custom_stopword_path(path)
        STOPWORDS_PATH.unshift(path)
      end

      # Create a lazily-loaded hash of stopword data
      STOPWORDS = Hash.new do |hash, language|
        hash[language] = []

        STOPWORDS_PATH.each do |path|
          if File.exist?(File.join(path, language))
            hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
            break
          end
        end

        hash[language]
      end

      # Changes the language of stopwords
      def language=(language)
        @language = language
      end
    end
  end
end
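
A sketch of the filter with the default English list (output depends on the shipped stopword data):

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('this is a sentence')
ClassifierReborn::TokenFilter::Stopword.call(tokens)
# => ["sentence"]: "this" is a stopword, "is" and "a" fail the length check
```

The `language=` writer and `add_custom_stopword_path` above let callers switch lists or prepend their own stopword directories.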
19 changes: 19 additions & 0 deletions lib/classifier-reborn/extensions/token_filter/symbol.rb
@@ -0,0 +1,19 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes symbol-only terms from given tokens.
    module Symbol
      module_function

      def call(tokens)
        tokens.reject do |token|
          /[^\s\p{WORD}]/ === token
        end
      end
    end
  end
end
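
Applied to the whitespace tokenizer's output, this drops the symbol-only tokens it emits; an illustrative trace:

```ruby
tokens = ClassifierReborn::Tokenizer::Whitespace.call('good-bye!')
ClassifierReborn::TokenFilter::Symbol.call(tokens)
# => ["goodbye"] (the "-" and "!" tokens are rejected)
```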
35 changes: 35 additions & 0 deletions lib/classifier-reborn/extensions/tokenizer/token.rb
@@ -0,0 +1,35 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module Tokenizer
    class Token < String
      # The class can be created with one token string and extra attributes. E.g.,
      #   t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
      #
      # Attributes available are:
      #   stemmable:      Whether the token can be stemmed. This must be false for un-stemmable terms, otherwise it should be true.
      #   maybe_stopword: Whether the token may be a stopword. This must be false for terms which can never be stopwords, otherwise it should be true.
      def initialize(string, stemmable: true, maybe_stopword: true)
        super(string)
        @stemmable = stemmable
        @maybe_stopword = maybe_stopword
      end

      def stemmable?
        @stemmable
      end

      def maybe_stopword?
        @maybe_stopword
      end

      def stem
        stemmed = super
        self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
      end
    end
  end
end
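
A sketch of token behavior (illustrative values):

```ruby
token = ClassifierReborn::Tokenizer::Token.new('running', stemmable: true, maybe_stopword: true)
token.stemmable?  # => true
token.stem        # => "run", returned as a new Token with the same attributes
```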
27 changes: 27 additions & 0 deletions lib/classifier-reborn/extensions/tokenizer/whitespace.rb
@@ -0,0 +1,27 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

require_relative 'token'

module ClassifierReborn
  module Tokenizer
    # This tokenizes given input as white-space separated terms.
    # It mainly aims to tokenize sentences written with a space between words, like English, French, and others.
    module Whitespace
      module_function

      def call(str)
        tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
          Token.new(word, stemmable: true, maybe_stopword: true)
        end
        symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
          Token.new(word, stemmable: false, maybe_stopword: false)
        end
        tokens += symbol_tokens
        tokens
      end
    end
  end
end
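
Tracing `call` on a small input (illustrative): word characters are kept and downcased, and each symbol becomes its own non-stemmable, non-stopword token appended after the words:

```ruby
ClassifierReborn::Tokenizer::Whitespace.call('Hello, world!')
# => ["hello", "world", ",", "!"] (as Token instances)
```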
