Separate tokenizer from hasher (#162)
* Separate whitespace tokenizer from hasher
* Separate stopword filter from hasher
* Run tests in deep directories
* Separate stemmer from hasher
* Separate tests for stopword and tokenizer from the hasher's
* Reintroduce method to get hash from clean words
* Fix usage of Stopword filter
* Add tests for Tokenizer::Token
* Add test for TokenFilter::Stemmer
* Remove needless conversion
* Unite stemmer and stopword filter to whitespace tokenizer
* Fix indent
* Insert separator blank lines between meaningful blocks
* Revert "Insert separator blank lines between meaningful blocks" (reverts commit 07cf360)
* Revert "Fix indent" (reverts commit 07e6807)
* Revert "Unite stemmer and stopword filter to whitespace tokenizer" (reverts commit f256337; they should be used separately)
* Fix indent
* Use meaningful variable name
* Describe new modules and classes
* Pass tokenizer and token filters to the hasher from outside
* Unify coding style
* Apply enable_stemmer option correctly
* Fix invalid URI
* Don't pass needless parameters
* Load required modules
* Define default token filters for hasher
* Fix path to modules
* Describe how to use a custom tokenizer
* Define token filter to remove symbol-only tokens
* Fix path to required module
* Remove needless parameter
* Use language option only for the stopword filter
* Add test for TokenFilter::Symbol
* Remove needless "s"
* Describe how to use custom token filters
* Reject "cat" token based on a regexp
* Add tests for custom tokenizer and token filters
* Fix usage of custom tokenizer
* Add note for custom tokenizer
* Describe the spec of the custom tokenizer first
* Accept lambda as custom token filter and tokenizer
* Fix mismatched descriptions about method
* Add more tests for custom tokenizer and filters
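The net effect of the series above: the hasher no longer hard-codes tokenizing, stopword removal, and stemming, but is handed a tokenizer and a chain of token filters, and anything responding to #call, including a lambda, can fill either role. A minimal sketch of that contract in Ruby, chaining the pieces by hand; the lambda names are illustrative, and the exact keyword arguments the hasher itself accepts are not shown in this diff:

require 'classifier-reborn'

# Hypothetical custom tokenizer: any #call-able that maps a string
# to an array of tokens, here splitting on whitespace and hyphens.
hyphen_tokenizer = lambda do |str|
  str.downcase.split(/[\s\-]+/).collect do |word|
    ClassifierReborn::Tokenizer::Token.new(word, stemmable: true, maybe_stopword: true)
  end
end

# Hypothetical custom token filter: any #call-able that prunes a token array.
short_token_filter = ->(tokens) { tokens.reject { |token| token.length < 4 } }

tokens = hyphen_tokenizer.call('state-of-the-art text classification')
short_token_filter.call(tokens)
# => ["state", "text", "classification"] ("of", "the", and "art" are too short)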
Showing 18 changed files with 474 additions and 90 deletions.
@@ -0,0 +1,23 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter converts given tokens to their stemmed versions.
    module Stemmer
      module_function

      def call(tokens)
        tokens.collect do |token|
          if token.stemmable?
            token.stem
          else
            token
          end
        end
      end
    end
  end
end
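Exercised on its own, the filter looks like this. A sketch, assuming the fast-stemmer gem that classifier-reborn depends on; it provides the String#stem that Token#stem delegates to:

require 'classifier-reborn'

tokens = [
  ClassifierReborn::Tokenizer::Token.new('categories', stemmable: true),
  ClassifierReborn::Tokenizer::Token.new('!', stemmable: false)
]

# Stemmable tokens are replaced by their stems; the rest pass through as-is.
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => ["categori", "!"] under fast-stemmer's Porter stemming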
@@ -0,0 +1,47 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes stopwords in the current language from given tokens.
    module Stopword
      STOPWORDS_PATH = [File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
      @language = 'en'

      module_function

      def call(tokens)
        tokens.reject do |token|
          token.maybe_stopword? &&
            (token.length <= 2 || STOPWORDS[@language].include?(token))
        end
      end

      # Add a custom path to a user-created stopword file
      def add_custom_stopword_path(path)
        STOPWORDS_PATH.unshift(path)
      end

      # Create a lazily-loaded hash of stopword data
      STOPWORDS = Hash.new do |hash, language|
        hash[language] = []

        STOPWORDS_PATH.each do |path|
          if File.exist?(File.join(path, language))
            hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding("utf-8").split
            break
          end
        end

        hash[language]
      end

      # Changes the language of stopwords
      def language=(language)
        @language = language
      end
    end
  end
end
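A usage sketch: the module-level language defaults to 'en', and user-supplied stopword directories are searched before the bundled data. The path below is hypothetical, and the directory must contain a file named after the language code:

require 'classifier-reborn'

filter = ClassifierReborn::TokenFilter::Stopword
filter.language = 'en'  # the default; other codes select other bundled lists

tokens = ClassifierReborn::Tokenizer::Whitespace.call('there were cats on the mat')
filter.call(tokens)
# => plausibly ["cats", "mat"]: "on" is dropped for being <= 2 characters;
#    "there", "were", and "the" go if the English list contains them

# Search a user-created stopword directory before the bundled one:
filter.add_custom_stopword_path('/path/to/my/stopwords')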
@@ -0,0 +1,19 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module TokenFilter
    # This filter removes symbol-only terms from given tokens.
    module Symbol
      module_function

      def call(tokens)
        tokens.reject do |token|
          /[^\s\p{WORD}]/ === token
        end
      end
    end
  end
end
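Because the whitespace tokenizer (added later in this diff) emits punctuation as separate non-word tokens, this filter is the piece that discards them. A small sketch:

require 'classifier-reborn'

tokens = ClassifierReborn::Tokenizer::Whitespace.call('good-bye, world!')
# The tokenizer yields ["goodbye", "world"] plus the symbol tokens "-", ",", "!"
ClassifierReborn::TokenFilter::Symbol.call(tokens)
# => ["goodbye", "world"]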
@@ -0,0 +1,35 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

module ClassifierReborn
  module Tokenizer
    class Token < String
      # The class can be created with one token string and extra attributes, e.g.:
      #   t = ClassifierReborn::Tokenizer::Token.new 'Tokenize', stemmable: true, maybe_stopword: false
      #
      # Available attributes:
      #   stemmable: true       Whether the token can be stemmed. This must be false for un-stemmable terms; otherwise it should be true.
      #   maybe_stopword: true  Whether the token may be a stopword. This must be false for terms which can never be stopwords; otherwise it should be true.
      def initialize(string, stemmable: true, maybe_stopword: true)
        super(string)
        @stemmable = stemmable
        @maybe_stopword = maybe_stopword
      end

      def stemmable?
        @stemmable
      end

      def maybe_stopword?
        @maybe_stopword
      end

      def stem
        stemmed = super
        self.class.new(stemmed, stemmable: @stemmable, maybe_stopword: @maybe_stopword)
      end
    end
  end
end
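A short illustration: a Token is a String subclass, so it compares and prints like its underlying text while carrying hints for the downstream filters. The stemmed value is whatever String#stem (from the fast-stemmer dependency) returns:

require 'classifier-reborn'

t = ClassifierReborn::Tokenizer::Token.new('Tokenize', stemmable: true, maybe_stopword: false)

t == 'Tokenize'    # => true; it is still a String
t.stemmable?       # => true
t.maybe_stopword?  # => false
t.stem             # => a new Token holding the stemmed text, with the same flags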
@@ -0,0 +1,27 @@
# encoding: utf-8
# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

require_relative 'token'

module ClassifierReborn
  module Tokenizer
    # This tokenizes given input into whitespace-separated terms.
    # It is mainly aimed at sentences written with spaces between words,
    # as in English, French, and other such languages.
    module Whitespace
      module_function

      def call(str)
        tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
          Token.new(word, stemmable: true, maybe_stopword: true)
        end
        symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
          Token.new(word, stemmable: false, maybe_stopword: false)
        end
        tokens += symbol_tokens
        tokens
      end
    end
  end
end
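Chained together, the new modules reproduce the pipeline the hasher previously performed internally. A sketch, assuming the bundled English stopword list and fast-stemmer:

require 'classifier-reborn'

tokens = ClassifierReborn::Tokenizer::Whitespace.call('The quick brown foxes were jumping!')
tokens = ClassifierReborn::TokenFilter::Symbol.call(tokens)    # drops "!"
tokens = ClassifierReborn::TokenFilter::Stopword.call(tokens)  # drops "the" and "were", if listed
ClassifierReborn::TokenFilter::Stemmer.call(tokens)
# => plausibly ["quick", "brown", "fox", "jump"]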