-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory and Redis backend support #84
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start! Left a few comments to align with the Jekyll coding style and some helpful Ruby tips. 😄
@@ -0,0 +1,72 @@ | |||
# Author: Sawood Alam <@ibnesayeed> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any author comments in our code – your authorship is present in the Git history and here on GitHub. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will be gone. I usually don't add such things, but I have seen many other files here that have Author, License, and some other metadata associated in comments, so I thought it was the style of this code base.
end | ||
|
||
def total_words | ||
@total_words |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these, you can use the built-in attr_reader
.
Replace this method with
attr_reader :total_words
It's convenient short-hand for
def total_words
@total_words
end
@total_words += diff | ||
end | ||
|
||
def total_trainings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use attr_reader
here as well. And they can be combined:
attr_reader :total_words, :total_trainings
Just a comma-separated list of symbols, where the symbol is the name of the variable and the name of the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware of attr_reader
, but I thought in other backends they wont be attributes. So for the sake of uniformity, I used a more usual method definition. However, I can certainly change it in the Memory backend in the more concise form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be unlikely that any future backend wouldn't have these available as attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ch4s3, in the Redis backend they are not plain in-memory attributes. They are stored in the Redis instance. Keeping them in memory as plain attributes would require to traverse through all the records and when initialized from an existing dataset. This was the whole point of using another backend to keep the counters and other state information in the backend store while keeping the configurations and other similar things in the memory.
@categories[category] ||= Hash.new(0) | ||
end | ||
|
||
def category_keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be categories
? They keys of the @categories
Hash correspond to the categories that the user has input, so the user might like to write categories
to get the list of categories that have data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the bayes.rb
file, there is a method named categories
that returns a more natural list of category names. However, in the data structure, those categories are converted to symbols (see CategoryNamer class: name.to_s.tr('_', ' ').capitalize.intern
). I want to keep the transformations and logic away from the storage (backend) class. So, the category_keys
method returns the form that was stored in the data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
options = { language: 'en', | ||
auto_categorize: false, | ||
enable_threshold: false, | ||
threshold: 0.0, | ||
enable_stemmer: true | ||
enable_stemmer: true, | ||
redis_con_str: '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best option here is to use dependency injection. Thus the user does:
memory_backed_classifier = ClassifierReborn::Bayes.new
redis_backed_classifier = ClassifierReborn::Bayes.new(:backend => ClassifierReborn::BayesRedisBackend.new("localhost", "6987"))
So they pass in a :backend
key which is an instance of the relevant backend they wish to use. Then they can configure Redis and we don't have to worry about any backend configurations in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked this approach, thanks! Initially, I thought about letting users directly create instance of the Redis class and pass that as for connection, but that would be too much abstraction. I will go ahead and make the Memory backend as default and let the user pass an instance of the backend to allow override.
@parkr, I have taken care of all your comments. Please have a look at the new changes and suggest for more changes if necessary. Meanwhile, I can work on the Redis backend. |
All tests are passing in both Memory and Redis backends. If a Redis server is not running on the same machine with default configs then Redis backend tests are omitted. Currently, I have just copied the |
Here is an article about DRYing up the test cases so that they can be used as modules, but it is for Minitest. We want similar capability in our test environment. Still looking for my options. |
I was able to put all test cases in a module and include that in both the test classes. Yay! Should the name |
I'd probably leave the file name the same to preserve history, unless you need to change the contents of the tests dramatically. |
@Ch4s3, did you get a chance to review the code changes yet? |
# Conflicts: # .travis.yml # test/bayes/bayesian_test.rb
It is no more a work in progress from my end. Please review the code and let me know if some changes are needed. |
@@ -55,19 +59,20 @@ def train(category, text) | |||
category = CategoryNamer.prepare_name(category) | |||
|
|||
# Add the category dynamically or raise an error | |||
unless @categories.key?(category) | |||
unless category_keys.include?(category) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep .key? include?
and key?
are functionally the same, unless I'm missing something, and I think key is more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@categories
was a nested hash which has key?
method. However, category_keys
is an array of just the keys of that hash (which was abstracted to support Redis backend) and arrays don't have key?
method defined. Am I missing something?
This generally looks very good @ibnesayeed! Its a big change, so I need to put a bit more time into looking it over. I'd also like to make sure @parkr is on board because I don't want to cause any upstream problems for Jekyll. |
@Ch4s3 and @parkr do we have a real or pseudo-real (fairly good in size) training set that we can use to train and ask for scores of certain queries without this PR, then we can repeat the same with this PR in both Memory and Redis backends. If the scores come out to be exactly the same then we will have more confidence that nothing was changed while refactoring inline memory operations into a separate backend. I think I am essentially talking about an integration test. |
@ibnesayeed no, but we should. I usually pull down some wikipedia articles to demo the gem, but that might not be enough. |
@Ch4s3 I have tried to test it on a public data set about SMS spam classification. I am yet to match the results, but here is what I did. Cases:
For each of these cases:
Please find the test scripts, data, and results at https://gist.github.com/ibnesayeed/73680bb26c4ea2dc1c55651d126e57fb |
I'll take a look as soon as possible. |
Now that I am back to my desk and had a look at the produced output, they are exactly the same. The gist I posted before had some wrong data initially as I was not flushing the Redis database, so after successive runs, scores based on Redis backend were changing due to persistence of earlier tests. Now I have updated the gist. I am now very confident that the implementation was safely refactored and extended without changing the existing behavior. Coding style and other things can be reviewed obviously. |
Awesome! Maybe we should consider using that data for integration testing. |
That was going to be my next question, if you would be interested in getting that test data added in the repo with some test cases and more tests can be written based on that data later. How should we put a test case for now? Should we just do what I did in that external script, then compare the results of the two backends or should we compare them against a static expected output? |
Actually, now that there are more files in the |
I agree |
PR #102 should do. |
This PR is to support #81
TODO