
[WIP] Adding sklearn wrapper for LDA code #932

Merged: 48 commits, Jan 29, 2017. Changes shown from 17 commits.
08f417c
adding basic sklearn wrapper for LDA code
AadityaJ Oct 10, 2016
61a6f8c
updating changelog
AadityaJ Oct 11, 2016
66be324
adding test case,adding id2word,deleting showtopics
AadityaJ Oct 16, 2016
cffa95b
adding relevant ipynb
AadityaJ Oct 16, 2016
10badc6
adding transfrom and other get methods and modifying print_topics
AadityaJ Oct 20, 2016
62a4d2f
stylizing code to follow conventions
AadityaJ Oct 21, 2016
b7eff2d
removing redundant default argumen values
AadityaJ Oct 21, 2016
2a193fd
adding partial_fit
AadityaJ Oct 23, 2016
a32f8dc
adding a line in test_sklearn_integration
AadityaJ Dec 9, 2016
a048ddc
using LDAModel as Parent Class
AadityaJ Dec 14, 2016
ac1d28e
adding docs, modifying getparam
AadityaJ Dec 18, 2016
0d6cc0a
changing class name.Adding comments
AadityaJ Dec 19, 2016
5d8c1a6
adding test case for update and transform
AadityaJ Dec 24, 2016
894784c
adding init
AadityaJ Dec 24, 2016
7a5ca4b
updating changes,fixed typo and changing file name
AadityaJ Dec 26, 2016
b35baba
deleted base.py
AadityaJ Dec 26, 2016
13a136d
adding better testPartialFit method and minor changes due to change i…
AadityaJ Dec 26, 2016
682f045
change name of test class
AadityaJ Dec 30, 2016
9fda951
adding changes in classname to ipynb
AadityaJ Dec 30, 2016
380ea5f
Merge branch 'develop' into sklearn_lda
AadityaJ Dec 30, 2016
e2485d4
Updating CHANGELOG.md
AadityaJ Dec 31, 2016
3015896
Updated Main Model. Added fit_predict to class for example
AadityaJ Dec 31, 2016
a76eda4
added sklearn countvectorizer example to ipynb
AadityaJ Dec 31, 2016
97c1530
adding logistic regression example
AadityaJ Jan 4, 2017
20a63ac
adding if condition for csr_matrix to ldamodel
AadityaJ Jan 4, 2017
c0b2c5c
adding check for fit csrmatrix also stylizing code
AadityaJ Jan 4, 2017
bd656a8
Merge branch 'develop' into sklearn_lda
AadityaJ Jan 5, 2017
d749ba0
minor bug.solved, fit should convert X to corpus
AadityaJ Jan 5, 2017
21119c5
removing fit_predict.adding csr_matrix check for update
AadityaJ Jan 6, 2017
14f984b
minor updates in ipynb
AadityaJ Jan 6, 2017
a3895b5
adding rst file
AadityaJ Jan 6, 2017
f832737
removed "basic" , added rst update to log
AadityaJ Jan 6, 2017
bc352a0
changing indentation in texts
AadityaJ Jan 6, 2017
7cc39da
added file preamble, removed unnecessary space
AadityaJ Jan 6, 2017
0ba233c
following more pep8 conventions
AadityaJ Jan 6, 2017
e23a8a4
removing unnecessary comments
AadityaJ Jan 6, 2017
041a32e
changing isinstance csr_matrix to issparse
AadityaJ Jan 7, 2017
e7120f0
changed to hanging indentation
AadityaJ Jan 8, 2017
8a0950d
changing main filename
AadityaJ Jan 8, 2017
bd8bced
changing module name in test
AadityaJ Jan 8, 2017
bb5872b
updating ipynb with main filename
AadityaJ Jan 8, 2017
777576e
changed class name
AadityaJ Jan 8, 2017
e50c3f9
changed file name
AadityaJ Jan 8, 2017
e521269
fixing filename typo
AadityaJ Jan 8, 2017
51931fa
adding html file
AadityaJ Jan 8, 2017
7ba30d6
deleting html file
AadityaJ Jan 8, 2017
82d1fdc
vertical indentation fixes
AadityaJ Jan 8, 2017
4f3441e
adding file to apiref.rst
AadityaJ Jan 10, 2017
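Several of the commits above (e.g. "adding if condition for csr_matrix to ldamodel", "fit should convert X to corpus") revolve around converting an input document-term matrix X into a gensim-style bag-of-words corpus. A minimal stand-alone sketch of that conversion, with a hypothetical function name and no gensim or scipy dependency:

```python
# Stand-alone sketch (not the PR's actual code) of the conversion the
# commits above deal with: turning a document-term matrix X into a
# gensim-style bag-of-words corpus, i.e. a list of (term_id, count)
# pairs per document. The function name is hypothetical.

def matrix_to_bow_corpus(X):
    """Convert a dense document-term matrix (rows = documents,
    columns = term ids) into a list of bag-of-words documents,
    keeping only the nonzero counts."""
    corpus = []
    for row in X:
        doc = [(term_id, count) for term_id, count in enumerate(row) if count != 0]
        corpus.append(doc)
    return corpus

# Two documents over a 4-term vocabulary:
X = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
]
print(matrix_to_bow_corpus(X))
# [[(0, 1), (2, 2)], [(1, 3), (3, 1)]]
```

The real wrapper additionally checks `scipy.sparse.issparse(X)` (per commit 041a32e) and iterates sparse rows instead of dense ones; the zero-skipping logic is the same.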
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -2,7 +2,10 @@ Changes
=======

0.13.3, 2016-09-26

* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,
[#932](https://github.com/RaRe-Technologies/gensim/pull/932))
* Add online learning feature to word2vec. (@isohyt [#900](https://github.com/RaRe-Technologies/gensim/pull/900))
Contributor: Please resolve merge conflicts. Only one line should be added to the changelog; remove the extra 2 lines about other changes.

Contributor: Please merge in the develop branch to remove merge conflicts.

* Tutorial: Reproducing Doc2vec paper result on wikipedia. (@isohyt, [#654](https://github.com/RaRe-Technologies/gensim/pull/654))
* Fixed issue #743: in word2vec's n_similarity method, a ZeroDivisionError is raised if at least one empty list is passed; added test cases in test/test_word2vec.py. (@pranay360, #883)
* Added Save/Load interface to AnnoyIndexer for index persistence (@fortiema, [#845](https://github.com/RaRe-Technologies/gensim/pull/845))
* Change export_phrases in Phrases model. Fix issue #794 (@AadityaJ,
138 changes: 138 additions & 0 deletions docs/notebooks/sklearn_wrapper.ipynb
@@ -0,0 +1,138 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using wrappers for Scikit learn API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial is about using gensim models as a part of your scikit learn workflow with the help of wrappers found at ```gensim.sklearn_integration.base```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The wrapper available (as of now) are :\n",
"* LdaModel (```gensim.sklearn_integration.base.LdaModel```),which implements gensim's ```LdaModel``` in a scikit-learn interface"
Contributor: Please update the ipynb with the new names of the .py file and of the class.
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LdaModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use LdaModel begin with importing LdaModel wrapper"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.sklearn_integration.base import LdaModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we will create a dummy set of texts and convert it into a corpus"
Contributor: Please add the examples to the ipynb from https://gist.github.com/AadityaJ/c98da3d01f76f068242c17b5e1593973. Remove the Dummy code from the gist and add that conversion code to your wrapper. You don't have to use pipeline syntax.
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from gensim.corpora import mmcorpus, Dictionary\n",
"texts = [['human', 'interface', 'computer'],\n",
" ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
" ['eps', 'user', 'interface', 'system'],\n",
" ['system', 'human', 'system', 'eps'],\n",
" ['user', 'response', 'time'],\n",
" ['trees'],\n",
" ['graph', 'trees'],\n",
" ['graph', 'minors', 'trees'],\n",
" ['graph', 'minors', 'survey']]\n",
"dictionary = Dictionary(texts)\n",
"corpus = [dictionary.doc2bow(text) for text in texts]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then to run the LdaModel on it"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, u'0.271*system + 0.181*eps + 0.181*interface + 0.181*human + 0.091*computer + 0.091*user + 0.001*trees + 0.001*graph + 0.001*time + 0.001*minors'), (1, u'0.166*graph + 0.166*trees + 0.111*user + 0.111*survey + 0.111*response + 0.111*minors + 0.111*time + 0.056*computer + 0.056*system + 0.001*human')]\n"
]
}
],
"source": [
"model=LdaModel(n_topics=2,id2word=dictionary,n_iter=20, random_state=1)\n",
"model.fit(corpus)\n",
"print model.print_topics(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
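The notebook above leans on the scikit-learn estimator contract: `fit()` returns `self` so calls can be chained, and `transform()` maps input to output. A minimal stand-alone sketch of that contract (no gensim required; the class and its behavior are illustrative, not the PR's LdaModel):

```python
# Minimal stand-alone sketch of the scikit-learn estimator contract the
# wrapper in this PR follows. ToyEstimator is a made-up example class.

class ToyEstimator(object):
    def __init__(self, scale=1):
        self.scale = scale

    def fit(self, X):
        # A real estimator would learn parameters from X here.
        self.n_samples_ = len(X)
        return self  # returning self enables est.fit(X).transform(X)

    def transform(self, X):
        return [x * self.scale for x in X]

est = ToyEstimator(scale=2)
print(est.fit([1, 2, 3]).transform([1, 2, 3]))
# [2, 4, 6]
```

This is the same shape the PR gives its wrapper: `model.fit(corpus)` trains the underlying LdaModel, then `model.transform(...)` and `model.print_topics(...)` query it.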
107 changes: 107 additions & 0 deletions gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py
@@ -0,0 +1,107 @@
#!/usr/bin/env python
Owner: Not a good filename; please use lower case, with underscores _ to separate expressions where necessary.
# -*- coding: utf-8 -*-
#
# Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
#
"""
scikit learn interface for gensim for easy use of gensim with scikit-learn
follows on scikit learn API conventions
"""
from gensim import models


class SklearnWrapperLdaModel(models.LdaModel,object):
Owner: PEP8: space after comma. Actually, not relevant at all, because LdaModel already inherits from object naturally.
"""
Base LDA module
"""
def __init__(self, corpus=None, num_topics=100, id2word=None,
Owner: Code style: no vertical indent.
distributed=False, chunksize=2000, passes=1, update_every=1,
alpha='symmetric', eta=None, decay=0.5, offset=1.0,
eval_every=10, iterations=50, gamma_threshold=0.001,
minimum_probability=0.01, random_state=None):
"""
sklearn wrapper for LDA model. derived class for gensim.model.LdaModel
"""
self.corpus = corpus
self.num_topics = num_topics
self.id2word = id2word
self.distributed = distributed
Owner: I don't think stuff like distributed would really work in sklearn. Same with training or storing very large models (sklearn makes lots of deep object copies internally).
self.chunksize = chunksize
self.passes = passes
self.update_every = update_every
self.alpha = alpha
self.eta = eta
self.decay = decay
self.offset = offset
self.eval_every = eval_every
self.iterations = iterations
self.gamma_threshold = gamma_threshold
self.minimum_probability = minimum_probability
self.random_state = random_state
"""
Owner: Use normal # code comments (not docstring """ comments).
if no fit function is used , then corpus is given in init
"""
if self.corpus:
models.LdaModel.__init__(
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,
Owner: No vertical indent.
distributed=self.distributed, chunksize=self.chunksize, passes=self.passes,
update_every=self.update_every,alpha=self.alpha, eta=self.eta, decay=self.decay,
Owner: Space after comma (here and everywhere else).
offset=self.offset,eval_every=self.eval_every, iterations=self.iterations,
gamma_threshold=self.gamma_threshold,minimum_probability=self.minimum_probability,
random_state=self.random_state)

def get_params(self, deep=True):
"""
returns all parameters as dictionary.
Warnings: Must for sklearn API.Do not Remove.
"""
if deep:
return {"corpus":self.corpus,"num_topics":self.num_topics,"id2word":self.id2word,
Owner: No vertical indent (here and everywhere else).
"distributed":self.distributed,"chunksize":self.chunksize,"passes":self.passes,
"update_every":self.update_every,"alpha":self.alpha," eta":self.eta," decay":self.decay,
"offset":self.offset,"eval_every":self.eval_every," iterations":self.iterations,
"gamma_threshold":self.gamma_threshold,"minimum_probability":self.minimum_probability,
"random_state":self.random_state}

def set_params(self, **parameters):
"""
set all parameters.
Owner: Capitalize sentences.
Warnings: Must for sklearn API.Do not Remove.
Owner: remove not capitalized; spacing around punctuation (here and elsewhere). Also, what are these "Warnings" for? Are they really necessary?

Contributor (author): I provided "Warnings" as a way to not remove the functions in the future (necessary for the sklearn API). Sure, I can scratch them.
"""
for parameter, value in parameters.items():
self.setattr(parameter, value)
Owner: Why not just self.parameter = value?
return self
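The owner's comment points at a real bug: `self.setattr(parameter, value)` would raise AttributeError, since `setattr` is a built-in taking the object as its first argument, not a method on the instance. A stand-alone sketch of the `get_params`/`set_params` pair scikit-learn expects, with the corrected call (parameter names are illustrative, not the wrapper's full set):

```python
# Stand-alone sketch of the get_params/set_params pair required by the
# scikit-learn API. Note the built-in setattr(self, name, value); the
# diff's self.setattr(parameter, value) would raise AttributeError.

class ParamsMixin(object):
    def __init__(self, num_topics=100, alpha='symmetric'):
        self.num_topics = num_topics
        self.alpha = alpha

    def get_params(self, deep=True):
        return {'num_topics': self.num_topics, 'alpha': self.alpha}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

m = ParamsMixin().set_params(num_topics=2, alpha='auto')
print(m.get_params())
# {'num_topics': 2, 'alpha': 'auto'}
```

scikit-learn calls these during `clone()` and grid search, which is why the author marked them "Do not Remove".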

def fit(self, X):
"""
For fitting corpus into the class object.
calls gensim.model.LdaModel:
>>>gensim.models.LdaModel(corpus=corpus,num_topics=num_topics,id2word=id2word,passes=passes,update_every=update_every,alpha=alpha,iterations=iterations,eta=eta,random_state=random_state)
Warnings: Must for sklearn API.Do not Remove.
"""
self.corpus=X
models.LdaModel.__init__(
self, corpus=X, num_topics=self.num_topics, id2word=self.id2word,
distributed=self.distributed, chunksize=self.chunksize, passes=self.passes,
update_every=self.update_every,alpha=self.alpha, eta=self.eta, decay=self.decay,
offset=self.offset,eval_every=self.eval_every, iterations=self.iterations,
gamma_threshold=self.gamma_threshold,minimum_probability=self.minimum_probability,
random_state=self.random_state)
return self

def transform(self, bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False):
"""
takes as an input a new document (bow) and
Return topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
Warnings: Must for sklearn API.Do not Remove.
"""
return self.get_document_topics(
Owner: This doesn't look right -- transform accepts a corpus (~sequence or array of multiple examples), not a single document (~one example).

bow, minimum_probability=minimum_probability,
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics)

def partial_fit(self, X):
"""
train model over X.
"""
self.update(corpus=X)
Contributor: please add a transform as in line 85 above
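The split the wrapper makes between `fit` (re-running `LdaModel.__init__`, i.e. training from scratch) and `partial_fit` (calling `LdaModel.update`, i.e. folding new data into the existing model) is the standard scikit-learn contract. A stand-alone sketch of that contract with a made-up counting "model" (illustrative only, not the wrapper's code):

```python
# Stand-alone sketch of the fit vs. partial_fit contract: fit() restarts
# training from scratch, while partial_fit() updates existing state
# incrementally, initializing it on the first call.

class CountModel(object):
    def fit(self, X):
        self.total_ = sum(X)      # refit: discard any previous state
        return self

    def partial_fit(self, X):
        if not hasattr(self, 'total_'):
            self.total_ = 0       # first call initializes the state
        self.total_ += sum(X)     # later calls update it incrementally
        return self

m = CountModel()
m.partial_fit([1, 2]).partial_fit([3])
print(m.total_)   # 6
m.fit([10])
print(m.total_)   # 10
```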

6 changes: 6 additions & 0 deletions gensim/sklearn_integration/__init__.py
@@ -0,0 +1,6 @@
"""scikit learn wrapper for gensim
Owner: Missing file preamble (encoding, author, license etc).
Contains various gensim-based implementations
which match with scikit-learn standards.
See [1] for complete set of conventions.
[1] http://scikit-learn.org/stable/developers/
"""
60 changes: 60 additions & 0 deletions gensim/test/test_sklearn_integration.py
@@ -0,0 +1,60 @@
import six
import unittest
import numpy

from gensim.sklearn_integration.SklearnWrapperGensimLdaModel import SklearnWrapperLdaModel
from gensim.corpora import Dictionary
from gensim import matutils

texts = [['complier', 'system', 'computer'],
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
Owner: Incorrect indentation.
['graph', 'flow', 'network', 'graph'],
['loading', 'computer', 'system'],
['user', 'server', 'system'],
['tree','hamiltonian'],
['graph', 'trees'],
['computer', 'kernel', 'malfunction','computer'],
['server','system','computer']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


class TestLdaModel(unittest.TestCase):
Contributor: Please rename the tests to TestSklearnLDAWrapper
def setUp(self):
self.model=SklearnWrapperLdaModel(id2word=dictionary,num_topics=2,passes=100,minimum_probability=0,random_state=numpy.random.seed(0))
self.model.fit(corpus)

def testPrintTopic(self):
Contributor: Please add a partial_fit test
topic = self.model.print_topics(2)

for k, v in topic:
self.assertTrue(isinstance(v, six.string_types))
self.assertTrue(isinstance(k, int))

def testTransform(self):
texts_new=['graph','eulerian']
bow = self.model.id2word.doc2bow(texts_new)
doc_topics, word_topics, phi_values = self.model.transform(bow,per_word_topics=True)

for k,v in word_topics:
self.assertTrue(isinstance(v, list))
self.assertTrue(isinstance(k, int))
for k,v in doc_topics:
self.assertTrue(isinstance(v, float))
self.assertTrue(isinstance(k, int))
for k,v in phi_values:
self.assertTrue(isinstance(v, list))
self.assertTrue(isinstance(k, int))

def testPartialFit(self):
for i in range(10):
self.model.partial_fit(X=corpus) # fit against the model again
doc=list(corpus)[0] # transform only the first document
transformed = self.model[doc]
transformed_approx = matutils.sparse2full(transformed, 2) # better approximation
expected=[0.13, 0.87]
passed = numpy.allclose(sorted(transformed_approx), sorted(expected), atol=1e-1)
self.assertTrue(passed)

if __name__ == '__main__':
unittest.main()