[WIP] Refactor documentation API Reference for gensim.summarization #1709

Merged
merged 29 commits into from
Dec 12, 2017
Changes from 5 commits
Commits
1c6009c
Added docstrings in textcleaner.py
yurkai Nov 12, 2017
851b02c
Merge branch 'develop' into fix-1668
menshikh-iv Nov 12, 2017
5cbb184
Added docstrings to bm25.py
yurkai Nov 13, 2017
31be095
syntactic_unit.py docstrings and typo
yurkai Nov 14, 2017
c6c608b
added doctrings for graph modules
yurkai Nov 16, 2017
d5247c1
keywords draft
yurkai Nov 17, 2017
3031cd0
keywords draft updated
yurkai Nov 20, 2017
4d7b0a9
keywords draft updated again
yurkai Nov 21, 2017
2c8ef28
keywords edited
yurkai Nov 22, 2017
254dce7
pagerank started
yurkai Nov 23, 2017
a2c2102
pagerank summarizer docstring added
yurkai Nov 25, 2017
1a87934
fixed types in docstrings in commons, bm25, graph and keywords
yurkai Nov 27, 2017
0ca8332
fixed types, examples and types in docstrings
yurkai Nov 28, 2017
ed188ae
Merge branch 'develop' into fix-1668
menshikh-iv Dec 11, 2017
20b19d6
fix pep8
menshikh-iv Dec 11, 2017
6ec29bf
fix doc build
menshikh-iv Dec 11, 2017
e2a2e60
fix bm25
menshikh-iv Dec 11, 2017
d7056e4
fix graph
menshikh-iv Dec 11, 2017
400966c
fix graph[2]
menshikh-iv Dec 11, 2017
44f617c
fix commons
menshikh-iv Dec 11, 2017
d2fed6c
fix keywords
menshikh-iv Dec 11, 2017
84b0f3a
fix keywords[2]
menshikh-iv Dec 11, 2017
ba8b1b6
fix mz_entropy
menshikh-iv Dec 11, 2017
2a283d7
fix pagerank_weighted
menshikh-iv Dec 12, 2017
6bd1584
fix graph rst
menshikh-iv Dec 12, 2017
7ec89fa
fix summarizer
menshikh-iv Dec 12, 2017
fa5efce
fix syntactic_unit
menshikh-iv Dec 12, 2017
0014d88
fix textcleaner
menshikh-iv Dec 12, 2017
1a0166a
fix
menshikh-iv Dec 12, 2017
109 changes: 108 additions & 1 deletion gensim/summarization/bm25.py
@@ -3,20 +3,72 @@
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""This module contains a function for computing BM25 scores for documents in a
corpus and the helper class `BM25` used in the calculations.
Review comment (Contributor):
Missed link to BM25 algorithm (wiki for example)



Example:
--------
>>> import numpy as np
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> np.round(get_bm25_weights(corpus), 3)
array([[ 1.282, 0.182, 0. ],
[ 0.13 , 1.113, 0. ],
[ 0. , 0. , 1.022]])

Data:
-----
.. data:: PARAM_K1 - free smoothing parameter for BM25.
.. data:: PARAM_B - free smoothing parameter for BM25.
.. data:: EPSILON - constant used for negative idf of document in corpus.
"""


import math
from six import iteritems
from six.moves import xrange


# BM25 parameters.
PARAM_K1 = 1.5
PARAM_B = 0.75
EPSILON = 0.25


class BM25(object):
"""Implementation of the Best Matching 25 ranking function.

Attributes
----------
corpus_size : int
Size of corpus (number of documents).
avgdl : float
Average length of document in `corpus`.
corpus : list of (list of str)
Corpus of documents.
f : list of dict
Term frequencies for each document in `corpus`.
df : dict
Document frequency of each term (number of documents containing it) over the whole `corpus`.
idf : dict
Inverse document frequency.

"""


def __init__(self, corpus):
"""Presets attributes and runs the initialize() function.

Parameters
----------
corpus : list of (list of str)
Corpus of documents.

"""
self.corpus_size = len(corpus)
self.avgdl = sum(float(len(x)) for x in corpus) / self.corpus_size
self.corpus = corpus
@@ -25,7 +77,12 @@ def __init__(self, corpus):
self.idf = {}
self.initialize()


def initialize(self):
"""Calculates frequencies of terms in documents and in corpus. Also
computes inverse document frequencies.

"""
for document in self.corpus:
frequencies = {}
for word in document:
@@ -42,7 +99,26 @@ def initialize(self):
for word, freq in iteritems(self.df):
self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
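
Note on the idf line above: the expression goes negative for any term that appears in more than half of the documents, which is presumably what the `EPSILON` constant described in the module docstring ("constant used for negative idf") is meant to handle. A minimal sketch of just that formula, using the hypothetical helper name `bm25_idf` (it is not part of gensim):

```python
import math


def bm25_idf(corpus_size, doc_freq):
    """IDF as written in the diff: log(N - df + 0.5) - log(df + 0.5).

    `bm25_idf` is a hypothetical helper for illustration only.
    """
    return math.log(corpus_size - doc_freq + 0.5) - math.log(doc_freq + 0.5)


# Term found in 1 of 3 documents: log(2.5) - log(1.5), which is positive.
rare = bm25_idf(3, 1)

# Term found in 2 of 3 documents: log(1.5) - log(2.5), which is negative.
common = bm25_idf(3, 2)
```

With the three-document corpus from the module docstring, "cat" falls in the second case, so its raw idf is below zero before any correction is applied.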


def get_score(self, document, index, average_idf):
"""Computes the BM25 score of the given `document` in relation to the corpus
item selected by `index`.

Parameters
----------
document : list of str
Document to be scored.
index : int
Index of document in corpus selected to score with `document`.
average_idf : float
Average idf in corpus.

Returns
-------
float
BM25 score.

"""
score = 0
for word in document:
if word not in self.f[index]:
@@ -52,7 +128,24 @@ def get_score(self, document, index, average_idf):
/ (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.corpus_size / self.avgdl)))
return score
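
For comparison with `get_score`, here is a rough, self-contained sketch of the textbook Okapi BM25 score. It is an illustration under stated assumptions, not the code under review: the function name `bm25_score` is hypothetical, negative idf values are not floored, and the length normalisation uses the scored document's length divided by the average length. The visible denominator in the hunk above uses `self.corpus_size / self.avgdl` instead, which may be worth checking against the full file since the hunk is truncated.

```python
import math

PARAM_K1 = 1.5
PARAM_B = 0.75


def bm25_score(query, doc, corpus, k1=PARAM_K1, b=PARAM_B):
    """Textbook Okapi BM25 score of `doc` for the terms in `query`.

    Hypothetical helper for illustration; negative idf is left as-is.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / float(n_docs)
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue  # terms absent from `doc` contribute nothing
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        # Length normalisation based on the length of the scored document.
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]

no_overlap = bm25_score(corpus[2], corpus[0], corpus)  # no shared terms
self_score = bm25_score(corpus[2], corpus[2], corpus)  # document against itself
```

Because this sketch normalises differently and skips the `EPSILON` handling, its numbers are not expected to match the `get_bm25_weights` output shown in the module docstring.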


def get_scores(self, document, average_idf):
"""Computes and returns BM25 scores of given `document` in relation to
every item in corpus.

Parameters
----------
document : list of str
Document to be scored.
average_idf : float
Average idf in corpus.

Returns
-------
list of float
BM25 scores.

"""
scores = []
for index in xrange(self.corpus_size):
score = self.get_score(document, index, average_idf)
@@ -61,6 +154,20 @@ def get_scores(self, document, average_idf):


def get_bm25_weights(corpus):
"""Returns BM25 scores (weights) of documents in corpus. Each document
is scored against every document in the given corpus.

Parameters
----------
corpus : list of (list of str)
Corpus of documents.

Returns
-------
list of (list of float)
BM25 scores.

"""
bm25 = BM25(corpus)
average_idf = sum(float(val) for val in bm25.idf.values()) / len(bm25.idf)

45 changes: 45 additions & 0 deletions gensim/summarization/commons.py
@@ -3,10 +3,46 @@
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""This module provides functions for creating a graph from a sequence of values and
for removing unreachable nodes.


Examples
--------

Create a simple graph and add edges. Let's take a look at the nodes.

>>> from gensim.summarization.commons import build_graph, remove_unreachable_nodes
>>> gg = build_graph(['Felidae', 'Lion', 'Tiger', 'Wolf'])
>>> gg.add_edge(("Felidae", "Lion"))
>>> gg.add_edge(("Felidae", "Tiger"))
>>> gg.nodes()
['Felidae', 'Lion', 'Tiger', 'Wolf']

Remove nodes with no edges.

>>> remove_unreachable_nodes(gg)
>>> gg.nodes()
['Felidae', 'Lion', 'Tiger']

"""

from gensim.summarization.graph import Graph


def build_graph(sequence):
"""Creates and returns graph with given sequence of values.
Review comment (Contributor):
What's "type" of graph (oriented, etc)?


Parameters
----------
sequence : list
Review comment (Contributor):
list of ?

Sequence of values.

Returns
-------
Graph
Review comment (Contributor):
Concrete link to type like

:class:`~gensim. ... ...

here and everywhere (for "gensim-defined" types).

Created graph.
Review comment (Contributor):
Graph produced by sequence


"""
graph = Graph()
for item in sequence:
if not graph.has_node(item):
@@ -15,6 +51,15 @@ def build_graph(sequence):


def remove_unreachable_nodes(graph):
"""Removes unreachable nodes (nodes with no edges). Works inplace.
Review comment (Contributor):
. Works inplace. -> , inplace.


Parameters
----------
graph : Graph
Given graph.

"""

for node in graph.nodes():
if sum(graph.edge_weight((node, other)) for other in graph.neighbors(node)) == 0:
graph.del_node(node)
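
The pruning loop above removes every node whose incident edge weights sum to zero. A minimal sketch of the same idea on a plain dictionary, assuming an undirected graph stored as `node -> {neighbor: weight}`; the name `remove_isolated` and this representation are illustrative assumptions, not gensim's `Graph` API:

```python
def remove_isolated(adjacency):
    """Drop nodes whose incident edge weights sum to zero, in place.

    Hypothetical helper; it does not clean up references to a removed
    node that other nodes might still hold via zero-weight edges.
    """
    # Iterate over a snapshot of the keys so deleting entries is safe.
    for node in list(adjacency):
        if sum(adjacency[node].values()) == 0:
            del adjacency[node]


# Toy graph mirroring the module docstring example.
graph = {
    "Felidae": {"Lion": 1, "Tiger": 1},
    "Lion": {"Felidae": 1},
    "Tiger": {"Felidae": 1},
    "Wolf": {},
}
remove_isolated(graph)
remaining = sorted(graph)
```

As in the docstring example, "Wolf" has no edges and is the only node dropped. Iterating over a copy of the node list matters here, because deleting from a dictionary while iterating over it directly raises a `RuntimeError` in Python 3.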