-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix min_count handling in phrases detection using npmi #2072
Changes from 1 commit
6d980d2
99dce42
61cdbfc
61fa8b2
8123dad
80c5b7e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -661,7 +661,7 @@ def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, co | |
len_vocab : int | ||
NOT USED. | ||
min_count: int | ||
NOT USED. | ||
Take into account only bigrams with count above this value. | ||
corpus_word_count : int | ||
Number of words in corpus. | ||
|
||
|
@@ -671,10 +671,15 @@ def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, co | |
where :math:`prob(word) = \\frac{word\_count}{corpus\_word\_count}` | ||
|
||
""" | ||
pa = worda_count / corpus_word_count | ||
pb = wordb_count / corpus_word_count | ||
pab = bigram_count / corpus_word_count | ||
return log(pab / (pa * pb)) / -log(pab) | ||
if bigram_count > min_count: | ||
pa = worda_count / corpus_word_count | ||
pb = wordb_count / corpus_word_count | ||
pab = bigram_count / corpus_word_count | ||
return log(pab / (pa * pb)) / -log(pab) | ||
else: | ||
# Return the value below minimal npmi, to make sure that phrases | ||
# will be created only out of bigrams more frequent than min_count | ||
return -1.1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the reasoning behind -1.1? Looks too magical. Deserves a code comment at the very least, but maybe a principled value would be even better? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That value will be compared against the threshold parameter that was supplied by the user, which even though is not restricted, should fall within -1 and 1. Also. I would use a >= instead of > in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good catch. I'd be in favor of returning I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems you can't compare int/float with NoneType (at least in python 3.6.2), so line 168 of phrases.py would have to be something like Other solution, instead of returning I personally like solution 1 more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me. Thank you for remarks. Indeed -inf sounds a bit less fragile-magical. Code updated. Personally, I would avoid returning None from function expected to return a number and cluttering the rest of the code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 to -Inf, I learned something :). LGTM. |
||
|
||
|
||
def pseudocorpus(source_vocab, sep, common_terms=frozenset()): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be
>=
IMO.