Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix min_count handling in phrases detection using npmi #2072

Merged
merged 6 commits into from
Jul 31, 2018
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions gensim/models/phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -661,7 +661,7 @@ def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, co
len_vocab : int
NOT USED.
min_count: int
NOT USED.
Take into account only bigrams with count above this value.
corpus_word_count : int
Number of words in corpus.

Expand All @@ -671,10 +671,15 @@ def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, co
where :math:`prob(word) = \\frac{word\_count}{corpus\_word\_count}`

"""
pa = worda_count / corpus_word_count
pb = wordb_count / corpus_word_count
pab = bigram_count / corpus_word_count
return log(pab / (pa * pb)) / -log(pab)
if bigram_count > min_count:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be >= IMO.

pa = worda_count / corpus_word_count
pb = wordb_count / corpus_word_count
pab = bigram_count / corpus_word_count
return log(pab / (pa * pb)) / -log(pab)
else:
# Return the value below minimal npmi, to make sure that phrases
# will be created only out of bigrams more frequent than min_count
return -1.1
Copy link
Owner

@piskvorky piskvorky Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning behind -1.1? Looks too magical.

Deserves a code comment at the very least, but maybe a principled value would be even better? -inf? None?

Copy link

@rafabr4 rafabr4 Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That value will be compared against the threshold parameter that was supplied by the user, which even though is not restricted, should fall within -1 and 1.

Also. I would use a >= instead of > in if bigram_count > min_count:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

I'd be in favor of returning None (for all scoring functions) rather than a number for any min_count violation since that would mean all scorers return the same value that means "min_count not met". But I also don't know if None may break things. What tests fail with None? From a quick look this is probably safe.

I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least.

Copy link

@rafabr4 rafabr4 Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you can't compare int/float with NoneType (at least in python 3.6.2), so line 168 of phrases.py would have to be something like if score is not None and score > threshold:. This would also require to change the default scorer to return None explicitly when min_count is not met, for consistency across scorers.

Other solution, instead of returning None in scorers, could be return -float('Inf') which would result in 'False' when comparing score > threshold.

I personally like solution 1 more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me.

Thank you for remarks. Indeed -inf sounds a bit less fragile-magical. Code updated.

Personally, I would avoid returning None from function expected to return a number and cluttering the rest of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to -Inf, I learned something :). LGTM.



def pseudocorpus(source_vocab, sep, common_terms=frozenset()):
Expand Down