Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phrases.export_phrases() doesn't yield all bigrams #2973

Closed
o-nc opened this issue Oct 6, 2020 · 3 comments
Closed

phrases.export_phrases() doesn't yield all bigrams #2973

o-nc opened this issue Oct 6, 2020 · 3 comments

Comments

@o-nc
Copy link

o-nc commented Oct 6, 2020

Hallo and thank you for this tool,

phrases.export_phrases() doesn't yield all bigrams when some are part of a bigger n-gram.

If I create a Phrases object with

phrases = gensim.models.phrases.Phrases(sentences, min_count=1, threshold=10, delimiter=b' ', scoring='default')

on the following two sentences

New York City has the largest population of all the cities in the United States .
Every year, many travelers come to the United States to visit New York City .

print(dict(phrases.export_phrases(sentences))) only returns {b'New York': 11.5, b'United States': 11.5}. It should also return {b'York City': 11.5} however.

line 187 of phrases.py should probably be changed to last_uncommon = word . It fixes the problem on my side and seems to be what the code was intended to be.

Thank you,
Olivier NC

macOS-10.14.5-x86_64-i386-64bit
Python 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27)
[Clang 6.0 (clang-600.0.57)]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 0

@piskvorky
Copy link
Owner

piskvorky commented Oct 6, 2020

That module has weird docs written in broken English. E.g. common_terms : list of object – List of common terms, they receive special treatment.… WTF?

@mpenkov I'll try to clean up the code & docs, otherwise another candidate for axing IMO.

@o-nc hopefully I'll get to your question too during the cleanup. Thanks for raising this.

@piskvorky
Copy link
Owner

piskvorky commented Oct 8, 2020

@o-nc if New York phrase is detected in the text New York City, then York City cannot be detected as a phrase (the York token is already "used up").

If you wish to detect multi-word phrases (3grams, 4grams, etc), run Phrases multiple times over itself.

Check out #2976 for a faster & cleaner re-implementation of phrase detection. Feel free to use that, feedback welcome. That PR will be a part of the 4.0.0 Gensim release, planned for this month.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2020

See also #1719 for discussion of the same overlapping-bigram issue, how the "greedy" left-to-right nature of the current code causes it, & some partial ideas for how it might be changed to report the 'better' pairings. (Though, in your test case, as they're perfectly tied, you'd still not get both without yet another choice to return some constituent unigrams more than once, which might surprise more users.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants