-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
phrases.export_phrases() doesn't yield all bigrams #2973
Comments
That module has weird docs written in broken English. E.g. @mpenkov I'll try to clean up the code & docs, otherwise another candidate for axing IMO. @o-nc hopefully I'll get to your question too during the cleanup. Thanks for raising this. |
@o-nc if If you wish to detect multi-word phrases (3grams, 4grams, etc), run Phrases multiple times over itself. Check out #2976 for a faster & cleaner re-implementation of phrase detection. Feel free to use that, feedback welcome. That PR will be a part of the 4.0.0 Gensim release, planned for this month. |
See also #1719 for discussion of the same overlapping-bigram issue, how the "greedy" left-to-right nature of the current code causes it, & some partial ideas for how it might be changed to report the 'better' pairings. (Though, in your test case, as they're perfectly tied, you'd still not get both without yet another choice to return some constituent unigrams more than once, which might surprise more users.) |
Hallo and thank you for this tool,
phrases.export_phrases() doesn't yield all bigrams when some are part of a bigger n-gram.
If I create a Phrases object with
phrases = gensim.models.phrases.Phrases(sentences, min_count=1, threshold=10, delimiter=b' ', scoring='default')
on the following two sentences
New York City has the largest population of all the cities in the United States .
Every year, many travelers come to the United States to visit New York City .
print(dict(phrases.export_phrases(sentences)))
only returns {b'New York': 11.5, b'United States': 11.5}. It should also return {b'York City': 11.5} however.line 187 of phrases.py should probably be changed to
last_uncommon = word
. It fixes the problem on my side and seems to be what the code was intended to be.Thank you,
Olivier NC
macOS-10.14.5-x86_64-i386-64bit
Python 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27)
[Clang 6.0 (clang-600.0.57)]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 0
The text was updated successfully, but these errors were encountered: