phrases.export_phrases() doesn't yield all bigrams #2973

o-nc · 2020-10-06T03:28:57Z

Hallo and thank you for this tool,

phrases.export_phrases() doesn't yield all bigrams when some are part of a bigger n-gram.

If I create a Phrases object with

phrases = gensim.models.phrases.Phrases(sentences, min_count=1, threshold=10, delimiter=b' ', scoring='default')

on the following two sentences

New York City has the largest population of all the cities in the United States .
Every year, many travelers come to the United States to visit New York City .

print(dict(phrases.export_phrases(sentences))) only returns {b'New York': 11.5, b'United States': 11.5}. It should also return {b'York City': 11.5} however.

line 187 of phrases.py should probably be changed to last_uncommon = word . It fixes the problem on my side and seems to be what the code was intended to be.

Thank you,
Olivier NC

macOS-10.14.5-x86_64-i386-64bit
Python 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27)
[Clang 6.0 (clang-600.0.57)]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 0

The text was updated successfully, but these errors were encountered:

piskvorky · 2020-10-06T19:31:05Z

That module has weird docs written in broken English. E.g. common_terms : list of object – List of common terms, they receive special treatment.… WTF?

@mpenkov I'll try to clean up the code & docs, otherwise another candidate for axing IMO.

@o-nc hopefully I'll get to your question too during the cleanup. Thanks for raising this.

piskvorky · 2020-10-08T11:40:16Z

@o-nc if New York phrase is detected in the text New York City, then York City cannot be detected as a phrase (the York token is already "used up").

If you wish to detect multi-word phrases (3grams, 4grams, etc), run Phrases multiple times over itself.

Check out #2976 for a faster & cleaner re-implementation of phrase detection. Feel free to use that, feedback welcome. That PR will be a part of the 4.0.0 Gensim release, planned for this month.

gojomo · 2020-10-09T00:04:26Z

See also #1719 for discussion of the same overlapping-bigram issue, how the "greedy" left-to-right nature of the current code causes it, & some partial ideas for how it might be changed to report the 'better' pairings. (Though, in your test case, as they're perfectly tied, you'd still not get both without yet another choice to return some constituent unigrams more than once, which might surprise more users.)

piskvorky mentioned this issue Oct 8, 2020

[MRG] Refactor phrases #2976

Merged

piskvorky closed this as completed Oct 8, 2020

mpenkov mentioned this issue Oct 28, 2020

Update changelog for 4.0.0 release #2981

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

phrases.export_phrases() doesn't yield all bigrams #2973

phrases.export_phrases() doesn't yield all bigrams #2973

o-nc commented Oct 6, 2020

piskvorky commented Oct 6, 2020 •

edited

Loading

piskvorky commented Oct 8, 2020 •

edited

Loading

gojomo commented Oct 9, 2020

phrases.export_phrases() doesn't yield all bigrams #2973

phrases.export_phrases() doesn't yield all bigrams #2973

Comments

o-nc commented Oct 6, 2020

piskvorky commented Oct 6, 2020 • edited Loading

piskvorky commented Oct 8, 2020 • edited Loading

gojomo commented Oct 9, 2020

piskvorky commented Oct 6, 2020 •

edited

Loading

piskvorky commented Oct 8, 2020 •

edited

Loading