Freezing Trigram Phrase models yields inconsistent results #3326

Open
finnhacks42 opened this issue Apr 14, 2022 · 7 comments
@finnhacks42

Problem description

Applying a Trigram phrase model yields different results after freeze()

Code to reproduce

import gensim
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

def pretty_print_articles(articles,name):
    print(name)
    print("-------------")
    for a in articles:
        print(" ".join(a))
    print("-------------")

# show version information
print("Gensim version:",gensim.__version__)

# Make demo data
articles = [
    'high risk of default raises concern at the central bank of brazil',
    'high school students demand raises in pay outside the central bank of brazil',
    'concern about rising prices reducing investment',
    'the school is at high risk',
    'simple noconnector trigram',
    'simple noconnector trigram'
]
for i, a in enumerate(articles):
    articles[i] = a.split(" ")

# build bigram model
bigram = Phrases(articles, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# apply bigram model
bigram_articles = [bigram[a] for a in articles]

# build trigram model
trigram = Phrases(bigram_articles, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# apply trigram model
trigram_articles = [trigram[a] for a in bigram_articles]

# verify that bigrams and trigrams working as expected
pretty_print_articles(bigram_articles,'Bigrammed')
pretty_print_articles(trigram_articles,'Trigrammed')

# freeze models
bigram_f = bigram.freeze()
trigram_f = trigram.freeze()

# check that frozen models give the same results
bigram_articles_f = [bigram_f[a] for a in articles]
print('bigram models consistent?:',bigram_articles == bigram_articles_f)

# Problem here!
trigram_articles_f = [trigram_f[a] for a in bigram_articles]
print('trigram articles consistent?:',trigram_articles==trigram_articles_f)

print("Lifecycle events")
print("----------------")
print("bigram",bigram.lifecycle_events)
print("trigram:",trigram.lifecycle_events)
print("bigram_frozen",bigram_f.lifecycle_events)
print("trigram_frozen:",trigram_f.lifecycle_events)
Output
Gensim version: 4.1.2
Bigrammed
-------------
high_risk of default raises concern at the central_bank of brazil
high school students demand raises in pay outside the central_bank of brazil
concern about rising prices reducing investment
the school is at high_risk
simple_noconnector trigram
simple_noconnector trigram
-------------
Trigrammed
-------------
high_risk of default raises concern at the central_bank_of_brazil
high school students demand raises in pay outside the central_bank_of_brazil
concern about rising prices reducing investment
the school is at high_risk
simple_noconnector_trigram
simple_noconnector_trigram
-------------
bigram models consistent?: True
trigram articles consistent?: False
Lifecycle events
----------------
bigram [{'msg': 'built Phrases<45 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.736409', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
trigram: [{'msg': 'built Phrases<40 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749474', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
bigram_frozen [{'msg': 'exported FrozenPhrases<5 phrases, min_count=1, threshold=1> from Phrases<45 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749598', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
trigram_frozen: [{'msg': 'exported FrozenPhrases<0 phrases, min_count=1, threshold=1> from Phrases<40 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749631', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]

Further Info

I think the issue lies in export_phrases. When split is called, it cannot distinguish between '_'s added by the bigram model and those added by the trigram model.

print("Bigram phrases:",bigram.export_phrases())
print("Trigram phrases",trigram.export_phrases())

Gives:

Bigram phrases: {'high_risk': 7.5, 'central_bank': 11.25, 'bank_of_brazil': 11.25, 'simple_noconnector': 11.25, 'noconnector_trigram': 11.25}
Trigram phrases {}
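
A minimal sketch of why the trigram export comes back empty (illustrative, not gensim internals verbatim; vocab counts simplified from the toy corpus above):

# The trigram model was trained on the bigrammed articles, so its vocab
# contains the already-joined token 'central_bank' but never the raw 'central'.
trigram_vocab = {'central_bank': 2, 'of': 4, 'brazil': 2, 'central_bank_of_brazil': 2}

unigrams = 'central_bank_of_brazil'.split('_')   # ['central', 'bank', 'of', 'brazil']
worda, wordb = unigrams[0], unigrams[-1]         # 'central', 'brazil'

# 'central' was never a token the trigram model saw (it only saw 'central_bank'),
# so the worda lookup fails and the candidate is silently dropped from export_phrases().
print(worda in trigram_vocab)   # False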

Versions

Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17
Python 3.8.13 (default, Mar 28 2022, 11:38:47) 
[GCC 7.5.0]
Bits 64
NumPy 1.21.5
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1
@finnhacks42
Author

This issue also leads to inconsistent results in trigram models saved & reloaded from disk.

@piskvorky piskvorky added the bug Issue described a bug label Apr 15, 2022
@piskvorky piskvorky self-assigned this Apr 15, 2022
@piskvorky
Owner

Thanks for reporting. Are you interested in figuring out the cause?

All code lives in the phrases module, and is fairly straightforward.

@finnhacks42
Author

Yes, the issue is that phrases are stored in source_vocab as delimiter-joined strings, e.g. 'chief_executive'. When you fit higher-order models, these composite words get joined to other words, e.g. 'chief_executive_officer'. However, export_phrases just splits on '_', assumes that the first token is worda and the last is wordb, and computes score(worda, wordb). Taking the example of 'chief_executive_officer', it would compute score('chief', 'officer') when it should be computing score('chief_executive', 'officer').
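
To make the ambiguity concrete (a toy sketch mirroring the behaviour described above, not the gensim source verbatim):

phrase = 'chief_executive_officer'     # key stored in the trigram model's source_vocab
parts = phrase.split('_')              # ['chief', 'executive', 'officer'] -- grouping lost
worda, wordb = parts[0], parts[-1]     # export_phrases effectively scores ('chief', 'officer')
# The correct components, ('chief_executive', 'officer'), can no longer be recovered.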

A hacky workaround is to use a different delimiter for each stage, but then you end up with phrasegram keys like 'chief-executive_officer'.
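
For instance (a sketch of that workaround; delimiter is an existing Phrases parameter):

# Workaround sketch: a distinct delimiter per stage keeps the joins unambiguous.
bigram = Phrases(articles, min_count=1, threshold=1, delimiter='_',
                 connector_words=ENGLISH_CONNECTOR_WORDS)
trigram = Phrases([bigram[a] for a in articles], min_count=1, threshold=1,
                  delimiter='-', connector_words=ENGLISH_CONNECTOR_WORDS)
# ...but the phrasegram keys now mix delimiters, e.g. 'chief-executive_officer'.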

I think the easiest fix is probably to switch to making the phrase keys tuples. I should have time to work on it and put in a PR next week.

@finnhacks42
Author

Looking a bit further into the code, changing phrase keys to tuples would not be completely trivial, as this is a component of the model that is serialized on save, so some backwards-compatibility code would be needed. I also noted the comment # 3.8 => 4.0: phrasegram keys are strings, not tuples with bytestrings, which suggests the code was refactored from tuples to strings in the past.

Do you have any thoughts on whether I should refactor the vocab keys back to being tuples? An alternative would be to update the documentation to make it clear that higher-order models must use different delimiters, and write a new class to simplify the construction and use of higher-order models.
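
For the second alternative, a rough sketch of what such a helper might look like (class name and parameters are hypothetical, made up for illustration):

from gensim.models.phrases import Phrases

class HigherOrderPhrases:
    """Hypothetical helper: chain Phrases models, one distinct delimiter per stage."""
    def __init__(self, sentences, order=3, delimiters='_-+', **phrases_kwargs):
        self.models = []
        corpus = list(sentences)
        for stage in range(order - 1):
            # Each stage trains on the previous stage's output with its own delimiter.
            model = Phrases(corpus, delimiter=delimiters[stage], **phrases_kwargs)
            corpus = [model[sentence] for sentence in corpus]
            self.models.append(model)

    def __getitem__(self, sentence):
        # Apply each stage's model in training order.
        for model in self.models:
            sentence = model[sentence]
        return sentence

Usage would mirror plain Phrases, e.g. HigherOrderPhrases(articles, order=3, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS).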

@piskvorky
Owner

piskvorky commented Apr 17, 2022

Thanks for looking into this. IIRC we went for strings to save on RAM; tuples introduce a lot of memory overhead. These "phrases" models are memory-hungry by the nature of what they do (but see also #1654).

But if freeze() is broken then that's not acceptable.

Taking the example of 'chief_executive_officer', it would compute score('chief', 'officer') when it should be computing score('chief_executive', 'officer').

Any idea why? I don't remember how all this works any more :( Can't we simply split on all _? Or calculate the score from the full subcomponents? Or was there some algorithmic problem with that? I imagine there's a reason why it works the way it does.


@finnhacks42
Author

I guess the fundamental problem is that if you have 'chief_executive_officer' you don't know whether the underlying tokens are 'chief_executive' and 'officer', or 'chief' and 'executive_officer'. You could score on all sub-components (after stripping out connector words), but that would mean new scoring functions that work flexibly on two or more words. For example, training a Phrases model over existing bigrams can yield valid bigrams (words newly paired because the previous bigramming has changed their individual frequencies), trigrams (a bigram paired with a word), and 4-grams (two bigrams paired together).

The issue with freeze() will occur whenever the tokens you are trying to build bigram models over already contain the bigram delimiter. Creating trigrams by repeated application of Phrases with the same delimiter can be seen as a special case of this. The issue could also occur if you trained a plain bigram model over tokens that contained the delimiter (for example, if you used '-' as the delimiter and your tokens contained hyphenated words). This is probably OK, though, since many models assume you have removed any special characters first (although it should probably be documented, and possibly raise a warning/error). If we did that, then training a higher-order model with the same delimiter as the first would trigger that warning/error.
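
A sketch of what such a guard could look like (hypothetical; this check is not existing gensim behaviour):

import warnings

def warn_on_delimiter_collision(sentences, delimiter='_'):
    # Hypothetical check, not part of gensim: if any input token already
    # contains the delimiter, freeze()/export_phrases() cannot distinguish
    # pre-existing delimiters from ones this Phrases model adds.
    for sentence in sentences:
        for token in sentence:
            if delimiter in token:
                warnings.warn(
                    f"token {token!r} already contains delimiter {delimiter!r}; "
                    "consider a different delimiter for this Phrases stage"
                )
                return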
