Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where saved Phrases model did not load its connector_words #3116

Merged
merged 4 commits into from
May 8, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ Changes

- LsiModel: Only log top words that actually exist in the dictionary (PR [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091), [@kmurphy4](https://github.com/kmurphy4))
- [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness))
- fix bug when loading saved Phrases model (PR [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116), [@aloknayak29](https://github.com/aloknayak29))
Copy link
Owner

@piskvorky piskvorky May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have a script that spits out a formatted list of all merged PRs automatically, during a release?

Copy link
Collaborator

@mpenkov mpenkov May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, but I've been keeping track of them as we merge them in smart_open, and I've found that way to be more manageable (less risk of missing PRs, and easier to see what has changed since the last release), so I'm trying out the same approach with gensim.

It's also easier to deal with this while the PR is still in my short term memory. When I'm doing the release, that memory is long gone.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, no problem. One advantage of the script is consistency in formatting – see for example the change here :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've got a script that handles these changes, but multiple versions of it were floating around my work/home machines. It's now committed to the repo, so once we nail down the format of the entries, the formatting should no longer be an issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Once the two scripts output stuff in the same format, the consistency problem will be solved, right?

Copy link
Owner

@piskvorky piskvorky May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding, but why do we need two scripts? generate_changelog.py seems to be doing what we need = generate formatted CHANGELOG from all new PRs since the last release. When / why would we use annotate_pr.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using annotate_pr.py for each PR (since a few week ago). I'll use generate_changelog.py for the next release. I'll keep whichever ends up being the most convenient - for now, it seems to be the former, but time will tell.


### Documentation

Expand Down
16 changes: 7 additions & 9 deletions gensim/models/phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -391,15 +391,13 @@ def load(cls, *args, **kwargs):
raise ValueError(f'failed to load {cls.__name__} model, unknown scoring "{model.scoring}"')

# common_terms didn't exist pre-3.?, and was renamed to connector in 4.0.0.
if hasattr(model, "common_terms"):
model.connector_words = model.common_terms
del model.common_terms
else:
logger.warning(
'older version of %s loaded without common_terms attribute, setting connector_words to an empty set',
cls.__name__,
)
model.connector_words = frozenset()
if not hasattr(model, "connector_words"):
if hasattr(model, "common_terms"):
model.connector_words = model.common_terms
del model.common_terms
else:
logger.warning('loaded older version of %s, setting connector_words to an empty set', cls.__name__)
model.connector_words = frozenset()

if not hasattr(model, 'corpus_word_count'):
logger.warning('older version of %s loaded without corpus_word_count', cls.__name__)
Expand Down
45 changes: 32 additions & 13 deletions gensim/test/test_phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -305,32 +305,42 @@ def test_pruning(self):


class TestPhrasesPersistence(PhrasesData, unittest.TestCase):

def test_save_load_custom_scorer(self):
"""Test saving and loading a Phrases object with a custom scorer."""
bigram = Phrases(self.sentences, min_count=1, threshold=.001, scoring=dumb_scorer)
with temporary_file("test.pkl") as fpath:
bigram = Phrases(self.sentences, min_count=1, threshold=.001, scoring=dumb_scorer)
bigram.save(fpath)
bigram_loaded = Phrases.load(fpath)
test_sentences = [['graph', 'minors', 'survey', 'human', 'interface', 'system']]
seen_scores = list(bigram_loaded.find_phrases(test_sentences).values())

assert all(score == 1 for score in seen_scores)
assert len(seen_scores) == 3 # 'graph minors' and 'survey human' and 'interface system'
test_sentences = [['graph', 'minors', 'survey', 'human', 'interface', 'system']]
seen_scores = list(bigram_loaded.find_phrases(test_sentences).values())

assert all(score == 1 for score in seen_scores)
assert len(seen_scores) == 3 # 'graph minors' and 'survey human' and 'interface system'

def test_save_load(self):
"""Test saving and loading a Phrases object."""
bigram = Phrases(self.sentences, min_count=1, threshold=1)
with temporary_file("test.pkl") as fpath:
bigram.save(fpath)
bigram_loaded = Phrases.load(fpath)

test_sentences = [['graph', 'minors', 'survey', 'human', 'interface', 'system']]
seen_scores = set(round(score, 3) for score in bigram_loaded.find_phrases(test_sentences).values())
assert seen_scores == set([
5.167, # score for graph minors
3.444 # score for human interface
])

def test_save_load_with_connector_words(self):
"""Test saving and loading a Phrases object."""
connector_words = frozenset({'of'})
bigram = Phrases(self.sentences, min_count=1, threshold=1, connector_words=connector_words)
with temporary_file("test.pkl") as fpath:
bigram = Phrases(self.sentences, min_count=1, threshold=1)
bigram.save(fpath)
bigram_loaded = Phrases.load(fpath)
test_sentences = [['graph', 'minors', 'survey', 'human', 'interface', 'system']]
seen_scores = set(round(score, 3) for score in bigram_loaded.find_phrases(test_sentences).values())

assert seen_scores == set([
5.167, # score for graph minors
3.444 # score for human interface
])
assert bigram_loaded.connector_words == connector_words

def test_save_load_string_scoring(self):
"""Test backwards compatibility with a previous version of Phrases with custom scoring."""
Expand Down Expand Up @@ -385,6 +395,15 @@ def test_save_load(self):
bigram_loaded[['graph', 'minors', 'survey', 'human', 'interface', 'system']],
['graph_minors', 'survey', 'human_interface', 'system'])

def test_save_load_with_connector_words(self):
"""Test saving and loading a FrozenPhrases object."""
connector_words = frozenset({'of'})
with temporary_file("test.pkl") as fpath:
bigram = FrozenPhrases(Phrases(self.sentences, min_count=1, threshold=1, connector_words=connector_words))
bigram.save(fpath)
bigram_loaded = FrozenPhrases.load(fpath)
self.assertEqual(bigram_loaded.connector_words, connector_words)

def test_save_load_string_scoring(self):
"""Test saving and loading a FrozenPhrases object with a string scoring parameter.
This should ensure backwards compatibility with the previous version of FrozenPhrases"""
Expand Down