Phrases keeps learned vocabs as bytestring #2140

midnightradio · 2018-07-26T09:33:17Z

Description

To collect a phrase dictionary, or to find appropriate parameters for a corpus with the Phrases model, I tried to inspect the vocabulary built inside a trained instance of Phrases. However, the vocab member is full of bytestrings. I found that the model intentionally converts every token into a bytestring by calling the method any2utf8. I don't think this is normal, as it produces unexpected behaviour with the docstring example code inside the class (below).

Steps/Code/Corpus to Reproduce

This is the example code in gensim/models/phrases.py, which shows how to get the vocabulary after training the model.

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> pruned_words, counters, total_words = Phrases.learn_vocab(sentences, 100)

Expected Results

>>> counters['computer']
2
>>> counters['response_time']
1
>>> counters.keys()
dict_keys(['computer', 'human', 'computer_human', 'interface', 'human_interface', 'interface_computer', 'response', 'computer_response', 'survey', 'response_survey', 'system', 'survey_system', 'time', 'system_time', 'user', 'time_user', 'user_interface', 'interface_system', 'system_user', 'eps', 'user_eps', 'eps_human', 'human_system', 'system_system', 'system_eps', 'eps_response', 'response_time', 'trees', 'user_trees', 'trees_trees', 'graph', 'trees_graph', 'graph_trees', 'minors', 'graph_minors', 'minors_survey', 'survey_graph'])

Actual Results

>>> counters.keys()
dict_keys([b'computer', b'human', b'computer_human', b'interface', b'human_interface', b'interface_computer', b'response', b'computer_response', b'survey', b'response_survey', b'system', b'survey_system', b'time', b'system_time', b'user', b'time_user', b'user_interface', b'interface_system', b'system_user', b'eps', b'user_eps', b'eps_human', b'human_system', b'system_system', b'system_eps', b'eps_response', b'response_time', b'trees', b'user_trees', b'trees_trees', b'graph', b'trees_graph', b'graph_trees', b'minors', b'graph_minors', b'minors_survey', b'survey_graph', 'computer'])
>>> counters['computer']
0
>>> counters['response_time']
0
>>> counters[b'computer']
2
>>> counters[b'response_time']
1

The keys are stored as bytestrings, so the expected counts are returned only when a bytestring is given as the key.
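
One possible workaround on Python 3 is to decode the keys after training. The following is only a minimal sketch, assuming every bytestring key is valid UTF-8 as in the output above; `decode_vocab` is a hypothetical helper (not part of gensim), and the small dict stands in for the real `counters`:

```python
from collections import defaultdict

def decode_vocab(counters, encoding="utf8"):
    """Return a copy of `counters` whose bytes keys are decoded to str."""
    decoded = defaultdict(int)
    for key, count in counters.items():
        if isinstance(key, bytes):
            key = key.decode(encoding)
        # Use += so a stray str key (created by an earlier failed lookup on
        # the defaultdict) is merged with its bytestring twin.
        decoded[key] += count
    return decoded

# Stand-in for the vocabulary returned by Phrases.learn_vocab in gensim 3.5.0;
# the str key "computer" mimics the empty entry left behind by counters['computer'].
counters = defaultdict(int, {b"computer": 2, b"response_time": 1, "computer": 0})
readable = decode_vocab(counters)
print(readable["computer"])
print(readable["response_time"])
```

The copy costs extra memory, so this is meant for inspecting a vocabulary, not for replacing the model's internal storage.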

Versions

Linux-4.13.0-45-generic-x86_64-with-debian-stretch-sid
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.5.0
FAST_VERSION 1

menshikh-iv · 2018-07-31T11:27:57Z

Hello @midnightradio, this is expected behavior: as you already suggested, any2utf8 is called internally for all input.

I want to close it, any objections @piskvorky @gojomo ?
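
For readers unfamiliar with the conversion mentioned above, here is a rough sketch of how encoding every token before counting leads to bytestring keys. It is a simplification under stated assumptions, not the gensim source: `to_utf8` is a hypothetical stand-in that approximates any2utf8, and `count_unigrams_and_bigrams` is a hypothetical, stripped-down counting loop (no pruning, no common terms):

```python
from collections import defaultdict

def to_utf8(token, encoding="utf8"):
    """Hypothetical stand-in approximating any2utf8: always return UTF-8 bytes."""
    if isinstance(token, str):
        return token.encode("utf8")
    # Round-trip existing bytes through str so the result is valid UTF-8.
    return bytes(token).decode(encoding).encode("utf8")

def count_unigrams_and_bigrams(sentences, delimiter=b"_"):
    """Simplified counting loop: every key ends up as a bytestring."""
    vocab = defaultdict(int)
    for sentence in sentences:
        tokens = [to_utf8(word) for word in sentence]
        for token in tokens:
            vocab[token] += 1
        # Adjacent tokens are joined with a bytes delimiter to form bigram keys.
        for left, right in zip(tokens, tokens[1:]):
            vocab[delimiter.join((left, right))] += 1
    return vocab

vocab = count_unigrams_and_bigrams([["response", "time"], ["computer", "response"]])
print(sorted(vocab))
```

Because the tokens are encoded before they are used as keys, a lookup with a plain `str` such as `"response"` never matches on Python 3.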

piskvorky · 2018-08-03T10:59:17Z

I agree with @midnightradio that it is surprising. I cannot remember why we use bytestrings -- why not simply use unicode? Was it because of less memory in Python 2? (before the unicode improvements in Python 3.3, which make this optimization largely irrelevant)

I think memory was probably the reason (Phrases are hungry).

If the documentation shows incorrect examples (not working), then that's definitely a bug.

This entire module has been on schedule to be replaced by Cython or Bounter for a long time now. The performance and memory saving will be tremendous. In fact, that was one of the main reasons we created Bounter, but then we never applied it.

menshikh-iv · 2018-08-03T18:48:25Z

@piskvorky the trouble is in the Python versions (strings again...): with Python 2 the example works correctly, with Python 3 the mentioned behaviour happens. The difference is in a small detail, see an example:

python2

>>> print("hello")
hello
>>> print(b"hello")
hello

python3

>>> print("hello")
hello
>>> print(b"hello")
b'hello'

So I don't know what the correct way is here: fixing the example or something else (remember that Bounter is unrelated to this issue anyway, not a topic of discussion).
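
The print difference above has a counterpart in dictionary lookups, which is what the docstring example runs into. A small sketch for Python 3, using a plain dict as a stand-in for the Phrases vocabulary (the variable names are illustrative only):

```python
text_key = "computer"
byte_key = text_key.encode("utf8")  # b'computer'

# Stand-in vocabulary keyed by bytestrings, as in gensim 3.5.0.
vocab = {byte_key: 2}

# On Python 3, str and bytes never compare equal, so the str key is not found.
print(text_key == byte_key)
print(text_key in vocab)
print(vocab.get(text_key, 0))

# The bytestring key works as expected.
print(vocab[byte_key])
```

On Python 2, `"computer"` and `b"computer"` are the same type, which is why the docstring example works there.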

piskvorky · 2018-08-03T20:46:00Z

But Bounter is very much related -- the whole point of using bytestrings in Phrases was to save memory (unless I'm misremembering). It's an optimization, not a deep API decision. All counting in Phrases should be replaced by Bounter (faster, more memory-efficient).

The documentation examples should of course work in both Python 2 and Python 3, like the rest of Gensim. I consider it a bug if they don't (here and anywhere else).

midnightradio · 2018-08-06T06:53:23Z

In my opinion, the documentation should be updated first, regardless of whether this is a bug or just a matter of time until contributors address it with Bounter or Cython.

I think Phrases is very useful. But it took me some time to find its usefulness in my research, because I first had to look into this unexpected behaviour, which is not caused by an internal bug but simply by the way the data is stored, which causes inconvenience. Without knowing the history, it is not easy to understand the behaviour, and the module is highly likely to be considered unreliable.

piskvorky · 2018-08-06T12:11:50Z

@midnightradio I agree. Can you send a PR with the documentation update? What sort of language would have made your life easier? (imagine you're explaining to yourself ~3 weeks ago).

menshikh-iv · 2018-08-07T15:29:00Z

So, in this case, let's decide that the bug is in the docstring example; we need to make it cross-Python (i.e. correct for both py2 and py3).

piskvorky · 2020-10-10T19:50:40Z

Phrases were reimplemented in #2976, using standard strings. Gensim is now py3 only, and unicode strings representable as ASCII are as efficient as in py2, so this bytestring optimization was no longer needed.

Switching to Bounter for counting would be still awesome though. We have a ticket for that, #1654.
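
The remark above about ASCII-representable strings can be checked informally. The sketch below assumes CPython 3.3 or later, where `sys.getsizeof` reports object sizes and ASCII-only `str` objects use one byte per character; it compares how much `str` and `bytes` objects grow per added character, not their fixed per-object overhead, which does differ:

```python
import sys

def growth_per_char(make, short_len=50, long_len=100):
    """Bytes of extra memory per additional character for objects built by `make`."""
    delta = sys.getsizeof(make(long_len)) - sys.getsizeof(make(short_len))
    return delta / (long_len - short_len)

# ASCII-only text stored as str versus as bytes.
str_growth = growth_per_char(lambda n: "a" * n)
bytes_growth = growth_per_char(lambda n: b"a" * n)

print("str bytes per char:  ", str_growth)
print("bytes bytes per char:", bytes_growth)
```

Non-ASCII text behaves differently: a `str` containing such characters may use 2 or 4 bytes per character, so the comparison only holds for ASCII-representable tokens.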

menshikh-iv added the "need info" label Jul 31, 2018

menshikh-iv added the "bug" and "difficulty easy" labels and removed the "need info" label Aug 7, 2018

piskvorky closed this as completed Oct 10, 2020

mpenkov mentioned this issue Oct 28, 2020

Update changelog for 4.0.0 release #2981

Merged
