D2VTransformer raises if passed a Pandas series without index key 0 #2556

rpetchler · 2019-07-14T16:23:13Z

D2VTransformer raises if passed a Pandas series with an index that does not contain the key 0:

import pandas as pd
from gensim.sklearn_api import D2VTransformer
from gensim.test.utils import common_texts

series = pd.Series(common_texts)
series.index += 1  # Increment the index so that it does not contain the key 0

transformer = D2VTransformer(min_count=1, size=5)
transformer.fit(series)

Output:

Traceback (most recent call last):
  File "main.py", line 9, in <module>
    transformer.fit(series)
  File "venv/lib/python3.7/site-packages/gensim/sklearn_api/d2vmodel.py", line 162, in fit
    if isinstance(X[0], doc2vec.TaggedDocument):
  File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4375, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

This occurs because the fit and transform methods of D2VTransformer require __getitem__ on the passed iterable not to raise an exception for key 0.
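To illustrate the difference between label lookup and positional access without depending on pandas, here is a minimal sketch. `ShiftedLabels` is a hypothetical stand-in written for this example, not a pandas or gensim class. It assumes, as the traceback above suggests, that an integer key passed to `__getitem__` is resolved against the index labels, while iteration yields the values in order.

```python
class ShiftedLabels:
    """Hypothetical stand-in for a Series whose integer index starts at 1."""

    def __init__(self, values, start=1):
        # Map labels (start, start + 1, ...) to values, preserving order.
        self._data = {start + i: v for i, v in enumerate(values)}

    def __getitem__(self, key):
        # Label-based lookup: there is no label 0 when start=1.
        return self._data[key]

    def __iter__(self):
        # Iteration ignores the labels and yields values in order.
        return iter(self._data.values())


docs = ShiftedLabels([["human", "interface"], ["survey", "user"]])

try:
    docs[0]
except KeyError as err:
    lookup_error = err  # same failure mode as the KeyError in the traceback

# Positional access through iteration does not depend on the labels.
first = next(iter(docs))
```

In this sketch `docs[0]` fails while `next(iter(docs))` still returns the first document, which is why a check on the first element should not rely on key 0.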

Versions:

Darwin-18.6.0-x86_64-i386-64bit
Python 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1

piskvorky · 2019-10-08T08:43:34Z

TODO: check other sklearn_api models too – same issue there as in D2VTransformer?

Hiyorimi · 2019-10-09T11:11:23Z

So I need to check here for the first element, not the element at index 0?
Will that be enough to close the issue?

Hiyorimi · 2019-10-10T00:53:53Z

@piskvorky ready for review and, hopefully, a merge.

mpenkov · 2019-10-12T06:37:55Z

@Hiyorimi just looking at this code again, I've realized we're making a copy of a list. This is unnecessary:

        if isinstance([i for i in X[:1]][0], doc2vec.TaggedDocument):
            d2v_sentences = X

You could do something like:

def _get_first(some_list):
    for elem in some_list:
        return elem

...

        if isinstance(_get_first(X), doc2vec.TaggedDocument):
            d2v_sentences = X

WDYT?

Hiyorimi · 2019-10-12T21:04:57Z

Are you sure that it is better than making a copy of a slice?

mpenkov · 2019-10-13T08:33:04Z

Not 100%. Making a copy of the list seems wasteful, though. Do you disagree?

Hiyorimi · 2019-10-13T18:29:50Z

It is 1 line + copy of 1 element
vs
3 lines of function

I have no clue what is better here.

piskvorky · 2019-10-13T18:50:00Z

The standard way to get the first element of a repeatable iterable is next(iter(x)).

For non-repeatable generators, it's a bit more complicated, because we must put the "peeked" first element back after peeking. Otherwise we would change x itself.

What is the type of X here?

[i for i in X[:1]][0] is too opaque, I'm +1 on expressing the logic more clearly.
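As a sketch of the "peek and put back" idea described above: the helper name `peek_first` is hypothetical and is not part of gensim. It returns the first element together with an iterable that still yields every element, so it also works for one-shot generators.

```python
import itertools


def peek_first(iterable):
    """Return (first_element, iterable_with_first_element_restored)."""
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("cannot peek into an empty iterable") from None
    # Chain the peeked element back in front of the remaining items.
    return first, itertools.chain([first], iterator)


# A generator can only be consumed once, so the caller must continue
# with the restored iterable instead of the original generator.
squares = (n * n for n in range(1, 4))
first, restored = peek_first(squares)
remaining = list(restored)
```

For a repeatable iterable such as a list, `next(iter(x))` alone is enough, since each call to `iter` starts again from the beginning.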

Hiyorimi · 2019-10-14T18:08:45Z

Since

>>> a = pd.DataFrame([[1,2], [2,4]], columns=['1', '2'])
>>> a.index += 1
>>> _get_first = lambda X: next(iter(X))
>>> _get_first(a)
'1'

I just added the method.

Hiyorimi · 2019-10-14T18:15:39Z

Opened a PR

mpenkov added the bug label Jul 21, 2019

mpenkov self-assigned this Sep 28, 2019

piskvorky added the difficulty easy, Hacktoberfest, impact LOW, and reach MEDIUM labels Oct 8, 2019

Hiyorimi added a commit to Hiyorimi/gensim that referenced this issue Oct 10, 2019

Handling for iterables without 0-th element, fixes piskvorky#2556

b8dce4f

mpenkov closed this as completed in 289a6ca Oct 10, 2019

mpenkov reopened this Oct 12, 2019

mpenkov closed this as completed in 8624aa2 Nov 1, 2020
