Skip to content

BUG: PeriodIndex construction fails (or produces erroneous result) with memoryview as input. #13120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ssanderson opened this issue May 9, 2016 · 14 comments · Fixed by #41431
Labels
Constructors Series/DataFrame/Index/pd.array Constructors good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@ssanderson
Copy link
Contributor

Code Sample, a copy-pastable example if possible

First noted in discussion here: https://github.com/quantopian/zipline/pull/1190/files/fe5a2a888a498d838c3cc43de2c33ac08b20d2a7#r62506342. This most often matters when passing a value typed in Cython as something like int64_t[:].

Simple repro cases:

DatetimeIndex construction fails (looks like trying to parse strings?), and vanilla Index construction returns an array of unicode strings.

In [6]: import pandas as pd; import numpy as np

In [7]: pd.__version__, np.__version__
Out[7]: ('0.18.1+20.gaf7bdd3', '1.11.0')

In [8]: m = memoryview(np.arange(5))

In [9]: pd.DatetimeIndex(m)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-09b855a663a4> in <module>()
----> 1 pd.DatetimeIndex(m)

/home/ssanderson/clones/pandas/pandas/util/decorators.pyc in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

/home/ssanderson/clones/pandas/pandas/tseries/index.pyc in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
    283                 data = tslib.parse_str_array_to_datetime(data, freq=freq,
    284                                                          dayfirst=dayfirst,
--> 285                                                          yearfirst=yearfirst)
    286             else:
    287                 data = tools.to_datetime(data, errors='raise')

/home/ssanderson/clones/pandas/pandas/tslib.pyx in pandas.tslib.parse_str_array_to_datetime (pandas/tslib.c:39610)()
   2175                         except ValueError:
   2176                             if is_coerce:
-> 2177                                 iresult[i] = NPY_NAT
   2178                                 continue
   2179                             raise

ValueError:

In [10]: pd.Index(m)
Out[10]: Index([u'', u'', u'', u'', u''], dtype='object')

Expected Output

I'd expect this to either error immediately or provide the same result as coercing the provided memoryview into a numpy array.

output of pd.show_versions()

In [6]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: af7bdd3883c8d61e9d9388d3aa699930eee7fff8
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+20.gaf7bdd3
nose: None
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.24
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

cc @jbredeche

@jreback
Copy link
Contributor

jreback commented May 9, 2016

I suppose. You are much better off just doing np.array(m) before sending this in. If you want to push a PR which checks if its a memory buffer then ok.

@jreback
Copy link
Contributor

jreback commented May 9, 2016

note that _simple_new methods are NOT public, you should really really not sue internal/private things.

@jreback jreback added Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate labels May 9, 2016
@jreback
Copy link
Contributor

jreback commented May 9, 2016

We prob don't handle construction for Series/DataFrame with a memory buffer either.

@jreback jreback added this to the Next Major Release milestone May 9, 2016
@ssanderson
Copy link
Contributor Author

You are much better off just doing np.array(m) before sending this in

This is what we're currently doing (and what we'll probably continue to do for a while since we support older versions of pandas).

note that _simple_new methods are NOT public, you should really really not sue internal/private things.

Yep, I assumed as much. What would you think about making public versions of some of the Pandas object constructors that accept limited, specific arguments and don't do any coercion? We frequently have the problem in Zipline that we have an object of a well-known type and we want to construct an Index/Series/DataFrame without incurring the extra overhead of Pandas having to infer the type of our input.

@jreback
Copy link
Contributor

jreback commented May 9, 2016

@ssanderson not sure what you mean. The inference is generally very light so you shouldn't incur costs with anything. If you have a specific example I can look.

@ssanderson
Copy link
Contributor Author

If you want to push a PR which checks if its a memory buffer then ok.

Would the preferred behavior here be to coerce to ndarray, or to treat this as an error and barf? It seems like Pandas' general philosophy is to try its best to do reasonable coercions when possible.

@ssanderson
Copy link
Contributor Author

The inference is generally very light so you shouldn't incur costs with anything. If you have a specific example I can look.

The case that comes to mind offhand is something like DataFrame.from_records when reading values out of a tabular database. In that case, we know that all our records will have the same columns and dtypes, but there isn't a good way to tell pandas that information.

@jreback
Copy link
Contributor

jreback commented May 9, 2016

@ssanderson no you could certainly add it as a clause somewhere here: https://github.com/pydata/pandas/blob/master/pandas/indexes/base.py#L124

easy enough to add a:

def coerce_memory_view_to_ndarray(data):
    if isinstance(data, memoryview):
           data = np.array(data)
   return data

though better to put in the if/else's if possible (as it reduces the checks needed) e.g. right before the scalar check.

        elif isinstance(data, memoryview)
               return cls(np.array(data), ......)

@jreback
Copy link
Contributor

jreback commented May 9, 2016

btw .from_records is trivial, give it a structured array

In [9]: arr = np.zeros((2,), dtype=('i4,f4,a10'))

In [10]: arr[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [11]: arr
Out[11]: 
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', 'S10')])

In [12]: DataFrame.from_records(arr)
Out[12]: 
   f0   f1     f2
0   1  2.0  Hello
1   2  3.0  World

@ssanderson
Copy link
Contributor Author

btw .from_records is trivial, give it a structured array

TIL

@jreback
Copy link
Contributor

jreback commented May 9, 2016

@ssanderson this is what all of the read-sql type things do (and pytables as well), these are row-structured arrays essentially.

@jbrockmendel
Copy link
Member

FWIW this works in master for DatetimeIndex and TimedeltaIndex (though not for PeriodIndex):

m = memoryview(np.arange(5))

dti = pd.DatetimeIndex(m)
tdi = pd.TimedeltaIndex(m)

pd.PeriodIndex(m, freq='Y')  # <-- pandas._libs.tslibs.parsing.DateParseError: Unable to parse datetime string: 
pd.Int64Index(m)   # <-- TypeError: String dtype not supported, you may need to explicitly cast to a numeric type
pd.Float64Index(m) # <-- TypeError: String dtype not supported, you may need to explicitly cast to a numeric type

>>> pd.Index(m)
Index([u'', u'', u'', u'', u''], dtype='object')

@jbrockmendel jbrockmendel added Constructors Series/DataFrame/Index/pd.array Constructors and removed Effort Medium labels Oct 16, 2019
@mroeschke mroeschke changed the title Index construction fails (or produces erroneous result) with memoryview as input. BUG: Period/Int/FloatIndex construction fails (or produces erroneous result) with memoryview as input. Mar 31, 2020
@mroeschke mroeschke added Index Related to the Index class or subclasses Numeric Operations Arithmetic, Comparison, and Logical operations Period Period data type and removed Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions Datetime Datetime data dtype labels Mar 31, 2020
@mroeschke mroeschke changed the title BUG: Period/Int/FloatIndex construction fails (or produces erroneous result) with memoryview as input. BUG: PeriodIndex construction fails (or produces erroneous result) with memoryview as input. May 5, 2020
@mroeschke mroeschke added Bug and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels May 5, 2020
@mroeschke
Copy link
Member

Only fails with PeriodIndex now on master

In [25]: pd.Int64Index(m)
Out[25]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [26]: pd.Float64Index(m)
Out[26]: Float64Index([0.0, 1.0, 2.0, 3.0, 4.0], dtype='float64')

In [27]: pd.Index(m)
Out[27]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [28]: pd.PeriodIndex(m, freq='Y')
DateParseError: day is out of range for month

In [29]: pd.__version__
Out[29]: '1.1.0.dev0+1466.ga3477c769.dirty'

@mroeschke
Copy link
Member

Actually this fully works for a PeriodIndex. Could all use a test

In [1]: m = memoryview(np.arange(2000, 2005))

In [2]: pd.PeriodIndex(m, freq='Y')
Out[2]: PeriodIndex(['2000', '2001', '2002', '2003', '2004'], dtype='period[A-DEC]', freq='A-DEC')

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Constructors Series/DataFrame/Index/pd.array Constructors Index Related to the Index class or subclasses Period Period data type labels May 7, 2020
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Sep 22, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants