Skip to content

TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 26, 2015
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.16.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -146,7 +146,7 @@ Bug Fixes
- Fixed bug on bug endian platforms which produced incorrect results in ``StataReader`` (:issue:`8688`).

- Bug in ``MultiIndex.has_duplicates`` when having many levels causes an indexer overflow (:issue:`9075`, :issue:`5873`)
- Bug in ``pivot`` and `unstack`` where ``nan`` values would break index alignment (:issue:`7466`)
- Bug in ``pivot`` and `unstack`` where ``nan`` values would break index alignment (:issue:`4862`, :issue:`7401`, :issue:`7403`, :issue:`7405`, :issue:`7466`)
- Bug in left ``join`` on multi-index with ``sort=True`` or null values (:issue:`9210`).
- Bug in ``MultiIndex`` where inserting new keys would fail (:issue:`9250`).

Expand Down
38 changes: 22 additions & 16 deletions pandas/core/reshape.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
import pandas.core.common as com
import pandas.algos as algos

from pandas.core.index import MultiIndex, _get_na_value
from pandas.core.index import MultiIndex


class _Unstacker(object):
Expand Down Expand Up @@ -198,14 +198,8 @@ def get_new_values(self):

def get_new_columns(self):
if self.value_columns is None:
if self.lift == 0:
return self.removed_level

lev = self.removed_level
vals = np.insert(lev.astype('object'), 0,
_get_na_value(lev.dtype.type))

return lev._shallow_copy(vals)
return _make_new_index(self.removed_level, None) \
if self.lift != 0 else self.removed_level

stride = len(self.removed_level) + self.lift
width = len(self.value_columns)
Expand All @@ -232,19 +226,31 @@ def get_new_index(self):
# construct the new index
if len(self.new_index_levels) == 1:
lev, lab = self.new_index_levels[0], result_labels[0]
if not (lab == -1).any():
return lev.take(lab)

vals = np.insert(lev.astype('object'), len(lev),
_get_na_value(lev.dtype.type)).take(lab)

return lev._shallow_copy(vals)
return _make_new_index(lev, lab) \
if (lab == -1).any() else lev.take(lab)

return MultiIndex(levels=self.new_index_levels,
labels=result_labels,
names=self.new_index_names,
verify_integrity=False)


def _make_new_index(lev, lab):
from pandas.core.index import Index, _get_na_value

nan = _get_na_value(lev.dtype.type)
vals = lev.values.astype('object')
vals = np.insert(vals, 0, nan) if lab is None else \
np.insert(vals, len(vals), nan).take(lab)

try:
vals = vals.astype(lev.dtype, subok=False, copy=False)
except ValueError:
return Index(vals, **lev._get_attributes_dict())

return lev._shallow_copy(vals)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_make_new_index(lev, lab) doesn't seem to handle the dtype properly for a DatetimeIndex when lab contains -1:

In : idx = pd.Index([datetime.now()])
Out:
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12 16:00:12.487944]
Length: 1, Freq: None, Timezone: None

In : _make_new_index(idx, None)  # doesn't convert back to DatetimeIndex
Out: Index([NaT, 1423756812487944000], dtype='object')

In : _make_new_index(idx, [0])  # properly converts back to DatetimeIndex
Out:
[2015-02-12 16:00:12.487944]
Length: 1, Freq: None, Timezone: None

In : _make_new_index(idx, [0, -1])  # doesn't convert back to DatetimeIndex
Out: Index([1423756812487944000, NaT], dtype='object')

In [309]: Index([datetime.now(), None])  # for comparison
Out[309]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12 16:17:09.616120, NaT]
Length: 2, Freq: None, Timezone: None

I need this in order to get stack() to work properly with NaN/NaT in indices for #9023.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seth-p I think this last is prob a bug, pls open a new report (with your example above is good). and we can fix independently (and you can rebase on top)

on 2nd thought, you can do as part of #9023 as this is an internal method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can fix it I will (as part of #9023), otherwise will punt back to @behzadnouri. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think adding the following code just before the try fixes _make_new_index(lev, lab). I will include it in #9023.

    if lev.dtype.type == np.datetime64:
        nan_indices = [0] if lab is None else (np.array(lab) == -1)
        vals[nan_indices] = None


def _unstack_multiple(data, clocs):
if len(clocs) == 0:
return data
Expand Down
106 changes: 106 additions & 0 deletions pandas/tests/test_frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -12328,6 +12328,25 @@ def test_unstack_dtypes(self):
expected = Series({'float64' : 2, 'object' : 2})
assert_series_equal(result, expected)

# GH7405
for c, d in (np.zeros(5), np.zeros(5)), \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you specify dtypes to the np.zeros constructor? (otherwise you have a platform comparison issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not think np.zeros requires dtypes here, because left and right frame are both defined off the df frame here. i would rather keep this replicate of #7405. also, in the 2nd iteration of for loop there is np.arange with explicit data type.

(np.arange(5, dtype='f8'), np.arange(5, 10, dtype='f8')):

df = DataFrame({'A': ['a']*5, 'C':c, 'D':d,
'B':pd.date_range('2012-01-01', periods=5)})

right = df.iloc[:3].copy(deep=True)

df = df.set_index(['A', 'B'])
df['D'] = df['D'].astype('int64')

left = df.iloc[:3].unstack(0)
right = right.set_index(['A', 'B']).unstack(0)
right[('D', 'a')] = right[('D', 'a')].astype('int64')

self.assertEqual(left.shape, (3, 2))
tm.assert_frame_equal(left, right)

def test_unstack_non_unique_index_names(self):
idx = MultiIndex.from_tuples([('a', 'b'), ('c', 'd')],
names=['c1', 'c1'])
Expand Down Expand Up @@ -12385,6 +12404,93 @@ def verify(df):
for col in ['4th', '5th']:
verify(udf[col])

# GH7403
df = pd.DataFrame({'A': list('aaaabbbb'),'B':range(8), 'C':range(8)})
df.iloc[3, 1] = np.NaN
left = df.set_index(['A', 'B']).unstack(0)

vals = [[3, 0, 1, 2, nan, nan, nan, nan],
[nan, nan, nan, nan, 4, 5, 6, 7]]
vals = list(map(list, zip(*vals)))
idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be obvious (at least it wasn't to me), but idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B') results in Float64Index([nan, 0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0], dtype='float64') -- i.e. the integer levels (i.e. what the user sees) get converted to floats. Thus you are testing that when unstacking a MultiIndex that leaves a single level with integer labels and NaNs, the result is a Float64Index. This is pretty much unavoidable -- unless you explicitly cast the integers to objects rather than float64 -- as MultIndex doesn't seem to support a single level. (Note that an integer MultiIndex level can have missing values, since they are represented by -1 in the labels and not NaN in the labels.) I think either (a) MultIndex should support a single level (and not convert to an Index), or (b) the unstacking code should convert such an integer-with-NaN index to object rather than float64. [I encountered this issue while trying to make the stack() code handle NaN labels propertly.]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless you explicitly cast the integers to objects rather than float64

I actually explicitly cast to objects not float64. the index comes out as float because of type inference in Index.__new__, not in unstacking code.

>>> i
array([nan, 1, 2], dtype=object)
>>> type(i[1])
<class 'int'>
>>> Index(i)
Float64Index([nan, 1.0, 2.0], dtype='float64')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to specify a dtype=object when constructing an index to not have it coerce (if u want it to explicitly preserve it)
though can't think of a reason you would ever actually want to do this

if u really mean integer then we use -1 as a sentinel in the indexers and they are left as ints

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri, yes, precisely, here the conversion of ints to foats is done by Index.__new__. However since this is being compared to the results of left = df.set_index(['A', 'B']).unstack(0), clearly the unstack() code is also doing such a conversion.

@jreback, understood. Two points:

First, other than specifying dtype='object', there doesn't seem to be a way to construct a (single-level) index of _int_s with NaNs. In a MultiIndex with more than one level, a level can be int with NaNs, but a MultiIndex with a single level gets converted automatically to an Index, and in the case of an int level with NaNs, a Float64Index.

In [27]: pd.Index([nan, 1, 2])  # note automatic conversion to Float64Index
Out[27]: Float64Index([nan, 1.0, 2.0], dtype='float64')

In [29]: pd.MultiIndex.from_tuples([(nan,), (1,), (2,)])  # note automatic conversion to Float64Index
Out[29]: Float64Index([nan, 1.0, 2.0], dtype='float64')

In [30]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')])
Out[30]:
MultiIndex(levels=[[1, 2], ['a']],
           labels=[[-1, 0, 1], [0, 0, 0]])

In [31]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')]).levels[0]
Out[31]: Int64Index([1, 2], dtype='int64')  # note first level is Int64Index

Second, my point is that in the course of stacking and/or unstacking (and perhaps other operations), a single level in a multi-level MultiIndex can be "promoted" to be an Index unto itself, and in that case seems (at least currently in the case of stack() and unstack()) to be changed from ints to floats, which I don't think is a good thing to do to index values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri, just to clarify, I don't doubt that within the unstacking code the conversion from ints to floats happens in Index.__init__(), but that doesn't change the fact that in such cases unstack() has the effect of converting ints to floats.

The reason I think this is a potential problem (which predates you, I'm sure, and is also present in stack()), is that if someone then wants to compare the values in the Index of the stacked/unstacked object to the values in the corresponding level of MultiIndex of the original DataFrame, they won't match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seth-p I think the unstack/stack code need to explicity pass the dtype into when the Index is constructed.

Currently you cannot have nans in an Int64Index, so unless you mark it as object it will by definition be coerced. A multi-index can represent this, but single level multi-indexes are not supported as that is just added complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is what you want the behavior to be. Suppose df.index is a 2-level MultiIndex, the first level of which is (effectively) an Int64Index with missing values (i.e. -1 in labels), and that you then unstack the second level (i.e. df.unstack(1)). What should the result index be? There are two options: (a) Float64Index, and (b) Index(..., dtype='object'). With this PR, the result is (a). Personally I think (b) is preferable.

In : df = pd.DataFrame([[11,22],[33,44]], index=pd.MultiIndex.from_tuples([(1, 'a'), (None, 'b')], names=['ints', 'letters']))
In : df.index.levels[0]
Out: Int64Index([1], dtype='int64')
In : df.index.labels[0]
Out: FrozenNDArray([0, -1], dtype='int8')

In : result = df.unstack(1)
In : result.index
Out: Float64Index([nan, 1.0], dtype='float64')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, to see in general how messed up things are when have NaNs in indices, consider the following:

In [698]: idx = MultiIndex.from_tuples([(1, 'a'), (np.nan, 'b')])

In [699]: idx.levels[0]
Out[699]: Int64Index([1], dtype='int64')

In [700]: idx.get_values()
Out[700]: array([(1.0, 'a'), (nan, 'b')], dtype=object)

In [701]: idx[:1].get_values()
Out[701]: array([(1, 'a')], dtype=object)

Why does idx.get_values() convert the int 1 to a float 1.0 just because there's a NaN somewhere else? (I can guess what the implementation is that would lead to this...) Is this a bug?

cols = MultiIndex(levels=[['C'], ['a', 'b']],
labels=[[0, 0], [0, 1]],
names=[None, 'A'])

right = DataFrame(vals, columns=cols, index=idx)
assert_frame_equal(left, right)

df = DataFrame({'A': list('aaaabbbb'), 'B':list(range(4))*2,
'C':range(8)})
df.iloc[2,1] = np.NaN
left = df.set_index(['A', 'B']).unstack(0)

vals = [[2, nan], [0, 4], [1, 5], [nan, 6], [3, 7]]
cols = MultiIndex(levels=[['C'], ['a', 'b']],
labels=[[0, 0], [0, 1]],
names=[None, 'A'])
idx = Index([nan, 0, 1, 2, 3], name='B')
right = DataFrame(vals, columns=cols, index=idx)
assert_frame_equal(left, right)

df = pd.DataFrame({'A': list('aaaabbbb'),'B':list(range(4))*2,
'C':range(8)})
df.iloc[3,1] = np.NaN
left = df.set_index(['A', 'B']).unstack(0)

vals = [[3, nan], [0, 4], [1, 5], [2, 6], [nan, 7]]
cols = MultiIndex(levels=[['C'], ['a', 'b']],
labels=[[0, 0], [0, 1]],
names=[None, 'A'])
idx = Index([nan, 0, 1, 2, 3], name='B')
right = DataFrame(vals, columns=cols, index=idx)
assert_frame_equal(left, right)

# GH7401
df = pd.DataFrame({'A': list('aaaaabbbbb'), 'C':np.arange(10),
'B':date_range('2012-01-01', periods=5).tolist()*2 })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add addtl test that has a tz and freq attached? (or 2 addtl tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is tested in test_pivot_with_tz


df.iloc[3,1] = np.NaN
left = df.set_index(['A', 'B']).unstack()

vals = np.array([[3, 0, 1, 2, nan, 4], [nan, 5, 6, 7, 8, 9]])
idx = Index(['a', 'b'], name='A')
cols = MultiIndex(levels=[['C'], date_range('2012-01-01', periods=5)],
labels=[[0, 0, 0, 0, 0, 0], [-1, 0, 1, 2, 3, 4]],
names=[None, 'B'])

right = DataFrame(vals, columns=cols, index=idx)
assert_frame_equal(left, right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here both left and right have dtypes = [float64]*6. But since df.set_index(['A', 'B']) has dtypes = [int32], I would expect only the resulting columns with NaN to be changed to float64,i.e. I would expect dtypes = [float64, int32, int32, int32, float64, int32].

I noticed this because when testing my code in https://github.com/pydata/pandas/pull/9023/files, when I change the implementation of DataFrame.unstack() to be based on DataFrame.stack(), my code "corrects" the columns back to int32 where possible, whereas this test expects all columns to be float64.


# GH4862
vals = [['Hg', nan, nan, 680585148],
['U', 0.0, nan, 680585148],
['Pb', 7.07e-06, nan, 680585148],
['Sn', 2.3614e-05, 0.0133, 680607017],
['Ag', 0.0, 0.0133, 680607017],
['Hg', -0.00015, 0.0133, 680607017]]
df = DataFrame(vals, columns=['agent', 'change', 'dosage', 's_id'],
index=[17263, 17264, 17265, 17266, 17267, 17268])

left = df.copy().set_index(['s_id','dosage','agent']).unstack()

vals = [[nan, nan, 7.07e-06, nan, 0.0],
[0.0, -0.00015, nan, 2.3614e-05, nan]]

idx = MultiIndex(levels=[[680585148, 680607017], [0.0133]],
labels=[[0, 1], [-1, 0]],
names=['s_id', 'dosage'])

cols = MultiIndex(levels=[['change'], ['Ag', 'Hg', 'Pb', 'Sn', 'U']],
labels=[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]],
names=[None, 'agent'])

right = DataFrame(vals, columns=cols, index=idx)
assert_frame_equal(left, right)

left = df.ix[17264:].copy().set_index(['s_id','dosage','agent'])
assert_frame_equal(left.unstack(), right)

def test_stack_datetime_column_multiIndex(self):
# GH 8039
t = datetime(2014, 1, 1)
Expand Down
1 change: 0 additions & 1 deletion pandas/tests/test_series.py
Original file line number Diff line number Diff line change
Expand Up @@ -5954,7 +5954,6 @@ def test_unstack(self):
idx = pd.MultiIndex.from_arrays([[101, 102], [3.5, np.nan]])
ts = pd.Series([1,2], index=idx)
left = ts.unstack()
left.columns = left.columns.astype('float64')
right = DataFrame([[nan, 1], [2, nan]], index=[101, 102],
columns=[nan, 3.5])
assert_frame_equal(left, right)
Expand Down