
BUG: Categorical data fails to load from hdf when all columns are NaN #18413


Closed
ssche opened this issue Nov 21, 2017 · 6 comments · Fixed by #18652
Labels
Bug, Categorical (Categorical Data Type), IO HDF5 (read_hdf, HDFStore)
Milestone

Comments

@ssche
Contributor

ssche commented Nov 21, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan], 'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')

df.to_hdf('foo.h5', 'bar', format='table')
pd.read_hdf('foo.h5', 'bar')

Problem description

While storing an hdf file with categorical data containing np.nans works fine, loading the file back into a DataFrame raises an exception.

  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 372, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 742, in select
    return it.get_result()
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 1449, in get_result
    results = self.func(self.start, self.stop, where)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 735, in func
    columns=columns, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 4124, in read
    if not self.read_axes(where=where, **kwargs):
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 3329, in read_axes
    a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 2133, in convert
    if mask.any():
AttributeError: 'bool' object has no attribute 'any'

This exception is related to storing a column that contains np.nan values only (column a stores and loads fine on its own).
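
What sets column b apart can be seen without touching HDF5 at all, by looking at its categories and codes. This is only a minimal sketch; the name empty_cat_cols is illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan],
                   'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# Categorical columns whose category list is empty, i.e. every value is NaN.
empty_cat_cols = [
    col for col in df.columns
    if isinstance(df[col].dtype, CategoricalDtype)
    and len(df[col].cat.categories) == 0
]
print(empty_cat_cols)

# Missing values are encoded with code -1, so an all-NaN column holds only -1.
print(df['b'].cat.codes.tolist())
```

A check like this could be run before to_hdf() to flag the columns that are likely to trigger the exception on affected versions.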

The problem may already lie in how df.to_hdf() stores the metadata for column b (the np.nan-only column), since that metadata is None on load. The relevant code for pd.read_hdf(), in DataCol.convert:

elif meta == u('category'):

    # we have a categorical
    categories = self.metadata
    codes = self.data.ravel()

    # if we have stored a NaN in the categories
    # then strip it; in theory we could have BOTH
    # -1s in the codes and nulls :<
    mask = isnull(categories)
    if mask.any():
        categories = categories[~mask]
        codes[codes != -1] -= mask.astype(int).cumsum().values

Here metadata is None (self.metadata == categories == None), so mask (= isnull(None)) is a scalar bool rather than a boolean array (it is True, because None counts as missing), and calling .any() on it fails.
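
The failure mode can be checked in isolation. The sketch below uses pd.isna, the current name for isnull, and only contrasts its return type for a scalar with that for an array; it does not involve pytables.py itself.

```python
import numpy as np
import pandas as pd

# On a scalar, isna returns a single plain bool, not an array.
scalar_mask = pd.isna(None)

# On an array-like, it returns an elementwise boolean array.
array_mask = pd.isna(np.array(['x', None], dtype=object))

# A plain bool has no .any() method, which is what the traceback reports.
print(type(scalar_mask), hasattr(scalar_mask, 'any'))
print(type(array_mask), array_mask.any())
```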

Expected Output

No exception; dataframe loads as it would without categorical data.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.16.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None

@ssche
Contributor Author

ssche commented Nov 26, 2017

Has anyone had a chance to look at this? Can it be reproduced? If so, is there a workaround or some suggestions?

@jreback
Contributor

jreback commented Nov 26, 2017

Looks like a bug. The issue is that the categories are an empty array; when this is stored, PyTables cannot write a zero-length array, so on readback the categories are None.

In [2]: pd.read_hdf('foo.h5', 'bar')
Out[2]: 
     a    b
0    a  NaN
1    b  NaN
2    c  NaN
3  NaN  NaN

In [3]: pd.read_hdf('foo.h5', 'bar').dtypes
Out[3]: 
a    category
b    category
dtype: object

patch

diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py
index 2a66aea..2ae3765 100644
--- a/pandas/io/pytables.py
+++ b/pandas/io/pytables.py
@@ -2137,10 +2137,13 @@ class DataCol(IndexCol):
                 # if we have stored a NaN in the categories
                 # then strip it; in theory we could have BOTH
                 # -1s in the codes and nulls :<
-                mask = isna(categories)
-                if mask.any():
-                    categories = categories[~mask]
-                    codes[codes != -1] -= mask.astype(int).cumsum().values
+                if categories is None:
+                    categories = []
+                else:
+                    mask = isna(categories)
+                    if mask.any():
+                        categories = categories[~mask]
+                        codes[codes != -1] -= mask.astype(int).cumsum().values
 
                 self.data = Categorical.from_codes(codes,
                                                    categories=categories,

If you can submit a PR, that would be great.

@jreback jreback added Bug Categorical Categorical Data Type Difficulty Novice IO HDF5 read_hdf, HDFStore labels Nov 26, 2017
@jreback jreback added this to the Next Major Release milestone Nov 26, 2017
@jreback jreback changed the title Categorical data fails to load from hdf when all columns are NaN BUG: Categorical data fails to load from hdf when all columns are NaN Nov 26, 2017
@ssche
Contributor Author

ssche commented Dec 5, 2017

Ok, I tried, but am struggling with a new test case which is not passing. It seems to be related to being able to store NaN-only categorical columns now (only fails when including column 'b' in the test case test_categorical_nan_only_columns). I can't see where the axes are different... Any help would be appreciated.

______________________________________________ TestHDFStore.test_categorical_nan_only_columns ______________________________________________

self = <pandas.tests.io.test_pytables.TestHDFStore object at 0x7f844c87ff90>

    def test_categorical_nan_only_columns(self):
        # GH18413
        # Check that read_hdf with categorical columns with NaN-only values can
        # be read back.
        df = pd.DataFrame({
            'a': ['a', 'b', 'c', np.nan],
            'b': [np.nan, np.nan, np.nan, np.nan],
            'c': [1, 2, 3, 4]
        })
        df['a'] = df.a.astype('category')
        df['b'] = df.b.astype('category')
        expected = df.copy()
        with ensure_clean_path(self.path) as path:
            df.to_hdf(path, 'df', format='table', data_columns=True)
            result = read_hdf(path, 'df')
            print 'result', result.dtypes
            print 'expected', expected.dtypes
>           tm.assert_frame_equal(result, expected)

pandas/tests/io/test_pytables.py:4947: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/util/testing.py:1390: in assert_frame_equal
    obj='DataFrame.iloc[:, {idx}]'.format(idx=i))
pandas/util/testing.py:1235: in assert_series_equal
    assert_attr_equal('dtype', left, right)
pandas/util/testing.py:1001: in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'Attributes', message = 'Attribute "dtype" are different', left = 'CategoricalDtype(categories=[], ordered=False)'
right = 'CategoricalDtype(categories=[], ordered=False)', diff = None

    def raise_assert_detail(obj, message, left, right, diff=None):
        if isinstance(left, np.ndarray):
            left = pprint_thing(left)
        elif is_categorical_dtype(left):
            left = repr(left)
        if isinstance(right, np.ndarray):
            right = pprint_thing(right)
        elif is_categorical_dtype(right):
            right = repr(right)
    
        msg = """{obj} are different
    
    {message}
    [left]:  {left}
    [right]: {right}""".format(obj=obj, message=message, left=left, right=right)
    
        if diff is not None:
            msg += "\n[diff]: {diff}".format(diff=diff)
    
>       raise AssertionError(msg)
E       AssertionError: Attributes are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[], ordered=False)
E       [right]: CategoricalDtype(categories=[], ordered=False)

pandas/util/testing.py:1086: AssertionError

@ssche
Contributor Author

ssche commented Dec 5, 2017

Actually, I found a difference (there may be more).

    def test_categorical_nan_only_columns(self):
        # GH18413
        # Check that read_hdf with categorical columns with NaN-only values can
        # be read back.
        df = pd.DataFrame({
            'a': ['a', 'b', 'c', np.nan],
            'b': [np.nan, np.nan, np.nan, np.nan],
            'c': [1, 2, 3, 4]
        })
        df['a'] = df.a.astype('category')
        df['b'] = df.b.astype('category')
        expected = df.copy()
        with ensure_clean_path(self.path) as path:
            df.to_hdf(path, 'df', format='table')
            result = read_hdf(path, 'df', data_columns=True)
            print 'result', result.b.dtype, type(result.b.dtype)
            print 'expected', expected.b.dtype, type(expected.b.dtype)
            print result.b.dtype == expected.b.dtype
            print result.b.dtype.categories
            print expected.b.dtype.categories
            tm.assert_frame_equal(result, expected, check_dtype=False)

shows

result category <class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
expected category <class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
False
Index([], dtype='object')
Float64Index([], dtype='float64')

Is this expected behaviour? Does it also need to be addressed? If so, a pointer to where to look would be helpful...
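
The mismatch can be reproduced without an HDF5 round trip. The sketch below builds two empty CategoricalDtype objects by hand (the variable names are illustrative). Both print with categories=[] in an assertion message, so the difference only shows up when the dtype of the categories index is inspected.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Empty categories backed by an object Index versus a float64 Index.
object_empty = CategoricalDtype(pd.Index([], dtype=object))
float_empty = CategoricalDtype(pd.Index([], dtype=np.float64))

# What an all-NaN float column produces when cast to category.
from_nan_column = pd.Series([np.nan] * 4).astype('category').dtype

print(object_empty == float_empty)
print(from_nan_column.categories.dtype)
```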

@jreback
Contributor

jreback commented Dec 5, 2017

in the patch use Index([]) rather than [] for the empty categories
I think we just fixed this

@ssche
Contributor Author

ssche commented Dec 6, 2017

I had to make it an Index([], dtype=np.float64) because that's the type that is read back from the hdf5 store.
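
For illustration, the None-handling described in this thread can be sketched outside of pytables.py. The helper categorical_from_stored is hypothetical and mirrors only the "categories is None" branch of the patch, with the float64 Index substituted as described above. The NaN-stripping branch is deliberately left out, and this is not the code merged in #18652.

```python
import numpy as np
import pandas as pd


def categorical_from_stored(codes, categories):
    """Hypothetical helper: rebuild a Categorical from stored codes/categories."""
    if categories is None:
        # Nothing was stored for the categories: fall back to an empty
        # float64 Index instead of calling isna() on None.
        categories = pd.Index([], dtype=np.float64)
    return pd.Categorical.from_codes(codes, categories=categories)


# Codes for a four-row column in which every value is missing.
restored = categorical_from_stored(np.array([-1, -1, -1, -1]), None)
original = pd.Series([np.nan] * 4).astype('category')

print(restored.dtype == original.dtype)
print(restored.isna().all())
```

Using a float64 Index rather than a plain [] keeps the restored dtype equal to that of the original all-NaN column, which is what assert_frame_equal checks.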
