BUG: Categorical data fails to load from hdf when all columns are NaN #18413
Comments
Has anyone had a chance to look at this? Can it be reproduced? If so, is there a workaround or some suggestions?
looks like a bug. The issue is the categories are an empty array; when this is stored, pytables cannot write a zero-len array, so on readback the categories are
patch
If you can submit a PR, that would be great.
Ok, I tried, but am struggling with a new test case which is not passing. It seems to be related to being able to store NaN-only categorical columns now (only fails when including column 'b' in the test case
Actually, I found a difference (there may be more).
def test_categorical_nan_only_columns(self):
    # GH18413
    # Check that read_hdf with categorical columns with NaN-only values can
    # be read back.
    df = pd.DataFrame({
        'a': ['a', 'b', 'c', np.nan],
        'b': [np.nan, np.nan, np.nan, np.nan],
        'c': [1, 2, 3, 4]
    })
    df['a'] = df.a.astype('category')
    df['b'] = df.b.astype('category')
    expected = df.copy()
    with ensure_clean_path(self.path) as path:
        df.to_hdf(path, 'df', format='table')
        result = read_hdf(path, 'df', data_columns=True)
        print 'result', result.b.dtype, type(result.b.dtype)
        print 'expected', expected.b.dtype, type(expected.b.dtype)
        print result.b.dtype == expected.b.dtype
        print result.b.dtype.categories
        print expected.b.dtype.categories
        tm.assert_frame_equal(result, expected, check_dtype=False)
shows
Is this expected behaviour? Does it also need to be addressed? If so, a pointer on where to look would be helpful...
In the patch, use Index([]) rather than [] for the empty categories.
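To illustrate the suggestion above, here is a minimal sketch of rebuilding a NaN-only categorical from its codes with an empty Index as the categories. It is not the actual patch: `stored_categories` is a hypothetical stand-in for whatever the reader gets back when no categories were written, and the fallback logic is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# A code of -1 marks a missing value; a NaN-only column has no categories at all.
codes = np.array([-1, -1, -1, -1], dtype="int8")

# Hypothetical stand-in: nothing was stored for the categories, so we see None.
stored_categories = None

# Fall back to an empty Index rather than None or a plain list.
if stored_categories is None:
    categories = pd.Index([])
else:
    categories = pd.Index(stored_categories)

rebuilt = pd.Categorical.from_codes(codes, categories=categories)
original = pd.Categorical([np.nan] * 4)

# The dtypes of the two empty category indexes are printed rather than
# assumed equal, since an empty Index carries its own dtype.
print(rebuilt.categories.dtype, original.categories.dtype)
```

Both objects hold four missing values and zero categories; whether their dtypes compare equal depends on the dtype of the empty categories index, which is worth checking when a test compares dtypes strictly.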
I had to make it a
Code Sample, a copy-pastable example if possible
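The original sample is not preserved in this copy of the issue. The following is only a sketch of a reproduction, assuming it resembles the frame used in the test case quoted in the comments; the helper `roundtrip_hdf` and the file name in the usage comment are hypothetical, and the round trip needs PyTables installed.

```python
import numpy as np
import pandas as pd

# Column 'a' has real categories plus one NaN; column 'b' is NaN-only.
df = pd.DataFrame({
    "a": ["a", "b", "c", np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1, 2, 3, 4],
})
df["a"] = df["a"].astype("category")
df["b"] = df["b"].astype("category")


def roundtrip_hdf(frame, path, key="df"):
    """Write the frame in table format and read it back (requires PyTables)."""
    frame.to_hdf(path, key=key, format="table")
    return pd.read_hdf(path, key)


# Usage (not executed here because it writes a file):
# result = roundtrip_hdf(df, "nan_only_categorical.h5")
```

On the pandas version listed below, the read step is where the exception described in this report would be expected to appear.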
Problem description
While storing an hdf file with categorical data containing np.nan values works fine, loading the file back into a DataFrame raises an exception. This exception is related to storing a column that contains np.nan values only (column a stores and loads fine on its own).
The problem could already be in the way the metadata for column b (the np.nan-only column) is stored when calling df.to_hdf(), as the metadata is None when loading. The relevant code for pd.read_hdf() in DataCol.convert:
has metadata set to None (self.metadata == categories == None), which in turn makes mask (= isnull(None)) a scalar boolean rather than an array, and thus mask.any() fails.
Expected Output
No exception; dataframe loads as it would without categorical data.
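For reference, the failure mode described in the problem description can be looked at in isolation, without any HDF5 file. This is a minimal sketch, assuming that `pd.isnull` on a scalar returns a plain Python bool; the variable names are illustrative and not taken from the pandas source.

```python
import numpy as np
import pandas as pd

# With no stored categories, the metadata comes back as None.
categories = None
mask = pd.isnull(categories)

# The mask is a scalar, so it has no .any() method to call.
print(type(mask))
print(hasattr(mask, "any"))

# With an actual array of categories the mask is an ndarray and .any() works.
array_mask = pd.isnull(np.array(["a", "b", np.nan], dtype=object))
print(array_mask.any())
```

Calling `mask.any()` on the scalar result is what raises, which is why a guard for `categories is None` (or a non-None empty index) avoids the exception.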
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.16.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None