
Writing HDF5 with StringDtype columns fails with AttributeError: 'StringArray' object has no attribute 'size' #31199

Open
gerritholl opened this issue Jan 22, 2020 · 6 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. IO HDF5 read_hdf, HDFStore Strings String extension data type and string data

Comments

@gerritholl

Code Sample, a copy-pastable example if possible

import pandas
df = pandas.read_csv("s3://noaa-global-hourly-pds/2019/72047299999.csv",
        parse_dates=["DATE"],
        dtype=dict.fromkeys(["STATION", "NAME", "REPORT_TYPE",
            "QUALITY_CONTROL", "WND", "CIG", "VIS", "TMP", "DEW", "SLP", "AA1",
            "AA2", "AY1", "AY2", "GF1", "MW1", "REM"], pandas.StringDtype()))
df.to_hdf("/tmp/fubar.h5", "root")
pandas.read_hdf("/tmp/fubar.h5", "root")

Problem description

The code fails to write the HDF5 file, raising AttributeError: 'StringArray' object has no attribute 'size':

Traceback (most recent call last):
  File "mwe15.py", line 7, in <module>
    df.to_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/core/generic.py", line 2504, in to_hdf
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 282, in to_hdf
    f(store)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 274, in <lambda>
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1042, in put
    errors=errors,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1709, in _write_to_group
    data_columns=data_columns,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3100, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2906, in write_array
    empty_array = value.size == 0
AttributeError: 'StringArray' object has no attribute 'size'

It does write a file /tmp/fubar.h5, but pandas cannot read it. Commenting out the writing bit and leaving only the reading bit results in:

Traceback (most recent call last):
  File "mwe15.py", line 8, in <module>
    pandas.read_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 428, in read_hdf
    auto_close=auto_close,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 814, in select
    return it.get_result()
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1829, in get_result
    results = self.func(self.start, self.stop, where)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 798, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3068, in read
    blk_items = self.read_index(f"block{i}_items")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2752, in read_index
    variety = _ensure_decoded(getattr(self.attrs, f"{key}_variety"))
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/tables/attributeset.py", line 301, in __getattr__
    "'%s'" % (name, self._v__nodepath))
AttributeError: Attribute 'block0_items_variety' does not exist in node: '/root'
Closing remaining open files:/tmp/fubar.h5...done

Expected Output

I expect a well-formed HDF5 file to be written, and the pandas.read_hdf command to return a DataFrame identical to the one written to disk just before.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp150.12.82-default
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.0rc0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.14
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
s3fs : 0.4.0
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0

@jreback
Contributor

jreback commented Jan 22, 2020

There is no support for the new extension dtypes in HDF5; we would take pull requests to add this.

For format='table' this could work (fixed format has no support at all except for basic types).

It's not that hard.

@gerritholl
Author

I'm a new pandas user so I'm not sure if I'm confident tackling this right now. It looks like serialising to and from feather format is more suitable for me.

@jbrockmendel jbrockmendel added IO HDF5 read_hdf, HDFStore ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Feb 25, 2020
@sh0416

sh0416 commented Apr 28, 2020

Is there any progress?

@jreback
Contributor

jreback commented Apr 28, 2020

progress happens when people submit PRs

@ellequelle
Contributor

I found that I could save and restore the series just by setting dtype=object for the relevant columns, for example:

>>> df['string_column'] = df['string_column'].astype(object)
>>> df.to_hdf('test.hdf', 'test')

I am currently doing this with large dataframes that include a few string columns and it's worked out so far.
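One way to automate this workaround across many columns is to record which columns carry StringDtype, downcast them to object before writing, and restore the dtype after reading. A sketch under those assumptions (column names are illustrative; the actual to_hdf/read_hdf calls are commented out since they need PyTables):

```python
import pandas as pd

df = pd.DataFrame({"a": pd.array(list("abcd"), dtype="string"),
                   "n": [1, 2, 3, 4]})

# Remember which columns carry the string extension dtype
string_cols = df.columns[df.dtypes == "string"]

# Downcast them to object so HDFStore can serialise the frame
df[string_cols] = df[string_cols].astype(object)
# df.to_hdf("test.hdf", key="test")  # now succeeds

# After reading back, restore the extension dtype
# df = pd.read_hdf("test.hdf", "test")
df[string_cols] = df[string_cols].astype("string")
```

The round trip loses nothing except that pd.NA becomes np.nan while the data sits in object form; astype("string") converts it back.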

Also, here is a more barebones demonstration of this error, in case it's useful to anyone:

>>> df = pd.DataFrame(list('abcd'), columns=['a'], dtype='string')
>>> df
     a
0    a
1    b
2    c
3    d
>>> df.to_hdf('test.hdf', key='test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/core/generic.py", line 2505, in to_hdf
    encoding=encoding,
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 282, in to_hdf
    f(store)
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 274, in <lambda>
    encoding=encoding,
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 1042, in put
    errors=errors,
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 1709, in _write_to_group
    data_columns=data_columns,
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 3010, in write
    self.write_array("values", obj.values)
  File "~/anaconda3/envs/plots/lib/python3.7/site-packages/pandas/io/pytables.py", line 2907, in write_array
    empty_array = value.size == 0
AttributeError: 'StringArray' object has no attribute 'size'

Output of pd.show_versions():

Details

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1
setuptools : 46.1.3.post20200325
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0

@sh0416

sh0416 commented Apr 30, 2020

@ellequelle Thanks for the information. I just want to know whether there is an improvement (in terms of time or space) when I explicitly assign the column type to string using this approach.
