Writing HDF5 with StringDtype columns fails with AttributeError: 'StringArray' object has no attribute 'size' #31199
Comments
There is no support for the new extension dtypes in HDF5. We would take pull requests to add this. For format='table' this could work (fixed has no support at all except for basic types). It's not that hard.
I'm a new pandas user so I'm not sure if I'm confident tackling this right now. It looks like serialising to and from feather format is more suitable for me.
Is there any progress?
progress happens when people submit PRs
I found that I could save and restore the series just by setting dtype=object for the relevant columns, for example:
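A minimal sketch of this workaround, not the commenter's original snippet: cast StringDtype columns to object before writing, and cast them back after reading. The helper names `string_columns_to_object` and `object_columns_to_string`, the sample frame, and the file name `example.h5` are all hypothetical and used only for illustration.

```python
import pandas as pd


def string_columns_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every StringDtype column cast to object."""
    string_cols = [
        col for col, dtype in df.dtypes.items() if isinstance(dtype, pd.StringDtype)
    ]
    return df.astype({col: object for col in string_cols})


def object_columns_to_string(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df with the named columns cast back to the string dtype."""
    return df.astype({col: "string" for col in columns})


# Hypothetical sample data: one nullable-string column, one integer column.
df = pd.DataFrame(
    {"name": pd.array(["a", "b", "c"], dtype="string"), "value": [1, 2, 3]}
)

converted = string_columns_to_object(df)  # "name" is now object dtype
restored = object_columns_to_string(converted, ["name"])  # back to string dtype

# With PyTables installed, the converted frame could then be written and read
# back along these lines (the file name is an assumption):
# converted.to_hdf("example.h5", key="df", format="table")
# restored = object_columns_to_string(pd.read_hdf("example.h5", "df"), ["name"])
```

The cast only changes how the strings are stored in memory, so the extra step is paid at write and read time rather than throughout the rest of the code.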
I am currently doing this with large dataframes that include a few string columns, and it has worked so far. Also, here is a more barebones demonstration of this error, in case it's useful to anyone:
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
pandas : 1.0.3
@ellequelle Thanks for the information. I would like to know whether there is any improvement (in terms of time or space) from explicitly setting the column dtype to string rather than using this workaround.
Code Sample, a copy-pastable example if possible
Problem description
The code fails to write the HDF5 file. It fails with the exception
AttributeError: 'StringArray' object has no attribute 'size'
It does write a file, /tmp/fubar.h5, but pandas cannot read it. Commenting out the writing bit and leaving only the reading bit results in:
Expected Output
I expect a well-formed HDF5 file to be written, and the pandas.read_hdf command to return a DataFrame identical to the one written to disk just before.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp150.12.82-default
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.0.0rc0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.14
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
s3fs : 0.4.0
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0