HDFStore.append fails when appending dataframe with empty string column for which min_itemsize < 8 #12242

Closed
amcpherson opened this issue Feb 6, 2016 · 4 comments · Fixed by #23435
Labels
Bug IO HDF5 read_hdf, HDFStore

@amcpherson
Contributor

Reproduced as follows:

In [3]: store = pd.HDFStore('teststore.h5', 'w')

In [4]: chunk = pd.DataFrame({'V1':['a','b','c','d','e'], 'data':np.arange(5)})

In [5]: store.append('df', chunk, min_itemsize={'V1': 4})

In [6]: chunk = pd.DataFrame({'V1':['', ''], 'data': [3, 5]})

In [7]: store.append('df', chunk, min_itemsize={'V1': 4})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-c9bafa18ead0> in <module>()
----> 1 store.append('df', chunk, min_itemsize={'V1': 4})

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in append(self, key, value, format, append, columns, dropna, **kwargs)
    905         kwargs = self._validate_format(format, kwargs)
    906         self._write_to_group(key, value, append=append, dropna=dropna,
--> 907                              **kwargs)
    908
    909     def append_to_multiple(self, d, value, selector, data_columns=None,

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1250
   1251         # write the object
-> 1252         s.write(obj=value, append=append, complib=complib, **kwargs)
   1253
   1254         if s.is_table and index:

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3755         self.create_axes(axes=axes, obj=obj, validate=append,
   3756                          min_itemsize=min_itemsize,
-> 3757                          **kwargs)
   3758
   3759         for a in self.axes:

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3432                 self.values_axes.append(col)
   3433             except (NotImplementedError, ValueError, TypeError) as e:
-> 3434                 raise e
   3435             except Exception as detail:
   3436                 raise Exception(

ValueError: Trying to store a string with len [8] in [V1] column but
this column has a limit of [4]!
Consider using min_itemsize to preset the sizes on these columns

The error is raised only when every value in the appended column is an empty string. A workaround is to set min_itemsize to 8 or higher.

Version information:

In [10]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None
@TomAugspurger
Contributor

So it looks like this comes from np.asarray doing something strange with length 0 fixed-width strings:

ipdb> data
array([[b'', b'']], dtype=object)
ipdb> n
> /Users/tom.augspurger/Envs/dev/lib/python3.4/site-packages/pandas/pandas/io/pytables.py(4453)_convert_string_array()
   4451
   4452     data = np.asarray(data, dtype="S%d" % itemsize)  # itemsize is 0
-> 4453     return data
   4454
   4455

ipdb> data
array([[b'', b'']],
      dtype='|S8')   # note the length 8

The array constructor converts to length 1 strings though:

In [12]: np.array([b'', b''], dtype='S0')
Out[12]:
array([b'', b''],
      dtype='|S1')   # length 1

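Looks like a numpy bug maybe? Either way, we can work around it by changing this line (https://github.com/pydata/pandas/blob/6693a723aa2a8a53a071860a43804c173a7f92c6/pandas/io/pytables.py#L4450) to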

        itemsize = max(1, lib.max_len_string_array(com._ensure_object(data.ravel())))

@amcpherson thanks for the report. Are you interested in submitting a Pull Request with that fix and a test?
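
For readers who want to try the clamping idea outside pandas, here is a minimal sketch using NumPy only. The helper name `convert_string_array` and its `min_width` parameter are hypothetical and are not the pandas implementation; the sketch only illustrates forcing the fixed-width itemsize to at least 1 so that an all-empty column never requests an `S0` dtype. It does not try to reproduce the `S8` result shown above, which may depend on the NumPy version.

```python
import numpy as np


def convert_string_array(data, min_width=1):
    """Hypothetical helper (not the pandas code): cast an object array of
    bytes to a fixed-width bytes dtype that is never narrower than min_width."""
    data = np.asarray(data, dtype=object)
    # Longest element in the array; 0 when every element is empty.
    longest = max((len(x) for x in data.ravel()), default=0)
    # Clamp so that we never ask NumPy for a zero-width "S0" dtype.
    itemsize = max(min_width, longest)
    return np.asarray(data, dtype="S%d" % itemsize)


empty = convert_string_array(np.array([[b"", b""]], dtype=object))
mixed = convert_string_array(np.array([[b"abc", b""]], dtype=object))
print(empty.dtype.itemsize, mixed.dtype.itemsize)
```

With the clamp in place, the all-empty array gets a one-byte itemsize, which fits within any positive min_itemsize limit on the column.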

@TomAugspurger TomAugspurger added the IO HDF5 read_hdf, HDFStore label Feb 6, 2016
@TomAugspurger TomAugspurger modified the milestones: 0.18.0, Next Major Release Feb 6, 2016
@amcpherson
Contributor Author

I can do a pull request, but I can't guarantee how soon.

@finkelm
finkelm commented Apr 30, 2017

This still hasn't been factored in.

@jreback jreback closed this as completed Apr 30, 2017
@jreback jreback reopened this Apr 30, 2017
@jreback
Contributor

jreback commented Apr 30, 2017

> This still hasn't been factored in.

hence the open issue
