GH15943 Fixed defaults for compression in HDF5 #16355

Merged: 1 commit, Jun 13, 2017
64 changes: 51 additions & 13 deletions doc/source/io.rst
@@ -4066,26 +4066,64 @@ Compression
+++++++++++

``PyTables`` allows the stored data to be compressed. This applies to
all kinds of stores, not just tables.
all kinds of stores, not just tables. Two parameters are used to
control compression: ``complevel`` and ``complib``.

``complevel`` specifies if and how hard data is to be compressed.
Contributor (review comment):

I think let's change the defaults to `complevel=None`, `complib=None`.

Then, if `complevel` is not None and > 0, you can set `complib` to zlib (if it is not defined);
if `complib` is not None and `complevel` is not 0, then you set the filter.
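A rough sketch of that proposal (a hypothetical helper for illustration, not pandas API; the actual implementation appears in pandas/io/pytables.py further down):

```python
# Hypothetical illustration of the proposed defaulting rules.
def resolve_compression(complevel=None, complib=None):
    # compression requested but no library chosen: fall back to zlib
    if complevel is not None and complevel > 0 and complib is None:
        complib = 'zlib'
    # only build a PyTables filter when compression is actually enabled
    make_filter = complib is not None and complevel not in (None, 0)
    return complevel, complib, make_filter
```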

``complevel=0`` and ``complevel=None`` disable compression,
while ``0<complevel<10`` enables compression.
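For example (an illustrative snippet, not part of this diff):

.. code-block:: python

   # complevel=5 enables compression with the default library (zlib)
   df.to_hdf('data.h5', 'df', complevel=5)

   # complevel=0 (or leaving complevel unset) stores the data uncompressed
   df.to_hdf('plain.h5', 'df', complevel=0)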

``complib`` specifies which compression library to use. If nothing is
specified, the default library ``zlib`` is used. A compression library
usually optimizes for either good compression rates or speed, and the
results will depend on the type of data; which library to choose depends
on your specific needs. The list of supported compression libraries:

- `zlib <http://zlib.net/>`_: The default compression library. A classic in terms of compression, achieves good compression rates but is somewhat slow.
- `lzo <http://www.oberhumer.com/opensource/lzo/>`_: Fast compression and decompression.
- `bzip2 <http://bzip.org/>`_: Good compression rates.
- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

.. versionadded:: 0.20.2

   Support for alternative blosc compressors:

   - `blosc:blosclz <http://www.blosc.org/>`_: This is the
     default compressor for ``blosc``.
   - `blosc:lz4
     <https://fastcompression.blogspot.dk/p/lz4.html>`_:
     A compact, very popular and fast compressor.
   - `blosc:lz4hc
     <https://fastcompression.blogspot.dk/p/lz4.html>`_:
     A tweaked version of LZ4, produces better
     compression ratios at the expense of speed.
   - `blosc:snappy <https://google.github.io/snappy/>`_:
     A popular compressor used in many places.
   - `blosc:zlib <http://zlib.net/>`_: A classic;
     somewhat slower than the previous ones, but
     achieving better compression ratios.
   - `blosc:zstd <https://facebook.github.io/zstd/>`_: An
     extremely well balanced codec; it provides the best
     compression ratios among the codecs above, at
     reasonably fast speed.
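For example, to select one of these codecs (an illustrative snippet, not part of this diff):

.. code-block:: python

   # blosc with the zstd codec at the highest compression level
   store = pd.HDFStore('store.h5', complevel=9, complib='blosc:zstd')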

If ``complib`` is defined as something other than the
listed libraries, a ``ValueError`` exception is issued.
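For example (illustrative only):

.. code-block:: python

   # Raises ValueError: complib only supports the libraries listed above
   pd.HDFStore('store.h5', complevel=1, complib='nonexistent')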

-- Pass ``complevel=int`` for a compression level (1-9, with 0 being no
-  compression, and the default)
-- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for
-  whichever compression library you prefer.
.. note::

-   ``HDFStore`` will use the file based compression scheme if no overriding
-   ``complib`` or ``complevel`` options are provided. ``blosc`` offers very
-   fast compression, and is my most used. Note that ``lzo`` and ``bzip2``
-   may not be installed (by Python) by default.
+   If the library specified with the ``complib`` option is missing on your platform,
+   compression defaults to ``zlib`` without further ado.

-Compression for all objects within the file
+Enable compression for all objects within the file:

.. code-block:: python

-   store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc')
+   store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc:blosclz')

-Or on-the-fly compression (this only applies to tables). You can turn
-off file compression for a specific table by passing ``complevel=0``
+Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:

.. code-block:: python

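   # The diff view collapses the remaining lines of this example; a
   # plausible sketch of per-table, on-the-fly compression (assumed,
   # not the file's verbatim content):
   store.append('df', df, complib='zlib', complevel=5)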
3 changes: 1 addition & 2 deletions doc/source/whatsnew/v0.21.0.txt
@@ -46,13 +46,12 @@ Backwards incompatible API changes

- Support has been dropped for Python 3.4 (:issue:`15251`)
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)

- Accessing a non-existent attribute on a closed :class:`HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
- :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)

- :class:`pandas.HDFStore`'s string representation is now faster and less detailed. For the previous behavior, use ``pandas.HDFStore.info()``. (:issue:`16503`).
- Compression defaults in HDF stores now follow pytables standards. The default is no compression; if ``complib`` is missing and ``complevel`` > 0, ``zlib`` is used (:issue:`15943`)

.. _whatsnew_0210.api:

4 changes: 2 additions & 2 deletions pandas/core/generic.py
@@ -1278,10 +1278,10 @@ def to_hdf(self, path_or_buf, key, **kwargs):
<http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns>`__.

Applicable only to format='table'.
-complevel : int, 0-9, default 0
+complevel : int, 0-9, default None
Specifies a compression level for data.
A value of 0 disables compression.
-complib : {'zlib', 'lzo', 'bzip2', 'blosc', None}, default None
+complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
Specifies the compression library to be used.
As of v0.20.2 these additional compressors for Blosc are supported
(default if no compressor specified: 'blosc:blosclz'):
16 changes: 8 additions & 8 deletions pandas/io/pytables.py
@@ -411,10 +411,10 @@ class HDFStore(StringMixin):
and if the file does not exist it is created.
``'r+'``
It is similar to ``'a'``, but the file must already exist.
-complevel : int, 0-9, default 0
+complevel : int, 0-9, default None
Specifies a compression level for data.
A value of 0 disables compression.
-complib : {'zlib', 'lzo', 'bzip2', 'blosc', None}, default None
+complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
Specifies the compression library to be used.
As of v0.20.2 these additional compressors for Blosc are supported
(default if no compressor specified: 'blosc:blosclz'):
@@ -449,12 +449,15 @@ def __init__(self, path, mode=None, complevel=None, complib=None,
"complib only supports {libs} compression.".format(
libs=tables.filters.all_complibs))

if complib is None and complevel is not None:
complib = tables.filters.default_complib

self._path = _stringify_path(path)
if mode is None:
mode = 'a'
self._mode = mode
self._handle = None
self._complevel = complevel
self._complevel = complevel if complevel else 0
self._complib = complib
self._fletcher32 = fletcher32
self._filters = None
@@ -566,11 +569,8 @@ def open(self, mode='a', **kwargs):
        if self.is_open:
            self.close()

-        if self._complib is not None:
-            if self._complevel is None:
-                self._complevel = 9
-            self._filters = _tables().Filters(self._complevel,
-                                              self._complib,
+        if self._complevel and self._complevel > 0:
+            self._filters = _tables().Filters(self._complevel, self._complib,
                                               fletcher32=self._fletcher32)

try:
53 changes: 53 additions & 0 deletions pandas/tests/io/test_pytables.py
@@ -736,6 +736,59 @@ def test_put_compression_blosc(self):
            store.put('c', df, format='table', complib='blosc')
            tm.assert_frame_equal(store['c'], df)

    def test_complibs_default_settings(self):
        # GH15943
        df = tm.makeDataFrame()

        # Set complevel and check if complib is automatically set to
        # default value
        with ensure_clean_path(self.path) as tmpfile:
            df.to_hdf(tmpfile, 'df', complevel=9)
            result = pd.read_hdf(tmpfile, 'df')
            tm.assert_frame_equal(result, df)

            with tables.open_file(tmpfile, mode='r') as h5file:
                for node in h5file.walk_nodes(where='/df', classname='Leaf'):
                    assert node.filters.complevel == 9
                    assert node.filters.complib == 'zlib'

        # Set complib and check to see if compression is disabled
        with ensure_clean_path(self.path) as tmpfile:
Contributor (review comment):

This doesn't make sense: you are specifying a complib, yet no compression? Again, this is confusing.

Contributor (author):

I added the 3 cases where at least one of the parameters complib and complevel is missing. Setting complib and leaving complevel unset is one of those. I think these are good tests to see if the defaults behave as we discussed.

I did not add tests to see what would happen if I set the parameters to illegal values, e.g. what happens if complevel is negative?

I also added the test case where I override the file-wide defaults.

Contributor (review comment):

Add some tests for illegal values, though PyTables should actually validate these.

            df.to_hdf(tmpfile, 'df', complib='zlib')
            result = pd.read_hdf(tmpfile, 'df')
            tm.assert_frame_equal(result, df)

            with tables.open_file(tmpfile, mode='r') as h5file:
                for node in h5file.walk_nodes(where='/df', classname='Leaf'):
                    assert node.filters.complevel == 0
                    assert node.filters.complib is None

        # Check if not setting complib or complevel results in no compression
        with ensure_clean_path(self.path) as tmpfile:
            df.to_hdf(tmpfile, 'df')
            result = pd.read_hdf(tmpfile, 'df')
            tm.assert_frame_equal(result, df)

            with tables.open_file(tmpfile, mode='r') as h5file:
                for node in h5file.walk_nodes(where='/df', classname='Leaf'):
                    assert node.filters.complevel == 0
                    assert node.filters.complib is None

        # Check if file-defaults can be overridden on a per table basis
        with ensure_clean_path(self.path) as tmpfile:
            store = pd.HDFStore(tmpfile)
            store.append('dfc', df, complevel=9, complib='blosc')
            store.append('df', df)
            store.close()

            with tables.open_file(tmpfile, mode='r') as h5file:
                for node in h5file.walk_nodes(where='/df', classname='Leaf'):
                    assert node.filters.complevel == 0
                    assert node.filters.complib is None
                for node in h5file.walk_nodes(where='/dfc', classname='Leaf'):
                    assert node.filters.complevel == 9
                    assert node.filters.complib == 'blosc'
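
    # Hypothetical sketch (not part of this PR): the illegal-value tests
    # requested in review. PyTables rejects a complevel outside the 0-9
    # range, and HDFStore rejects an unknown complib; both raise ValueError.
    def test_complibs_illegal_values(self):
        df = tm.makeDataFrame()
        with ensure_clean_path(self.path) as tmpfile:
            # complevel above the supported 0-9 range
            with pytest.raises(ValueError):
                df.to_hdf(tmpfile, 'df', complevel=11)
            # a compression library PyTables does not know about
            with pytest.raises(ValueError):
                df.to_hdf(tmpfile, 'df', complib='foolib')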

    def test_complibs(self):
        # GH14478
        df = tm.makeDataFrame()