Commit f433061

ENH: add to/from_parquet with pyarrow & fastparquet (#15838)

1 parent 3ed51c2

20 files changed, +703 −12 lines

ci/install_travis.sh (+1)

@@ -153,6 +153,7 @@ fi
 echo
 echo "[removing installed pandas]"
 conda remove pandas -y --force
+pip uninstall -y pandas

 if [ "$BUILD_TEST" ]; then

ci/requirements-2.7.sh (+1 −1)

@@ -4,4 +4,4 @@ source activate pandas

 echo "install 27"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

ci/requirements-3.5.sh (+2 −2)

@@ -4,8 +4,8 @@ source activate pandas

 echo "install 35"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
-
 # pip install python-dateutil to get latest
 conda remove -n pandas python-dateutil --force
 pip install python-dateutil
+
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

ci/requirements-3.5_OSX.sh (+1 −1)

@@ -4,4 +4,4 @@ source activate pandas

 echo "install 35_OSX"

-conda install -n pandas -c conda-forge feather-format==0.3.1
+conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet

ci/requirements-3.6.pip (+1)

@@ -0,0 +1 @@
+brotlipy

ci/requirements-3.6.run (+2)

@@ -17,6 +17,8 @@ pymysql
 feather-format
 pyarrow
 psycopg2
+python-snappy
+fastparquet
 beautifulsoup4
 s3fs
 xarray

ci/requirements-3.6_DOC.sh (+1 −1)

@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"

 pip install pandas-gbq

-conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc
+conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc fastparquet

 conda install -n pandas -c r r rpy2 --yes

ci/requirements-3.6_WIN.run (+2)

@@ -13,3 +13,5 @@ numexpr
 pytables
 matplotlib
 blosc
+fastparquet
+pyarrow

doc/source/install.rst (+1)

@@ -237,6 +237,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* ``Apache Parquet Format``, either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6), necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

 * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL

doc/source/io.rst (+78 −4)

@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
    binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
    binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
    binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+   binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
    binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
    binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
    binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -209,7 +210,7 @@ buffer_lines : int, default None
   .. deprecated:: 0.19.0

      Argument removed because its value is not respected by the parser
-
+
 compact_ints : boolean, default False
   .. deprecated:: 0.19.0

@@ -4087,7 +4088,7 @@ control compression: ``complevel`` and ``complib``.
 ``complevel`` specifies if and how hard data is to be compressed.
 ``complevel=0`` and ``complevel=None`` disables
 compression and ``0<complevel<10`` enables compression.
-
+
 ``complib`` specifies which compression library to use. If nothing is
 specified the default library ``zlib`` is used. A
 compression library usually optimizes for either good
@@ -4102,9 +4103,9 @@ control compression: ``complevel`` and ``complib``.
 - `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

 .. versionadded:: 0.20.2
-
+
    Support for alternative blosc compressors:
-
+
    - `blosc:blosclz <http://www.blosc.org/>`_ This is the
      default compressor for ``blosc``
    - `blosc:lz4
@@ -4545,6 +4546,79 @@ Read from a feather file.
    import os
    os.remove('example.feather')

+.. _io.parquet:
+
+Parquet
+-------
+
+.. versionadded:: 0.21.0
+
+`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
+make reading and writing data frames efficient, and to make sharing data across data analysis
+languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
+while still maintaining good read performance.
+
+Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
+dtypes, including extension dtypes such as datetime with tz.
+
+Several caveats:
+
+- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index(drop=True)`` in order to ignore the index.
+- Duplicate column names and non-string column names are not supported.
+- Categorical dtypes are currently not supported (for ``pyarrow``).
+- Unsupported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``.
+If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
+then ``pyarrow`` is tried, falling back to ``fastparquet``.
+
+See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+.. note::
+
+   These engines are very similar and should read/write nearly identical parquet format files.
+   These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a C library).
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.date_range('20130101', periods=3),
+                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'h': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a parquet file.
+
+.. ipython:: python
+
+   df.to_parquet('example_pa.parquet', engine='pyarrow')
+   df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+Read from a parquet file.
+
+.. ipython:: python
+
+   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
+
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example_pa.parquet')
+   os.remove('example_fp.parquet')
+
 .. _io.sql:

 SQL Queries
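
A usage note on the index caveat documented above: a minimal sketch (the frame and file name are hypothetical) of flattening a non-default index before writing, since the format refuses to serialize one:

    import pandas as pd

    # hypothetical frame with a non-default (string) index
    df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

    # to_parquet would raise here because the index is non-default;
    # drop it first, as the caveat above suggests
    df.reset_index(drop=True).to_parquet('indexed.parquet', engine='pyarrow')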

doc/source/options.rst (+3)

@@ -414,6 +414,9 @@ io.hdf.default_format None default format writing format,
                                  'table'
 io.hdf.dropna_table       True   drop ALL nan rows when appending
                                  to a table
+io.parquet.engine         None   The engine to use as a default for
+                                 parquet reading and writing. If None
+                                 then try 'pyarrow' and 'fastparquet'
 mode.chained_assignment   warn   Raise an exception, warn, or no
                                  action if trying to use chained
                                  assignment, The default is warn
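
For illustration, a small sketch of how the new option steers the readers and writers, assuming both engines are installed (file name hypothetical):

    import pandas as pd

    # make fastparquet the default engine for read_parquet/to_parquet
    pd.set_option('io.parquet.engine', 'fastparquet')

    df = pd.DataFrame({'a': [1, 2, 3]})
    df.to_parquet('opts_demo.parquet')    # uses fastparquet
    pd.read_parquet('opts_demo.parquet')  # ditto, unless engine= overrides it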

doc/source/whatsnew/v0.21.0.txt (+1)

@@ -78,6 +78,7 @@ Other Enhancements
 - :func:`DataFrame.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)
 - :func:`date_range` now accepts 'YS' in addition to 'AS' as an alias for start of year (:issue:`9313`)
 - :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
+- Integration with Apache Parquet, including a new top-level ``pd.read_parquet()`` function and ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.

 .. _whatsnew_0210.api_breaking:
pandas/core/config_init.py (+12)

@@ -465,3 +465,15 @@ def _register_xlsx(engine, other):
 except ImportError:
     # fallback
     _register_xlsx('openpyxl', 'xlsxwriter')
+
+# Set up the io.parquet specific configuration.
+parquet_engine_doc = """
+: string
+    The default parquet reader/writer engine. Available options:
+    'auto', 'pyarrow', 'fastparquet', the default is 'auto'
+"""
+
+with cf.config_prefix('io.parquet'):
+    cf.register_option(
+        'engine', 'auto', parquet_engine_doc,
+        validator=is_one_of_factory(['auto', 'pyarrow', 'fastparquet']))
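
A sketch of the validation this registration buys; the rejection behavior comes from pandas' generic config machinery rather than this commit, so the exact error text is an assumption:

    import pandas as pd

    pd.set_option('io.parquet.engine', 'pyarrow')  # accepted by the validator
    try:
        # 'parquet-cpp' is a deliberately invalid value for illustration
        pd.set_option('io.parquet.engine', 'parquet-cpp')
    except ValueError as err:
        print(err)  # is_one_of_factory rejects values outside the listed three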

pandas/core/frame.py (+24)

@@ -1598,6 +1598,30 @@ def to_feather(self, fname):
         from pandas.io.feather_format import to_feather
         to_feather(self, fname)

+    def to_parquet(self, fname, engine='auto', compression='snappy',
+                   **kwargs):
+        """
+        Write a DataFrame to the binary parquet format.
+
+        .. versionadded:: 0.21.0
+
+        Parameters
+        ----------
+        fname : str
+            string file path
+        engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
+            Parquet library to use. If 'auto', the option
+            ``io.parquet.engine`` is used; if that is also 'auto',
+            'pyarrow' is tried first, falling back to 'fastparquet'.
+        compression : str, optional, default 'snappy'
+            compression method, includes {'gzip', 'snappy', 'brotli'}
+        kwargs
+            Additional keyword arguments passed to the engine
+        """
+        from pandas.io.parquet import to_parquet
+        to_parquet(self, fname, engine,
+                   compression=compression, **kwargs)
+
     @Substitution(header='Write out column names. If a list of string is given, \
 it is assumed to be aliases for the column names')
     @Appender(fmt.docstring_to_string, indents=1)
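
A round-trip sketch of the new method (file name hypothetical; requires pyarrow >= 0.4.1 or fastparquet >= 0.0.6 as documented above):

    import pandas as pd

    df = pd.DataFrame({'a': list('abc'), 'b': [1, 2, 3]})

    # snappy is the default; gzip and brotli are the other documented choices
    df.to_parquet('frame_demo.parquet', engine='pyarrow', compression='gzip')

    result = pd.read_parquet('frame_demo.parquet', engine='pyarrow')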

pandas/io/api.py (+1)

@@ -13,6 +13,7 @@
 from pandas.io.sql import read_sql, read_sql_table, read_sql_query
 from pandas.io.sas import read_sas
 from pandas.io.feather_format import read_feather
+from pandas.io.parquet import read_parquet
 from pandas.io.stata import read_stata
 from pandas.io.pickle import read_pickle, to_pickle
 from pandas.io.packers import read_msgpack, to_msgpack

pandas/io/feather_format.py (+2 −2)

@@ -19,7 +19,7 @@ def _try_import():
                           "you can install via conda\n"
                           "conda install feather-format -c conda-forge\n"
                           "or via pip\n"
-                          "pip install feather-format\n")
+                          "pip install -U feather-format\n")

     try:
         feather.__version__ >= LooseVersion('0.3.1')
@@ -29,7 +29,7 @@ def _try_import():
                           "you can install via conda\n"
                           "conda install feather-format -c conda-forge"
                           "or via pip\n"
-                          "pip install feather-format\n")
+                          "pip install -U feather-format\n")

     return feather
