ARROW-356: Add documentation about reading Parquet #193
Changes from all commits
b5b4df5
744202a
0467e0e
06b2f9c
530484f
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| .. Licensed to the Apache Software Foundation (ASF) under one | ||
| .. or more contributor license agreements. See the NOTICE file | ||
| .. distributed with this work for additional information | ||
| .. regarding copyright ownership. The ASF licenses this file | ||
| .. to you under the Apache License, Version 2.0 (the | ||
| .. "License"); you may not use this file except in compliance | ||
| .. with the License. You may obtain a copy of the License at | ||
|
|
||
| .. http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| .. Unless required by applicable law or agreed to in writing, | ||
| .. software distributed under the License is distributed on an | ||
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| .. KIND, either express or implied. See the License for the | ||
| .. specific language governing permissions and limitations | ||
| .. under the License. | ||
|
|
||
| Install PyArrow | ||
| =============== | ||
|
|
||
| Conda | ||
| ----- | ||
|
|
||
| To install the latest version of PyArrow from conda-forge using conda: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| conda install -c conda-forge pyarrow | ||
|
|
||
| Pip | ||
| --- | ||
|
|
||
| Install the latest version from PyPI: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| pip install pyarrow | ||
|
|
||
| .. note:: | ||
| Currently there are only binary artifacts available for Linux and macOS. | ||
| On other platforms, this will only pull the Python sources and assumes an existing | ||
| installation of the C++ part of Arrow. | ||
| To retrieve the binary artifacts, you'll need a recent ``pip`` version that | ||
| supports features like the ``manylinux1`` tag. | ||
|
|
||
| Building from source | ||
| -------------------- | ||
|
|
||
| First, clone the master git repository: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| git clone https://github.com/apache/arrow.git arrow | ||
|
|
||
| System requirements | ||
| ~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Building pyarrow requires: | ||
|
|
||
| * A C++11 compiler | ||
|
|
||
| * Linux: gcc >= 4.8 or clang >= 3.5 | ||
| * OS X: XCode 6.4 or higher preferred | ||
|
|
||
| * `CMake <https://cmake.org/>`_ | ||
|
|
||
| Python requirements | ||
| ~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases | ||
| are not being targeted. | ||
|
|
||
| .. note:: | ||
| This library targets CPython only due to an emphasis on interoperability with | ||
| pandas and NumPy, which are only available for CPython. | ||
|
|
||
| The build requires NumPy, Cython, and a few other Python dependencies: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| pip install cython | ||
| cd arrow/python | ||
| pip install -r requirements.txt | ||
|
|
||
| Installing Arrow C++ library | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| First, you should choose an installation location for Arrow C++. In the future | ||
| using the default system install location will work, but for now we are being | ||
| explicit: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| export ARROW_HOME=$HOME/local | ||
|
|
||
| Now, we build Arrow: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| cd arrow/cpp | ||
|
|
||
| mkdir dev-build | ||
| cd dev-build | ||
|
|
||
| cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. | ||
|
|
||
| make | ||
|
|
||
| # Use sudo here if $ARROW_HOME requires it | ||
| make install | ||
|
|
||
| To get the optional Parquet support, you should also build and install | ||
| `parquet-cpp <https://github.com/apache/parquet-cpp/blob/master/README.md>`_. | ||
|
|
||
| Install `pyarrow` | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
|
|
||
| .. code-block:: bash | ||
|
|
||
| cd arrow/python | ||
|
|
||
| # --with-parquet enables the Apache Parquet support in PyArrow | ||
| # --build-type=release disables debugging information and turns on | ||
| # compiler optimizations for native code | ||
| python setup.py build_ext --with-parquet --build-type=release install | ||
| python setup.py install | ||
|
|
||
| .. warning:: | ||
| On XCode 6 and prior there are some known OS X `@rpath` issues. If you are | ||
| unable to import pyarrow, upgrading XCode may be the solution. | ||
|
|
||
| .. note:: | ||
| In development installations, you will also need to set a correct | ||
| ``LD_LIBRARY_PATH``. This is typically done with | ||
| ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``. | ||
|
|
||
|
|
||
| .. code-block:: python | ||
|
|
||
| In [1]: import pyarrow | ||
|
|
||
| In [2]: pyarrow.from_pylist([1,2,3]) | ||
| Out[2]: | ||
| <pyarrow.array.Int64Array object at 0x7f899f3e60e8> | ||
| [ | ||
| 1, | ||
| 2, | ||
| 3 | ||
| ] | ||
|
|
||
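The note about ``LD_LIBRARY_PATH`` in the file above can be checked before trying to import pyarrow. The following is a minimal sketch, not part of the PR: the helper name `arrow_lib_on_path` is hypothetical, and it assumes a Linux-style colon-separated ``LD_LIBRARY_PATH`` and that the libraries were installed to ``$ARROW_HOME/lib`` as in the build steps.

```python
import os
import posixpath


def arrow_lib_on_path(env):
    """Return True if $ARROW_HOME/lib is listed in LD_LIBRARY_PATH.

    `env` is any mapping of environment variables, e.g. os.environ.
    Hypothetical helper for illustration; it only inspects the variables
    and does not try to load any library.
    """
    arrow_home = env.get("ARROW_HOME")
    if not arrow_home:
        return False
    expected = posixpath.normpath(posixpath.join(arrow_home, "lib"))
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return any(posixpath.normpath(entry) == expected for entry in entries if entry)


if __name__ == "__main__":
    print(arrow_lib_on_path(os.environ))
```

If this prints ``False`` in a development installation, exporting ``LD_LIBRARY_PATH`` as described in the note is a likely first thing to try.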
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| .. Licensed to the Apache Software Foundation (ASF) under one | ||
| .. or more contributor license agreements. See the NOTICE file | ||
| .. distributed with this work for additional information | ||
| .. regarding copyright ownership. The ASF licenses this file | ||
| .. to you under the Apache License, Version 2.0 (the | ||
| .. "License"); you may not use this file except in compliance | ||
| .. with the License. You may obtain a copy of the License at | ||
|
|
||
| .. http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| .. Unless required by applicable law or agreed to in writing, | ||
| .. software distributed under the License is distributed on an | ||
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| .. KIND, either express or implied. See the License for the | ||
| .. specific language governing permissions and limitations | ||
| .. under the License. | ||
|
|
||
| Pandas Interface | ||
| ================ | ||
|
|
||
| To interface with Pandas, PyArrow provides various conversion routines to | ||
| consume Pandas structures and convert back to them. | ||
|
|
||
| DataFrames | ||
| ---------- | ||
|
|
||
| The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`. | ||
| Both consist of a set of named columns of equal length. While Pandas only | ||
| supports flat columns, the Table also provides nested columns. It can therefore | ||
| represent more data than a DataFrame, so a full conversion is not always possible. | ||
|
|
||
| Conversion from a Table to a DataFrame is done by calling | ||
| :meth:`pyarrow.table.Table.to_pandas`. The inverse is achieved by using | ||
| :meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the | ||
| convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of | ||
| different resolutions, Pandas only supports nanosecond timestamps and most | ||
| other systems (e.g. Parquet) only work on millisecond timestamps. This parameter | ||
| can be used to perform the conversion to milliseconds during the Pandas to Arrow | ||
| conversion. | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import pyarrow as pa | ||
| import pandas as pd | ||
|
|
||
| df = pd.DataFrame({"a": [1, 2, 3]}) | ||
| # Convert from Pandas to Arrow | ||
| table = pa.from_pandas_dataframe(df) | ||
| # Convert back to Pandas | ||
| df_new = table.to_pandas() | ||
|
|
||
|
|
||
| Series | ||
| ------ | ||
|
|
||
| In Arrow, the most similar structure to a Pandas Series is an Array. | ||
| It is a vector that contains data of the same type in linear memory. You can | ||
| convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. | ||
| As Arrow Arrays are always nullable, you can supply an optional mask using | ||
| the ``mask`` parameter to mark all null entries. | ||
|
|
||
| Type differences | ||
| ---------------- | ||
|
|
||
| With the current design of Pandas and Arrow, it is not possible to convert all | ||
| column types unmodified. One of the main issues here is that Pandas has no | ||
| support for nullable columns of arbitrary type. Also, ``datetime64`` is currently | ||
| fixed to nanosecond resolution. On the other hand, Arrow might still be missing | ||
| support for some types. | ||
|
|
||
| Pandas -> Arrow Conversion | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| +------------------------+--------------------------+ | ||
| | Source Type (Pandas) | Destination Type (Arrow) | | ||
| +========================+==========================+ | ||
| | ``bool`` | ``BOOL`` | | ||
| +------------------------+--------------------------+ | ||
| | ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` | | ||
| +------------------------+--------------------------+ | ||
| | ``float32`` | ``FLOAT`` | | ||
| +------------------------+--------------------------+ | ||
| | ``float64`` | ``DOUBLE`` | | ||
| +------------------------+--------------------------+ | ||
| | ``str`` / ``unicode`` | ``STRING`` | | ||
| +------------------------+--------------------------+ | ||
| | ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` | | ||
| +------------------------+--------------------------+ | ||
| | ``pd.Categorical`` | *not supported* | | ||
| +------------------------+--------------------------+ | ||
|
|
||
| Arrow -> Pandas Conversion | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | Source Type (Arrow) | Destination Type (Pandas) | | ||
| +=====================================+========================================================+ | ||
| | ``BOOL`` | ``bool`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``FLOAT`` | ``float32`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``DOUBLE`` | ``float64`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``STRING`` | ``str`` | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
| | ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | | ||
| +-------------------------------------+--------------------------------------------------------+ | ||
|
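Two of the conversions described in the file above can be illustrated without pyarrow. The sketch below uses plain Python lists and is not the library's implementation: the function names are hypothetical, and it assumes that reducing nanosecond timestamps to milliseconds floors the value (the diff does not say whether ``timestamps_to_ms`` rounds or truncates). It also mimics the "with nulls" rows of the Arrow -> Pandas table, where a boolean mask marks the null entries.

```python
import math

NS_PER_MS = 1_000_000


def ns_to_ms(ns_values):
    """Reduce nanosecond epoch timestamps to millisecond resolution.

    Assumption: sub-millisecond digits are dropped by floor division.
    """
    return [ns // NS_PER_MS for ns in ns_values]


def ints_with_nulls_to_floats(values, mask):
    """Mimic the INT *with nulls* -> float64 row.

    Integers have no spare value for "missing", so every slot where
    mask is True becomes NaN and the rest are promoted to float.
    """
    return [math.nan if is_null else float(v) for v, is_null in zip(values, mask)]


def bools_with_nulls_to_objects(values, mask):
    """Mimic the BOOL *with nulls* -> object row (True, False, None)."""
    return [None if is_null else bool(v) for v, is_null in zip(values, mask)]


if __name__ == "__main__":
    print(ns_to_ms([1_500_000_000_123_456_789]))
    print(ints_with_nulls_to_floats([1, 2, 3], [False, True, False]))
    print(bools_with_nulls_to_objects([True, False, True], [False, False, True]))
```

The promotion to floating point is why integer columns containing nulls do not survive a round trip through Pandas with their original type.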
Member
Yikes, do you have a tool to do this? We could also use the csv-table directive
Member
Author
I have Probably once someone touches this, we could use the
Revise this to add the `python setup.py build_ext --with-parquet install`. You may also want to mention `--build-type=release`.