101 changes: 0 additions & 101 deletions python/doc/INSTALL.md

This file was deleted.

16 changes: 9 additions & 7 deletions python/doc/index.rst
Expand Up @@ -31,14 +31,16 @@ additional functionality such as reading Apache Parquet files into Arrow
structures.

.. toctree::
:maxdepth: 4
:hidden:
:maxdepth: 2
:caption: Getting Started

Installing pyarrow <install.rst>
Pandas <pandas.rst>
Module Reference <modules.rst>

Indices and tables
==================
.. toctree::
:maxdepth: 2
:caption: Additional Features

Parquet format <parquet.rst>

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
151 changes: 151 additions & 0 deletions python/doc/install.rst
@@ -0,0 +1,151 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Install PyArrow
===============

Conda
-----

To install the latest version of PyArrow from conda-forge using conda:

.. code-block:: bash

conda install -c conda-forge pyarrow

Pip
---

Install the latest version from PyPI:

.. code-block:: bash

pip install pyarrow

.. note::
   Binary artifacts are currently available only for Linux and macOS. On other
   platforms, ``pip`` pulls only the Python sources and assumes an existing
   installation of the C++ part of Arrow.
   To retrieve the binary artifacts, you need a recent ``pip`` version that
   supports features such as the ``manylinux1`` tag.

Building from source
--------------------

First, clone the master git repository:

.. code-block:: bash

git clone https://github.com/apache/arrow.git arrow

System requirements
~~~~~~~~~~~~~~~~~~~

Building pyarrow requires:

* A C++11 compiler

* Linux: gcc >= 4.8 or clang >= 3.5
* OS X: XCode 6.4 or higher preferred

* `CMake <https://cmake.org/>`_

Python requirements
~~~~~~~~~~~~~~~~~~~

You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases
are not being targeted.

.. note::
This library targets CPython only due to an emphasis on interoperability with
pandas and NumPy, which are only available for CPython.

The build requires NumPy, Cython, and a few other Python dependencies:

.. code-block:: bash

pip install cython
cd arrow/python
pip install -r requirements.txt

Installing Arrow C++ library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, you should choose an installation location for Arrow C++. In the future
using the default system install location will work, but for now we are being
explicit:

.. code-block:: bash

export ARROW_HOME=$HOME/local

Now, we build Arrow:

.. code-block:: bash

cd arrow/cpp

mkdir dev-build
cd dev-build

cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..

make

# Use sudo here if $ARROW_HOME requires it
make install

To get the optional Parquet support, you should also build and install
`parquet-cpp <https://github.com/apache/parquet-cpp/blob/master/README.md>`_.

Install `pyarrow`
~~~~~~~~~~~~~~~~~


.. code-block:: bash

cd arrow/python

# --with-parquet enables the Apache Parquet support in PyArrow
# --build-type=release disables debugging information and turns on
# compiler optimizations for native code
python setup.py build_ext --with-parquet --build-type=release install
python setup.py install
Member


Revise this to add the python setup.py build_ext --with-parquet install. you may also want to mention --build-type=release


.. warning::
On XCode 6 and prior there are some known OS X `@rpath` issues. If you are
unable to import pyarrow, upgrading XCode may be the solution.

.. note::
   In development installations, you will also need to set a correct
   ``LD_LIBRARY_PATH``. This is most likely done with
   ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``.


.. code-block:: python

In [1]: import pyarrow

In [2]: pyarrow.from_pylist([1,2,3])
Out[2]:
<pyarrow.array.Int64Array object at 0x7f899f3e60e8>
[
1,
2,
3
]

114 changes: 114 additions & 0 deletions python/doc/pandas.rst
@@ -0,0 +1,114 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Pandas Interface
================

To interface with Pandas, PyArrow provides various conversion routines to
consume Pandas structures and convert back to them.

DataFrames
----------

The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
Both consist of a set of named columns of equal length. Pandas supports only
flat columns, whereas a Table can also hold nested columns. A Table can
therefore represent more data than a DataFrame, so a full conversion is not
always possible.

Conversion from a Table to a DataFrame is done by calling
:meth:`pyarrow.table.Table.to_pandas`. The inverse is achieved by using
:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the
convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of
different resolutions, Pandas only supports nanosecond timestamps and most
other systems (e.g. Parquet) only work on millisecond timestamps. Setting this
parameter performs the conversion to milliseconds during the Pandas to Arrow
conversion, so no separate step is needed.

.. code-block:: python

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from Pandas to Arrow
table = pa.from_pandas_dataframe(df)
# Convert back to Pandas
df_new = table.to_pandas()
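
To make the resolution difference concrete, the following plain-Python sketch
shows the arithmetic involved in going from nanosecond to millisecond epoch
timestamps. It does not call pyarrow. The helper ``ns_to_ms`` is a hypothetical
name used only for illustration, and truncation (rather than rounding) is an
assumption, so the behaviour of ``timestamps_to_ms`` itself may differ.

```python
# Illustration only: ns_to_ms is a hypothetical helper, not a pyarrow function.
NANOS_PER_MILLI = 1_000_000


def ns_to_ms(ns_values):
    """Truncate nanosecond epoch timestamps to millisecond resolution."""
    # Assumption: sub-millisecond digits are dropped, not rounded.
    return [ns // NANOS_PER_MILLI for ns in ns_values]


# Two nanosecond timestamps shortly after 2017-01-01T00:00:00 UTC.
timestamps_ns = [1_483_228_800_123_456_789, 1_483_228_800_999_999_999]
timestamps_ms = ns_to_ms(timestamps_ns)
print(timestamps_ms)
```

Everything below one millisecond is lost in this step, which is why the
conversion is opt-in rather than automatic.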


Series
------

In Arrow, the most similar structure to a Pandas Series is an Array.
It is a vector that holds data of a single type in linear memory. You can
convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
Because Arrow Arrays are always nullable, you can supply an optional mask using
the ``mask`` parameter to mark all null entries.
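
As a conceptual sketch of what such a mask expresses, the plain-Python snippet
below pairs a list of values with a boolean mask. It does not call pyarrow.
``apply_null_mask`` is a hypothetical helper, and the convention that ``True``
marks a null entry is an assumption made for this illustration.

```python
# Illustration only: apply_null_mask is a hypothetical helper, not pyarrow API.
def apply_null_mask(values, mask):
    """Return the values with None wherever the mask flags a null entry."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    # Assumption: True in the mask means "this entry is null".
    return [None if is_null else value for value, is_null in zip(values, mask)]


values = [1.5, 2.5, 3.5]
mask = [False, True, False]
masked = apply_null_mask(values, mask)
print(masked)
```

The mask carries the null information separately from the data, so the values
themselves do not need a sentinel such as ``NaN``.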

Type differences
----------------

With the current design of Pandas and Arrow, it is not possible to convert all
column types unmodified. One of the main issues here is that Pandas has no
support for nullable columns of arbitrary type. Also, ``datetime64`` is currently
fixed to nanosecond resolution. On the other hand, Arrow may still be missing
support for some types.

Pandas -> Arrow Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~

+------------------------+--------------------------+
| Source Type (Pandas) | Destination Type (Arrow) |
+========================+==========================+
| ``bool`` | ``BOOL`` |
+------------------------+--------------------------+
| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` |
+------------------------+--------------------------+
| ``float32`` | ``FLOAT`` |
+------------------------+--------------------------+
| ``float64`` | ``DOUBLE`` |
+------------------------+--------------------------+
| ``str`` / ``unicode`` | ``STRING`` |
+------------------------+--------------------------+
| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` |
+------------------------+--------------------------+
| ``pd.Categorical`` | *not supported* |
+------------------------+--------------------------+

Arrow -> Pandas Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------------------------------------+--------------------------------------------------------+
| Source Type (Arrow) | Destination Type (Pandas) |
+=====================================+========================================================+
| ``BOOL`` | ``bool`` |
+-------------------------------------+--------------------------------------------------------+
| ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) |
+-------------------------------------+--------------------------------------------------------+
| ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` |
+-------------------------------------+--------------------------------------------------------+
| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` |
+-------------------------------------+--------------------------------------------------------+
| ``FLOAT`` | ``float32`` |
+-------------------------------------+--------------------------------------------------------+
| ``DOUBLE`` | ``float64`` |
+-------------------------------------+--------------------------------------------------------+
| ``STRING`` | ``str`` |
+-------------------------------------+--------------------------------------------------------+
| ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) |
+-------------------------------------+--------------------------------------------------------+
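
The row for integers with nulls deserves a short illustration. The plain-Python
sketch below mimics the idea of representing missing integers as ``NaN`` in a
floating-point column and shows the precision limit that follows. It does not
call pyarrow or Pandas, and ``ints_with_nulls_to_floats`` is a hypothetical
helper used only for this example.

```python
import math


# Illustration only: a hypothetical helper, not a pyarrow or Pandas function.
def ints_with_nulls_to_floats(values):
    """Convert integers to floats, using NaN to stand in for None."""
    return [math.nan if value is None else float(value) for value in values]


column = [1, None, 3]
converted = ints_with_nulls_to_floats(column)
print(converted)

# A 64-bit float has a 53-bit significand, so larger integers may not
# survive the round trip unchanged.
big = 2**53 + 1
round_tripped = int(float(big))
print(big, round_tripped, big == round_tripped)
```

Small integers convert exactly, but values beyond 2**53 in magnitude can
change silently once they are stored as ``float64``.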
Member


Yikes, do you have a tool to do this? We could also use the csv-table directive

Member Author


I have vim :)

Probably once someone touches this, we could use the csv-table. But I like this variant as it is also readable without being rendered.

