
ARROW-1639: [Python] Serialize RangeIndex as metadata via Table.from_pandas instead of converting to a column of integers #3868


Closed
wants to merge 4 commits

Conversation

@wesm wesm commented Mar 11, 2019

This ended up being much more difficult than anticipated due to the spaghetti-like state of pyarrow/pandas_compat.py (the result of many accumulated hacks).

This is partly a performance and memory-use optimization. It does have consequences, though: tables concatenated from multiple pandas DataFrame objects that were converted to Arrow will have some index data discarded. I think this is acceptable, since the preservation of pandas indexes is generally handled at the granularity of a single DataFrame. One always has the option of calling reset_index to convert a RangeIndex into a regular column if that's what is desired, as in the sketch below.
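
A minimal sketch of the reset_index escape hatch (the column names are just the pandas defaults, shown for illustration):

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': np.random.randn(5)})

# With this patch, the default RangeIndex is carried only in the pandas
# metadata, so no extra index column appears in the Arrow schema:
pa.Table.from_pandas(df).schema.names                 # ['a']

# To keep the index as real data (e.g. so it survives concatenation of
# multiple tables), materialize it as a column first:
pa.Table.from_pandas(df.reset_index()).schema.names   # ['index', 'a']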

This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in

pandas-dev/pandas#25672

'start': level._start,
'stop': level._stop,
'step': level._step
}
Member Author

@jreback @jorisvandenbossche is there a better way to get the start/stop/step for RangeIndex?

Member

As far as I know, we don't expose those publicly, so the private ones you are using here are the best pandas has to offer.
They have also been stable over the years, so I don't think it is a problem to use them. I don't really recall the history of not exposing them (maybe because we don't want end users to rely too much on the fact that it is special, but rather to see it as a memory-optimized integer index).
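
For reference, a tiny illustration of the attributes in question (the public accessors noted in the comment arrived in later pandas releases; see pandas-dev/pandas#25710 further down):

import pandas as pd

idx = pd.RangeIndex(0, 10, 1)

# Private attributes used in this patch (stable across releases at the time):
idx._start, idx._stop, idx._step   # (0, 10, 1)

# Later pandas versions expose public equivalents:
# idx.start, idx.stop, idx.step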

wesm commented Mar 11, 2019

Before:

size = 100_000_000
df = pd.DataFrame({'a': np.random.randn(size)})

>>> %timeit serialized = pa.serialize_pandas(df)
720 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df_result = pa.deserialize_pandas(serialized)
417 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

>>> %timeit serialized = pa.serialize_pandas(df)
381 ms ± 6.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df_result = pa.deserialize_pandas(serialized)
282 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cc @TomAugspurger @mrocklin in case of interest

wesm commented Mar 12, 2019

I'll add this to the ASV suite so we can track it. @xhochy I will need you to review this, as there were some changes required to the Parquet test suite.
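
Roughly what an ASV benchmark for this could look like (names and sizes are illustrative, not the actual benchmark added to the suite):

import numpy as np
import pandas as pd
import pyarrow as pa


class SerializePandasRangeIndex:

    def setup(self):
        # Smaller than the 100M-row example above to keep the benchmark quick
        self.df = pd.DataFrame({'a': np.random.randn(10_000_000)})
        self.serialized = pa.serialize_pandas(self.df)

    def time_serialize_pandas(self):
        pa.serialize_pandas(self.df)

    def time_deserialize_pandas(self):
        pa.deserialize_pandas(self.serialized)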

@xhochy xhochy (Member) left a comment

I fear that this will break backwards compatibility. We should first introduce some tests where we have hardcoded the pandas schema JSON and then ensure that we can still read old versions. We had some problems in the past where pyarrow refused to read old Parquet files due to breaking changes in the pandas metadata. There have been none recently, as we haven't touched this code for a while.

@@ -493,7 +493,7 @@ def test_recordbatchlist_to_pandas():

table = pa.Table.from_batches([batch1, batch2])
result = table.to_pandas()
data = pd.concat([data1, data2])
data = pd.concat([data1, data2]).reset_index(drop=True)
Member

Why is this now necessary and why did it work before?

Member Author

It worked before because the pandas indexes had been converted to data columns. This change represents the RangeIndex as metadata instead, so the round-tripped result comes back with a single fresh default index.
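
A minimal illustration of the index mismatch (made-up data, not the actual test case):

import pandas as pd

data1 = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
data2 = pd.DataFrame({'a': [4.0, 5.0, 6.0]})

# Plain concatenation keeps each frame's RangeIndex, so the labels repeat:
pd.concat([data1, data2]).index.tolist()
# [0, 1, 2, 0, 1, 2]

# With the RangeIndex stored only as metadata, the round-tripped table comes
# back with a single fresh default index, so the expected frame must be reset:
pd.concat([data1, data2]).reset_index(drop=True).index.tolist()
# [0, 1, 2, 3, 4, 5]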

types : List[pyarrow.DataType]

Returns
-------
dict
"""
num_serialized_index_levels = len([descr for descr in index_descriptors
if descr['kind'] == 'serialized'])
Member

'kind' is a new attribute that is not set in old metadata descriptions. Is this handled somewhere?

Member Author

I need to make a PR against pandas to resolve the issue of adding this. I have already implemented backwards compatibility for this, but I will add some unit tests to assert it.
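
One hedged sketch of how descriptors without 'kind' could be normalized for backwards compatibility (a hypothetical helper, not the code in this patch):

def _normalize_index_descriptor(descr):
    # Hypothetical helper: older metadata stored index_columns as plain field
    # names, while the new format uses dicts carrying a 'kind' key
    # ('serialized' or 'range').
    if isinstance(descr, str):
        return {'kind': 'serialized', 'field_name': descr}
    return descr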

wesm commented Mar 12, 2019

@xhochy I'll add some unit tests with hard-coded "old" metadata to ensure that old files can still be read correctly.

wesm commented Mar 12, 2019

By the way, the kind of compatibility we are interested in is forward compatibility:

  • Forward compatibility: new readers can read old files
  • Backward compatibility: old readers can read new files

I'll get this sorted out today

wesm commented Mar 12, 2019

I added the forward compatibility tests. I also added a version number to the metadata:

In [3]: size = 10                                                                                                                       

In [4]: df = pd.DataFrame({'a': np.random.randn(size)})                                                                                 

In [8]: json.loads(pa.Table.from_pandas(df).schema.metadata[b'pandas'])                                                                 
Out[8]: 
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 10,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'a',
   'field_name': 'a',
   'pandas_type': 'float64',
   'numpy_type': 'float64',
   'metadata': None}],
 'creator': {'library': 'pyarrow',
  'version': '0.12.1.dev381+g0ca1bfc58.d20190312'},
 'pandas_version': '0.23.4'}
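
For illustration, reconstructing the index purely from the 'range' descriptor shown above (a conceptual sketch of what to_pandas does with this metadata, not the actual implementation):

import pandas as pd

descr = {'kind': 'range', 'name': None, 'start': 0, 'stop': 10, 'step': 1}
index = pd.RangeIndex(descr['start'], descr['stop'], descr['step'],
                      name=descr['name'])
# RangeIndex(start=0, stop=10, step=1)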

wesm commented Mar 13, 2019

Appveyor build: https://ci.appveyor.com/project/wesm/arrow/builds/23025530

It would be good to merge this soon if it looks OK. I am working on ARROW-4637, which conflicts with these changes because of the refactoring I did in pyarrow.pandas_compat.

wesm commented Mar 13, 2019

I'll just rebase my WIP patch on top of this branch for now so I'm not blocked.

wesm commented Mar 13, 2019

I'm also working on a PR against pandas, in case we need to make alterations to the metadata.

@xhochy xhochy (Member) left a comment

+1, LGTM

Maybe @jorisvandenbossche or @TomAugspurger also want to take a look?

@TomAugspurger TomAugspurger (Contributor) left a comment

LGTM on a quick glance. Opened pandas-dev/pandas#25710 for the RangeIndex attribute issue, but as Joris said, I don't expect the private versions to change.

@wesm wesm closed this in 86f480a Mar 13, 2019
@wesm wesm deleted the ARROW-1639 branch March 13, 2019 17:17
wesm commented Mar 13, 2019

thanks everyone
