ARROW-1639: [Python] Serialize RangeIndex as metadata via Table.from_pandas instead of converting to a column of integers #3868
Conversation
'start': level._start,
'stop': level._stop,
'step': level._step
}
@jreback @jorisvandenbossche is there a better way to get the start/stop/step for RangeIndex?
As far as I know, we don't expose those publicly, so the private ones you are using here are the best pandas has to offer.
They have also been stable over the years, so I don't think it is a problem to use them. I don't really recall the history of not exposing them (maybe because we don't really want end users to rely too much on the fact that it is special, but just see it as a memory-optimized integer index).
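For illustration, here is a small sketch (not code from this PR; `range_index_attrs` is a hypothetical helper) that prefers the public attributes where they exist and falls back to the private ones used in the diff above, since later pandas versions made `start`/`stop`/`step` public on `RangeIndex`:

```python
import pandas as pd

def range_index_attrs(level):
    """Read start/stop/step from a RangeIndex, tolerating old pandas."""
    def attr(name):
        try:
            return getattr(level, name)        # public (newer pandas)
        except AttributeError:
            return getattr(level, '_' + name)  # private (older pandas)
    return {'start': attr('start'), 'stop': attr('stop'), 'step': attr('step')}

idx = pd.RangeIndex(start=0, stop=10, step=2)
print(range_index_attrs(idx))  # {'start': 0, 'stop': 10, 'step': 2}
```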
Before:
After:
cc @TomAugspurger @mrocklin in case of interest
I'll add this to the ASV suite so we can track it. @xhochy I will need you to review this, as there were some changes required to the Parquet test suite.
I fear that this will break backward compatibility. We should first introduce some tests where we have hardcoded the pandas schema JSON, and then ensure that we can still read old versions. We had some problems in the past where pyarrow refused to read old Parquet files due to breaking changes in the pandas metadata. There have been none recently, as we haven't touched this code for a while.
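A hedged sketch of what such a hardcoded-metadata test could look like (the JSON below imitates the pre-change format as described in this thread; the names and the test itself are illustrative, not the actual tests added in this PR):

```python
import json

# Old-style pandas metadata, hardcoded: the index was written as a real
# data column, and index_columns held plain column-name strings rather
# than descriptor dicts.
OLD_METADATA = json.dumps({
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'x', 'pandas_type': 'int64'},
        {'name': '__index_level_0__', 'pandas_type': 'int64'},
    ],
    'pandas_version': '0.22.0',
})

def test_reads_old_metadata():
    meta = json.loads(OLD_METADATA)
    # A reader of the new code must still accept string descriptors
    # that carry no 'kind' field at all.
    assert all(isinstance(c, str) for c in meta['index_columns'])

test_reads_old_metadata()
print('ok')
```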
@@ -493,7 +493,7 @@ def test_recordbatchlist_to_pandas():
     table = pa.Table.from_batches([batch1, batch2])
     result = table.to_pandas()
-    data = pd.concat([data1, data2])
+    data = pd.concat([data1, data2]).reset_index(drop=True)
Why is this now necessary and why did it work before?
It worked before because the pandas indexes had been converted to data columns. This change turns the RangeIndex into metadata instead.
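To illustrate with a small stock-pandas sketch: `pd.concat` preserves each input frame's own index, so once each table round-trips back with a fresh default index, the plain concatenation produces duplicate labels and no longer matches:

```python
import pandas as pd

data1 = pd.DataFrame({'x': [1, 2]})
data2 = pd.DataFrame({'x': [3, 4]})

# concat keeps each frame's RangeIndex, so the labels repeat...
print(list(pd.concat([data1, data2]).index))                         # [0, 1, 0, 1]
# ...while reset_index(drop=True) restores one fresh RangeIndex
print(list(pd.concat([data1, data2]).reset_index(drop=True).index))  # [0, 1, 2, 3]
```

(`pd.concat([...], ignore_index=True)` would achieve the same result as the `reset_index(drop=True)` call.)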
types : List[pyarrow.DataType]

Returns
-------
dict
"""
num_serialized_index_levels = len([descr for descr in index_descriptors
                                   if descr['kind'] == 'serialized'])
'kind' is a new attribute that is not set in old metadata descriptions. Is this handled somewhere?
I need to make a PR to resolve the pandas issue adding this. I have already implemented backwards compatibility for this, but I will add some unit tests to assert it.
@xhochy I'll add some unit tests with hard-coded "old" metadata to ensure that old files can still be read correctly.
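As a hedged sketch of what that compatibility handling could look like (the helper name and descriptor shapes here are illustrative; older metadata stored index descriptors as plain column-name strings with no 'kind' field):

```python
def count_serialized_index_levels(index_descriptors):
    """Count index levels stored as real columns, tolerating old metadata."""
    n = 0
    for descr in index_descriptors:
        if isinstance(descr, str):
            # old-style descriptor: just a column name, always a data column
            n += 1
        elif descr.get('kind') == 'serialized':
            # new-style dict descriptor pointing at a data column
            n += 1
    return n

old_style = ['__index_level_0__']
new_style = [{'kind': 'range', 'start': 0, 'stop': 4, 'step': 1}]
print(count_serialized_index_levels(old_style + new_style))  # 1
```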
By the way, the kind of compatibility we are interested in is forward compatibility:
I'll get this sorted out today.
…verting to data column. This affects serialize_pandas and writing to Parquet format
I added the forward compatibility tests. I also added a version number to the metadata
Appveyor build: https://ci.appveyor.com/project/wesm/arrow/builds/23025530. It would be good to merge this soon if it looks OK. I am working on ARROW-4637, which conflicts with these changes because of refactoring I did in
I'll just rebase my WIP patch on top of this branch for now so I'm not blocked.
I'm working on a PR into pandas also, in case we need to make alterations to the metadata |
+1, LGTM
Maybe @jorisvandenbossche or @TomAugspurger also want to take a look?
LGTM on a quick glance. Opened pandas-dev/pandas#25710 for the RangeIndex attribute issue, but as Joris said, I don't expect the private versions to change.
Thanks everyone!
This ended up being much more difficult than anticipated due to the spaghetti-like state (as the result of many hacks) of pyarrow/pandas_compat.py.
This is partly a performance and memory use optimization. It has consequences, though: tables will have some index data discarded when concatenated from multiple pandas DataFrame objects that were converted to Arrow. I think this is OK, since the preservation of pandas indexes is generally something that's handled at the granularity of a single DataFrame. One always has the option of calling reset_index to convert a RangeIndex if that's what is desired. This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in pandas-dev/pandas#25672.
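As a rough illustration of the metadata-based representation (field names follow the discussion above; the authoritative schema is the one specified in pandas-dev/pandas#25672):

```python
import json

# A RangeIndex stored as a descriptor dict rather than as a data column:
descr = {'kind': 'range', 'name': None, 'start': 0, 'stop': 6, 'step': 2}

# The descriptor travels as JSON inside the Arrow schema metadata,
# so it must round-trip through json cleanly:
assert json.loads(json.dumps(descr)) == descr

# Reconstructing the index values on read is then just Python's range():
print(list(range(descr['start'], descr['stop'], descr['step'])))  # [0, 2, 4]
```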