Serialization / Deserialization of ExtensionArrays #20612
Wouldn't support msgpack generally.
For pickling, ensuring your ExtensionArray is picklable should be enough (the rest already works; the blocks pickle the values). For geopandas we have this: …
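As a quick illustration of that, pandas' own nullable Int64 dtype (itself backed by an ExtensionArray) round-trips through pickle without any extra hooks; a minimal sketch:

```python
import pickle

import pandas as pd

# Nullable Int64 is backed by an ExtensionArray (IntegerArray), so it serves
# as a stand-in for any third-party extension type here.
s = pd.Series([1, 2, None], dtype="Int64")

restored = pickle.loads(pickle.dumps(s))

assert restored.equals(s)         # values (including the missing one) survive
assert restored.dtype == s.dtype  # the extension dtype survives too
```

If an array holds state that pickle cannot handle by default, the usual Python hooks (__getstate__ / __setstate__ or __reduce__) on the array class are all that is needed; pandas itself does not have to know about them.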
I am interested in discussing the parquet / arrow aspect of this issue (see eg also apache/arrow#4168, and I would also like to potentially use this in GeoPandas). In general, I think it would be useful to have an interface to convert an ExtensionArray to an arrow array, and the other way around. Personally, I think this responsibility should be on the ExtensionArray implementation itself (pyarrow should not need to know about all EAs, but only about a specific interface). So eg we could add to_arrow / from_arrow methods to the ExtensionArray interface.
That would allow pyarrow to call those methods when receiving an ExtensionArray, and in principle (if they want that) even restore it when converting back to pandas, if they store the extension dtype name in the pandas metadata (with something like …).
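As a rough sketch of what such an interface could look like (the RaggedArray class, the to_arrow / from_arrow method names, and the choice of Arrow type below are all illustrative assumptions, not an agreed API):

```python
import pyarrow as pa


class RaggedArray:
    """Toy array-like (not a full ExtensionArray) wrapping lists of floats."""

    def __init__(self, data):
        self._data = list(data)

    def to_arrow(self) -> pa.Array:
        # The implementation decides which Arrow type it maps to,
        # e.g. list<double> for ragged float data.
        return pa.array(self._data, type=pa.list_(pa.float64()))

    @classmethod
    def from_arrow(cls, arr: pa.Array) -> "RaggedArray":
        # Inverse conversion, used when reading back from Arrow / Parquet.
        return cls(arr.to_pylist())


arrow_arr = RaggedArray([[1.0, 2.0], [3.0]]).to_arrow()
restored = RaggedArray.from_arrow(arrow_arr)
```

pyarrow would then only need to know about this small interface, not about each individual extension array.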
Seems reasonable to reuse arrow where it has readers / writers. Do we foresee problems with overloading to_arrow for both conversion to arrow and parquet IO? I don't really see any problems right now.
One case I can think of is if you would have a custom data type that maps nicely to a certain arrow type (eg ragged arrays / multipolygons as lists of lists), but which is not supported in the specific IO option (eg the current arrow <-> parquet support for nested data is limited), so you might want to choose a suboptimal one for a certain IO option (eg binary blob instead).
To have it less pandas-EA-specific, arrow could also look for a generic protocol method (something like __arrow_array__) on any array-like object.
To update on the pyarrow aspect of this issue:
Those improvements are about pandas -> arrow conversion. To also do arrow -> pandas (to roundtrip those pandas ExtensionArray columns), there is discussion in https://issues.apache.org/jira/browse/ARROW-2428.
Sorry, I'm a bit confused. This would be a method on the ExtensionArray subclass itself?
That's also possible, but that's just another step of indirection (the dtype stored in the metadata would first have to be mapped to its array class).
OK, either works. I just typically think of the array, rather than the dtype, as the natural place for this.
Another reason to maybe do it on the dtype is if you have multiple dtypes mapping to the same array class, like we do with IntegerArray, or fletcher's FletcherArray (although in those two cases there is a direct mapping of the arrow type to the pandas dtype, so that would also not be a problem).
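A toy sketch of the dtype-centric variant (all names made up) illustrating that point: several dtype instances can share one array class, and only the dtype knows which concrete Arrow type to restore from:

```python
import pyarrow as pa


class IntegerishDtype:
    """Toy stand-in for an ExtensionDtype; several instances map to one array class."""

    def __init__(self, arrow_type):
        self.arrow_type = arrow_type  # e.g. pa.int32() or pa.int64()

    def from_arrow(self, arr: pa.Array) -> "IntegerishArray":
        # The dtype knows which concrete type it represents, so it can
        # rebuild the right array even though the array class is shared.
        return IntegerishArray(arr.cast(self.arrow_type).to_pylist(), dtype=self)


class IntegerishArray:
    """One array class shared by all IntegerishDtype instances."""

    def __init__(self, values, dtype):
        self._values = list(values)
        self.dtype = dtype


int32_dtype = IntegerishDtype(pa.int32())
arr = int32_dtype.from_arrow(pa.array([1, 2, 3]))
```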
Hello everyone, what is the status on this issue?
Some of the formats listed in the top post are handled (pickle, parquet), but I don't think we already have a mechanism in place for hooking into JSON (de)serialization, and I don't think anybody has looked into this yet. The actual conversion from a DataFrame into the JSON string is implemented in C, in pandas/_libs/src/ujson/python/objToJSON.c (line 1957 at commit d7eadde).
Which values exactly are extracted from the DataFrame and handled in the C code is determined at: …
My quick guess is that if your ExtensionArray converts to a numpy array with "known" objects (eg datetime.date), I would expect that it can actually already work. Long term, I don't know what the best solution would be if you want to do something more custom than having a numpy array with JSON-serializable objects.
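As a small illustration of that expectation, pandas' own nullable Int64 dtype (backed by an ExtensionArray) should already serialize through to_json without any EA-specific hook, since its values convert to plain integers:

```python
import pandas as pd

# Nullable Int64 is itself backed by an ExtensionArray, so it acts as a
# stand-in for an extension type whose values convert to plain Python
# objects that the ujson writer already understands.
df = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")})

print(df.to_json())  # expected: {"a":{"0":1,"1":2,"2":3}}
```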
My extension divides the date into three numpy arrays: year, month and day. As you would expect, these are plain numbers and thus JSON serializable; however, they're being serialized as strings. I think something is happening between the internal storage and the serialization: it seems that before conversion, the arrays are turned back into dates. I am open to suggestions, especially the names of the methods I need to implement to ensure that my values are serialized as numbers rather than strings.

My PoC is available at https://gist.github.com/jmg-duarte/4a518f3c9ff484a575f336b71f62b0e1. The code is fairly simple; I based it on https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/span.py and reviewed the pandas code it points to.
There are a few issues here:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.3.4'
In [3]: df = pd.DataFrame({"adate" : pd.to_datetime(["11/29/2021", "11/30/2021"])})
In [4]: df
Out[4]:
adate
0 2021-11-29
1 2021-11-30
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 adate 2 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
In [6]: rb = pd.read_json(df.to_json(orient="table"), orient="table")
In [7]: rb
Out[7]:
adate
0 2021-11-29
1 2021-11-30
In [8]: rb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 adate 2 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 32.0 bytes
@Dr-Irv your example is using a datetime64 column, not datetime.date objects:

In [59]: import datetime
In [60]: df = pd.DataFrame({"date": [datetime.date.today()]})
In [61]: df.to_json(orient='table')
Out[61]: '{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"date","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"date":"2021-12-01T00:00:00.000Z"}]}'

So there are 2 problems with this: the schema records the column type as "string" rather than a date type, and the value itself is written as a full timestamp ("2021-12-01T00:00:00.000Z") instead of a plain date, so it will not round-trip back to datetime.date.
Addressing point 3: what you did is side-stepping the actual problem. Putting it simply, I have actual datetime.date values, not timestamps. The reason I am using an ExtensionArray is to give those values a proper dtype. I will be more explicit about what happens and what I expected to happen, and I also provide some more context: I work for a company which handles timeseries data, so we have DataFrames with columns holding datetime.date and datetime.time values. Currently, both fail to round-trip through to_json / read_json with orient='table':

import datetime as dt
import pandas as pd
date_df = pd.DataFrame({"d": [dt.date.today()]})
date_df_json = date_df.to_json(orient='table')
date_df_back = pd.read_json(date_df_json, orient='table')
assert date_df.equals(date_df_back) # fails
time_df = pd.DataFrame({"t": [dt.time()]})
time_df_json = time_df.to_json(orient='table')
time_df_back = pd.read_json(time_df_json, orient='table')
assert time_df.equals(time_df_back) # fails
What I was expecting was that both asserts would pass, i.e. that the round-trip preserves the original values and dtypes.

The comment from @jorisvandenbossche in #32037 (comment) makes a lot of sense, and it is in fact backwards compatible (I tested it: pandas ignores "unknown" fields, so we can extend the metadata as we wish). Furthermore, I agree with the comment from @WillAyd (#32037 (comment)): the extra metadata should solve the problem (at least for dates it could). I think this precise topic (dates) should either be moved to a new issue or to #32037, whichever is best; I'm more than happy to move the discussion there (or not move it at all).
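To make the metadata idea concrete, a hypothetical extended Table Schema entry could keep the lossy primitive type used for the payload and additionally record the original extension dtype; the key name and dtype name below are assumptions, not what pandas currently writes:

```python
# Hypothetical sketch of extended Table Schema metadata: next to the lossy
# primitive "type" used for the JSON payload, an extra key records the name
# of the original extension dtype so a reader could restore it.
schema_field = {
    "name": "d",
    "type": "string",       # what the JSON payload actually contains
    "extDtype": "my_date",  # made-up registered extension dtype name
}

# A reader aware of the extra key could then do, roughly:
#     if "extDtype" in field: column = column.astype(field["extDtype"])
# once "my_date" is registered via pandas' extension dtype registry.
```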
(just getting an issue number to link to. Will update later)
What hooks do we want to provide for ExtensionArrays stored inside our containers?
Currently, to_csv works. Haven't really tried others.
xref https://github.com/pandas-dev/pandas/pull/20611/files
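A quick illustration of the to_csv point, using the built-in nullable Int64 dtype as a stand-in for a custom ExtensionArray:

```python
import pandas as pd

# Nullable Int64 stands in for a custom ExtensionArray; to_csv simply
# formats each value, so the missing value becomes the default empty string.
df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

print(df.to_csv(index=False))
# a
# 1
#
# 3
```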