Serialize all pyarrow extension arrays efficiently #9740
Conversation
`dask/dataframe/_pyarrow_compat.py` (outdated diff):
```python
def rebuild_arrowextensionarray(chunks):
    array = pa.chunked_array(chunks)
    if PANDAS_GT_150:
        return pd.arrays.ArrowExtensionArray(array)
```
For maximal backwards compat, if the data was pyarrow string I think it will be necessary to know whether it was `pd.StringDtype("pyarrow")` vs `pd.ArrowDtype`, because the former should be constructed using `ArrowStringArray` while the latter should be constructed via `ArrowExtensionArray`.
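For reference, a minimal illustration of the two pyarrow-backed string dtypes in question (assuming `pandas>=1.5` for `pd.ArrowDtype`):

```python
import pandas as pd
import pyarrow as pa

# The same logical data, backed by two different pandas extension array classes
s1 = pd.Series(["a", "b"], dtype=pd.StringDtype("pyarrow"))
s2 = pd.Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))

print(type(s1.array).__name__)  # ArrowStringArray
print(type(s2.array).__name__)  # ArrowExtensionArray
```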
Thanks for taking a look @mroeschke. Is the reasoning for this the more feature-complete pyarrow-backed string implementation we talked about earlier in pandas-dev/pandas#50074 (comment)? Or something else?
Yeah exactly. If in 2.0 `ArrowExtensionArray` has feature parity with `ArrowStringArray`, I suppose you could always use `ArrowExtensionArray`, since this is all internal dask serialization and not external pickling per se?
Thought about this more and decided to just return whatever the original type is. Given that we're patching `pickle`, I think it makes sense to always have the same input/output type. I think this is consistent with your previous comment, but wanted to highlight it just in case.
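In code, the idea looks roughly like this (an illustrative sketch, not the exact implementation in this PR; it uses `__arrow_array__` to get at the underlying `pyarrow.ChunkedArray`):

```python
import pyarrow as pa

def reduce_arrowextensionarray(array):
    # Record the concrete class (ArrowStringArray or ArrowExtensionArray)
    # so that unpickling reconstructs exactly the input type.
    chunks = array.__arrow_array__().chunks
    return (rebuild_arrowextensionarray, (type(array), chunks))

def rebuild_arrowextensionarray(type_, chunks):
    # Both array classes accept a pyarrow ChunkedArray in their constructor
    return type_(pa.chunked_array(chunks))
```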
Okay, after testing the implementation here against the notebook Ian put together (https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e) I uncovered some issues that have been resolved in the latest commit. Leaving a couple of comments to highlight what was wrong and what changed.
```python
for type_ in [pd.arrays.ArrowExtensionArray, pd.arrays.ArrowStringArray]:
    copyreg.dispatch_table[type_] = reduce_arrowextensionarray
```
When available, we need to make sure to register `copyreg` entries for both `pd.arrays.ArrowExtensionArray` and `pd.arrays.ArrowStringArray`. This way, both pyarrow string implementations in pandas will pick up the serialization fixes here. I've added a test which makes sure we handle both the `pd.StringDtype("pyarrow")` and `pd.ArrowDtype(pa.string())` cases.
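A round-trip check in the spirit of that test might look like this (illustrative only, not the exact test added here):

```python
import pickle

import pandas as pd
import pyarrow as pa

for dtype in [pd.StringDtype("pyarrow"), pd.ArrowDtype(pa.string())]:
    s = pd.Series(["a", "b", None], dtype=dtype)
    result = pickle.loads(pickle.dumps(s))
    assert result.dtype == dtype  # the original dtype is preserved
    pd.testing.assert_series_equal(result, s)
```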
```python
sliced_pickled = pickle.dumps(expected_sliced)

# Make sure slicing gives a large reduction in serialized bytes
assert len(full_pickled) > len(sliced_pickled) * 3
```
Previously this assert was `assert len(full_pickled) > len(sliced_pickled)`. It turns out that even without any of the serialization patches here, this assert would still pass. This is because pickled sliced extension arrays were still smaller than the pickled original extension array, but only a tiny bit smaller (e.g. 80.20 kiB for the sliced array vs. 80.29 kiB for the original array). When we apply the `copyreg` patches, we see a much larger reduction in the serialized size (e.g. 799 B for the sliced array vs. 80.23 kiB for the original array).

I've gone ahead and added an extra factor of 3 here to mean "we see a significant reduction in the serialized size".

cc @mroeschke as I suspect the corresponding tests in pandas will see something similar
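For anyone wanting to reproduce that kind of measurement, something like the following works (an illustrative sketch: exact byte counts vary, and the factor-of-3 assertion assumes the `copyreg` patches are in effect):

```python
import pickle

import dask.dataframe  # noqa: F401  (assumed to register the copyreg patches)
import pandas as pd

s = pd.Series(["x" * 10] * 10_000, dtype="string[pyarrow]")
sliced = s.head(5)

full_pickled = pickle.dumps(s)
sliced_pickled = pickle.dumps(sliced)
print(len(full_pickled), len(sliced_pickled))

# Slicing must give a *large* reduction, not just any reduction
assert len(full_pickled) > len(sliced_pickled) * 3
```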
Ah, thanks for the tip! I'll go ahead and make this check stricter on the pandas side too, just to validate.
Hmm, interestingly, when running the similar test on the pandas side I am not seeing a huge difference:
```
______________________________________________ test_pickle_roundtrip ______________________________________________

    @skip_if_no_pyarrow
    def test_pickle_roundtrip():
        # GH 42600
        expected = pd.Series(range(10), dtype="string[pyarrow]")
        expected_sliced = expected.head(2)
        full_pickled = pickle.dumps(expected)
        sliced_pickled = pickle.dumps(expected_sliced)
        # Testing that pickling the sliced object results in a _significant_ (2x)
        # reduction in serialized size
>       assert len(full_pickled) > len(sliced_pickled) * 2
E       assert 818 > (778 * 2)
E        +  where 818 = len(b"\x80\x04\x95'\x03\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\x8c..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.")
E        +  and   778 = len(b'\x80\x04\x95\xff\x02\x00\x00\x00\x00\x00\x00\x8c\x12pandas.core.series\x94\x8c\x06Series\x94\x93\x94)\x81\x94}\x94(\..._metadata\x94]\x94h\x12a\x8c\x05attrs\x94}\x94\x8c\x06_flags\x94}\x94\x8c\x17allows_duplicate_labels\x94\x88sh\x12Nub.')

pandas/tests/arrays/string_/test_string_arrow.py:213: AssertionError
```

```python
In [1]: 818 / 778
Out[1]: 1.051413881748072
```
Still a reduction, but I'll probably just leave this test alone on the pandas side.
Planning to merge tomorrow if there are no further comments
Thanks for this @jrbourbeau and @mroeschke!

Thanks for all the initial work @ian-r-rose! That notebook was super useful
This PR swaps out our custom logic for `pickle`ing `arrow`-backed extension arrays with the implementation in the upcoming `pandas=2` release (xref pandas-dev/pandas#49078). As discussed in #9613, the new implementation is much more straightforward, while being roughly as performant. It also applies to all `ArrowExtensionArray`s, not just `ArrowStringArray`.

I'll want to run the changes here against the notebook Ian provided in #9613 to make sure the performance benchmarks still hold, but the changes here should be ready for review.
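For reviewers skimming the thread, the overall shape of the patch is roughly as follows (a simplified sketch consolidating the pieces discussed in the review comments, not the exact module; the real code also handles missing pyarrow and older pandas versions):

```python
import copyreg

import pandas as pd
import pyarrow as pa

def rebuild_arrowextensionarray(type_, chunks):
    # Reconstruct the original array class from raw pyarrow chunks
    return type_(pa.chunked_array(chunks))

def reduce_arrowextensionarray(array):
    chunks = array.__arrow_array__().chunks
    return (rebuild_arrowextensionarray, (type(array), chunks))

# Register for whichever arrow-backed array classes this pandas provides
for name in ["ArrowExtensionArray", "ArrowStringArray"]:
    type_ = getattr(pd.arrays, name, None)
    if type_ is not None:
        copyreg.dispatch_table[type_] = reduce_arrowextensionarray
```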
cc @mroeschke for visibility
Closes #9613