
[Dataset] ray failed to serialize pyarrow7.0.0 Tables #22310

Closed
1 of 2 tasks
scv119 opened this issue Feb 11, 2022 · 6 comments · Fixed by #29055
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), size:medium

Comments

@scv119 (Contributor) commented Feb 11, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

See #22253 (comment) and #22177 for more context.

TL;DR: when serializing pyarrow 7.0.0 Tables, the serialized data comes out with the wrong size (as large as 200 GB).

Versions / Dependencies

pyarrow 7.0.0

Reproduction script

With pyarrow 7.0.0 installed, run:

python ray/release/nightly_tests/dataset/ray_sgd_runner.py --num-epochs 1 --smoke-test

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@scv119 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage) labels on Feb 11, 2022
@scv119 added this to the Datasets GA milestone on Feb 11, 2022
@scv119 added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Feb 11, 2022
@scv119 changed the title from "[Dataset] ray serialization failed to seal pyarrow7.0.0 Tables" to "[Dataset] ray failed to serialize pyarrow7.0.0 Tables" on Feb 11, 2022
@clarkzinzow (Contributor) commented:

Hmm, interesting. @scv119 I tried to reproduce this locally and got roughly the same high-water mark for object store utilization and number of bytes consumed by tasks when comparing pyarrow==6.0.1 and pyarrow==7.0.0. 🤔

@ericl (Contributor) commented Feb 23, 2022

Here's a minimal repro. With pyarrow==6.0.1 it works fine. With 7.0, you get a 400GiB plasma allocated block...

import ray

ray.data.range_arrow(100e6, parallelism=1).write_parquet("/tmp/big")
ray.data.read_parquet("/tmp/big").show()

The issue seems to happen when putting an Arrow table read from parquet into plasma.
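One rough way to confirm that the blow-up is in serialization rather than in the read itself is to compare the pickled payload size of the table against its logical size. This is only a sketch: the pickled size is a proxy for the plasma allocation, not the exact number Ray reports, but it covers the same buffers.

import pickle

import pyarrow.parquet as pq

table = pq.read_table("/tmp/big")  # the Parquet data written by the repro above

# 100M int64 rows are roughly 800 MB of actual data; under pyarrow 7.0.0 the
# pickled payload comes out many times larger because each chunk drags the
# whole backing buffer along with it.
print(table.nbytes)
print(len(pickle.dumps(table)))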

@ericl (Contributor) commented Feb 23, 2022

Inserting a copy() of the block seems to resolve the issue. Maybe there's some large sparse array that's allocated in the pyarrow object when read from parquet in pyarrow 7.0.0.

We can insert the copy as a workaround for that version, and file a bug upstream.
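As a rough illustration of the kind of defensive copy being described, here is a hypothetical helper (copy_table is a made-up name, not the actual Datasets code):

import pyarrow as pa


def copy_table(table: pa.Table) -> pa.Table:
    # Concatenating each column's chunks allocates new contiguous buffers,
    # so no chunk is left as a slice-view onto a larger shared buffer.
    columns = [pa.concat_arrays(list(col.chunks)) for col in table.columns]
    return pa.table(dict(zip(table.column_names, columns)))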

@clarkzinzow (Contributor) commented:

Interesting. I tried reading Parquet data with many different operation combinations and wasn't able to reproduce.

> We can insert the copy as a workaround for that version, and file a bug upstream.

I'll time-box an investigation this week and otherwise do as you suggest. Thank you for finding a repro!

@clarkzinzow (Contributor) commented Mar 7, 2022

After some debugging, it looks like this is the same bug around serializing Arrow array slice-views on an underlying larger buffer. The new issue here is that Arrow's Parquet reader is creating chunked arrays whose chunks are each slice-views on an underlying contiguous buffer containing the entire chunked array, so when the chunked array is serialized, serializing the N chunks results in copying the entire buffer N times.

Here is a minimal reproduction without involving Parquet (this is buggy in all Arrow versions):

In [1]: import pickle

In [2]: import pyarrow as pa

In [3]: base = pa.array(list(range(10000000)))

In [4]: chunked = pa.chunked_array([pa.Array.from_buffers(pa.int64(), 1000000, [None, base.buffers()[1]], offset=i * 1000000) for i in range(10)])

In [5]: len(pickle.dumps(base))
Out[5]: 80000148

In [6]: len(pickle.dumps(chunked))
Out[6]: 800000829
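For comparison, copying the slice-view chunks into freshly allocated buffers before pickling brings the size back down to roughly the ~80 MB of base (a quick check continuing the session above; combine_chunks() is just one convenient way to force the copy):

In [7]: combined = chunked.combine_chunks()

In [8]: len(pickle.dumps(combined))  # roughly the same ~80 MB as pickling base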

The change in Arrow 7.0.0 is that all chunks of the chunked array created for a given column when reading Parquet files point at a single contiguous buffer; in past versions, a different backing buffer was created for each chunk.

In [1]: import pyarrow as pa

In [2]: import pyarrow.parquet as pq

In [3]: t = pa.table({"a": list(range(10000000))})

In [4]: pq.write_table(t, "test.parquet")

In [5]: t2 = pq.read_table("test.parquet")

In [6]: for chunk in t2["a"].chunks:
   ...:     print(chunk.buffers()[1].address)
   ...:
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592
140413841182592

Here you can see that Arrow 6.0.1 created a different backing buffer for each chunk:

$ pip install pyarrow==6.0.1
$ ipython

In [1]: import pyarrow.parquet as pq

In [2]: t = pq.read_table("test.parquet")

In [3]: for chunk in t["a"].chunks:
   ...:     print(chunk.buffers()[1].address)
   ...:
140596942995904
140596912532864
140596746389440
140596729610240
140596708642304
140596717033152
140596683476288
140596691865664
140596654113152
140596662507200

Possible Solutions

  1. Detect these slice-view chunks at read time and copy each slice into a new buffer (eager full copy of column).
  2. Register our own serialization hook for Arrow arrays that does the slice copy from (1), but JIT before serialization and generically for all Arrow arrays (this should let us eliminate some defensive copying that we're doing elsewhere in Datasets); a rough sketch of this approach is given after this list.
  3. Help upstream the actual fix to Arrow.

I'd vote for pushing on (2) and (3) in parallel, with (2) being a short-term solution that should transparently apply to all cases, and (3) being the long-term fix.
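A minimal sketch of what option (2) could look like, assuming ray.util.register_serializer as the hook; the helper names and the IPC-based transport are illustrative only, and the eventual fix (#29055) may differ in detail:

import pyarrow as pa
import ray
from ray.util import register_serializer


def _serialize_table(table: pa.Table) -> bytes:
    # combine_chunks() copies each column's slice-view chunks into contiguous,
    # freshly allocated buffers, so the payload only contains the data once.
    compact = table.combine_chunks()
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, compact.schema) as writer:
        writer.write_table(compact)
    return sink.getvalue().to_pybytes()


def _deserialize_table(payload: bytes) -> pa.Table:
    return pa.ipc.open_stream(payload).read_all()


ray.init()

# ChunkedArray / Array would need analogous registrations to cover Arrow
# objects generically.
register_serializer(
    pa.Table, serializer=_serialize_table, deserializer=_deserialize_table
)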

@ericl (Contributor) commented Mar 7, 2022

Great find. For (2), I think that sounds good if the implementation is fairly simple. Otherwise, we can add a smaller fix in the read logic.

For (3), that also sounds good; we should at least ping upstream that this is impacting downstream users.
