[Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs

I'll take this one on. 

Although we construct the individual NumPy arrays for pandas efficiently, the `pandas.DataFrame` constructor still performs an extra memory copy and consolidation step internally at the end, even in the zero-copy case.

This is particular to the pandas 0.x/1.x memory layout, and will change in the future with pandas 2.0, but that's quite a ways off from wide use. 

We can avoid this overhead for now by:

- computing the exact internal "block" structure of the DataFrame. Since we know the null counts of the Arrow data, we can determine up front whether type casts to accommodate nulls are necessary

- pre-allocating empty column-major blocks

- writing the column data out into the block slices

- constructing the DataFrame from the blocks with zero copy
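
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual pyarrow implementation: plain NumPy arrays plus a boolean validity mask stand in for Arrow columns, the helpers `plan_blocks` and `fill_blocks` are hypothetical names, and only integer and float columns are handled. It uses the public `pandas.DataFrame` constructor with `copy=False`, which can wrap a single-dtype 2D array without copying. Assembling blocks of several dtypes into one DataFrame without a copy is the part that needs the pandas internals, and it is not shown here.

```python
import numpy as np
import pandas as pd


def plan_blocks(columns):
    """Map each output dtype to the names of the columns it will hold.

    `columns` maps name -> (values, valid), where `valid` is a boolean
    mask, or None when the column has no nulls (hypothetical layout).
    """
    plan = {}
    for name, (values, valid) in columns.items():
        dtype = values.dtype
        has_nulls = valid is not None and not valid.all()
        # Integers cannot hold NaN, so an integer column with nulls is
        # promoted to float64. This is decided before allocating output.
        if has_nulls and dtype.kind in "iu":
            dtype = np.dtype("float64")
        plan.setdefault(dtype, []).append(name)
    return plan


def fill_blocks(columns, plan):
    """Allocate one column-major block per dtype and write columns into it."""
    n_rows = len(next(iter(columns.values()))[0])
    result = {}
    for dtype, names in plan.items():
        # order="F" makes every column slice block[:, j] contiguous.
        block = np.empty((n_rows, len(names)), dtype=dtype, order="F")
        for j, name in enumerate(names):
            values, valid = columns[name]
            block[:, j] = values  # write directly into the block slice
            if valid is not None and dtype.kind == "f":
                block[~valid, j] = np.nan  # nulls become NaN
        # copy=False asks pandas to wrap the block instead of copying it.
        frame = pd.DataFrame(block, columns=names, copy=False)
        result[dtype] = (block, frame)
    return result


columns = {
    "a": (np.array([1, 2, 3], dtype="int64"), None),
    "b": (np.array([4, 0, 6], dtype="int64"), np.array([True, False, True])),
    "c": (np.array([0.5, 1.5, 2.5]), None),
}
plan = plan_blocks(columns)
frames = fill_blocks(columns, plan)

block, df = frames[np.dtype("float64")]
print(df)
print("shares memory:", np.shares_memory(block, df["c"].to_numpy()))
```

Here column `b` is planned as `float64` because it contains a null, so it lands in the same block as `c`, while `a` stays `int64` in a block of its own.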

**Reporter**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-432) / @wesm
**Assignee**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-432) / @wesm
#### Related issues:
- [[Python] Deserialize from Arrow record batches to pandas in parallel using a thread pool](https://github.com/apache/arrow/issues/16076) (is related to)

<sub>**Note**: *This issue was originally created as [ARROW-432](https://issues.apache.org/jira/browse/ARROW-432). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

[Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs #16079
