Replaced `RecordBatch` by `Chunk` #717

jorgecarleitao · 2021-12-28T06:46:39Z

closed #673

This is a major refactor of the crate's IO interfaces; see #673 for details.

This PR:

Replaces RecordBatch with a new struct, Chunk, which contains a vec of arrays of the same length. All IO interfaces now use Chunk and follow two steps:

read or infer schema (logical plane)
read columns (physical plane)

This lets users avoid leaking logical information into the physical plane unless the format requires it.
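
As a rough illustration of the two steps above, here is a minimal sketch using a toy CSV-like format. Everything in it (`Schema`, `Chunk`, `read_schema`, `read_chunk`) is a hypothetical stand-in written for this example, not arrow2's API; it only shows that the column reader needs no field names.

```rust
/// Logical plane: column names only (a stand-in for a schema).
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Physical plane: equal-length columns with no names attached.
/// Hypothetical simplification: every column holds `i64` values.
#[derive(Debug, PartialEq)]
pub struct Chunk {
    pub columns: Vec<Vec<i64>>,
}

/// Step 1: read the schema from the header line.
pub fn read_schema(data: &str) -> Schema {
    let header = data.lines().next().unwrap_or("");
    Schema {
        fields: header.split(',').map(|s| s.trim().to_string()).collect(),
    }
}

/// Step 2: read the columns. Only the number of columns is required,
/// so no logical information reaches this function.
pub fn read_chunk(data: &str, num_columns: usize) -> Result<Chunk, String> {
    let mut columns = vec![Vec::new(); num_columns];
    // Skip the header line, which belongs to the logical plane.
    for line in data.lines().skip(1) {
        let values: Vec<&str> = line.split(',').collect();
        if values.len() != num_columns {
            return Err(format!(
                "expected {} values, got {}",
                num_columns,
                values.len()
            ));
        }
        for (column, value) in columns.iter_mut().zip(values) {
            let parsed = value.trim().parse::<i64>().map_err(|e| e.to_string())?;
            column.push(parsed);
        }
    }
    Ok(Chunk { columns })
}
```

A caller would run `read_schema` once, keep the result, and then call `read_chunk` for the data, passing only `schema.fields.len()`.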

All IO APIs were refactored to read and write Chunk (instead of RecordBatch). This removes much of the boilerplate needed to write a file.

Migration path

RecordBatch -> Chunk<Arc<dyn Array>>
RecordBatch::num_rows() -> Chunk::len()
RecordBatch::columns() -> Chunk::columns()
RecordBatch::column(i) -> Chunk::columns()[i]
RecordBatch::num_columns() -> Chunk::columns().len()
RecordBatch::schema() -> no longer present. Use other APIs (usually metadata) to get the schema
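
To make the mapping above concrete, the following is a simplified model of a `Chunk`-like container, assuming only what this PR states: it holds a vec of arrays with the same length and exposes `len()` and `columns()`. The `Array` trait, `Int32Array`, and `try_new` here are stand-ins written for this sketch and are not copied from arrow2.

```rust
use std::sync::Arc;

/// Stand-in for an array trait: only the length matters in this sketch.
pub trait Array {
    fn len(&self) -> usize;
}

/// Toy array type used only for this illustration.
pub struct Int32Array(pub Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Simplified model of `Chunk`: a vec of arrays that all share one length.
pub struct Chunk<A: AsRef<dyn Array>> {
    arrays: Vec<A>,
}

impl<A: AsRef<dyn Array>> Chunk<A> {
    /// Returns an error if the arrays do not all have the same length.
    pub fn try_new(arrays: Vec<A>) -> Result<Self, String> {
        if let Some(first) = arrays.first() {
            let len = first.as_ref().len();
            if arrays.iter().any(|a| a.as_ref().len() != len) {
                return Err("all arrays in a Chunk must have the same length".to_string());
            }
        }
        Ok(Self { arrays })
    }

    /// Number of rows (the role played by `RecordBatch::num_rows()`).
    pub fn len(&self) -> usize {
        self.arrays.first().map(|a| a.as_ref().len()).unwrap_or(0)
    }

    /// True when the chunk has no rows.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// The columns (the role played by `RecordBatch::columns()`).
    pub fn columns(&self) -> &[A] {
        &self.arrays
    }
}
```

With a `Chunk<Arc<dyn Array>>` built this way, `chunk.columns()[i]` takes the place of `RecordBatch::column(i)` and `chunk.columns().len()` takes the place of `RecordBatch::num_columns()`.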

codecov · 2021-12-28T06:53:25Z

Codecov Report

Merging #717 (743b0da) into main (ef7937d) will increase coverage by 0.11%.
The diff coverage is 68.10%.

@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
+ Coverage   70.25%   70.37%   +0.11%     
==========================================
  Files         312      311       -1     
  Lines       17016    16920      -96     
==========================================
- Hits        11955    11907      -48     
+ Misses       5061     5013      -48

Impacted Files	Coverage Δ
benches/filter_kernels.rs	`0.00% <ø> (ø)`
src/array/list/mutable.rs	`74.28% <0.00%> (-2.19%)`	⬇️
src/compute/filter.rs	`52.85% <0.00%> (-0.77%)`	⬇️
src/compute/merge_sort/mod.rs	`87.36% <ø> (ø)`
src/compute/sort/lex_sort.rs	`68.42% <ø> (ø)`
src/datatypes/mod.rs	`97.22% <ø> (+15.82%)`	⬆️
src/io/csv/read/deserialize.rs	`100.00% <ø> (ø)`
src/io/csv/read_async/deserialize.rs	`100.00% <ø> (ø)`
src/io/flight/mod.rs	`0.00% <0.00%> (ø)`
src/io/ipc/write/stream_async.rs	`55.55% <0.00%> (+0.29%)`	⬆️
... and 32 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ef7937d...743b0da.

jorgecarleitao · 2021-12-28T07:54:55Z

Renamed to Chunk based on @sundy-li's suggestion: #673 (comment)

jorgecarleitao · 2022-01-02T12:55:03Z

@yjshen, @houqp, @sundy-li could you take a look at this?

I envision some pain with this PR in DataFusion, as DataFusion currently passes logical information (Schema) down to the physical nodes.

Because this PR requires less information to write, one way to go is to declare in DataFusion:

pub struct RecordBatch {
    pub columns: Chunk<Arc<dyn Array>>,
    pub schema: Arc<Schema>,
}

and pass columns to the interfaces (this schema is now useless from arrow2's perspective, since the schema is known before the first batch is available).

For reading, likewise, the schema is always known before the first batch is read. Thus, we can just store an Arc<Schema> after reading the metadata or inferring the schema, and clone it for every batch that comes from IO.
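
A minimal sketch of that idea follows, assuming a downstream crate defines its own batch type. `Schema`, the `Chunk` alias, and `with_schema` are hypothetical stand-ins for this example (the real column type would be `Chunk<Arc<dyn Array>>`); the point is only that one `Arc<Schema>` is obtained up front and cheaply cloned per batch.

```rust
use std::sync::Arc;

/// Stand-in for a schema (logical information).
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Stand-in for `Chunk<Arc<dyn Array>>`: columns are plain vectors here.
pub type Chunk = Vec<Vec<i64>>;

/// Downstream-defined batch pairing physical columns with a shared schema.
pub struct RecordBatch {
    pub columns: Chunk,
    pub schema: Arc<Schema>,
}

/// Attaches the same schema to every chunk produced by a reader.
/// Cloning an `Arc` only bumps a reference count; the schema is not copied.
pub fn with_schema<I>(chunks: I, schema: Arc<Schema>) -> impl Iterator<Item = RecordBatch>
where
    I: Iterator<Item = Chunk>,
{
    chunks.map(move |columns| RecordBatch {
        columns,
        schema: schema.clone(),
    })
}
```

When writing, such a wrapper would hand only its `columns` field to the IO interface, since the schema is already known to the writer.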

yjshen · 2022-01-03T07:24:35Z

The new Chunk API LGTM.

jorgecarleitao added the backwards-incompatible label Dec 28, 2021

jorgecarleitao force-pushed the record branch from 2c13a33 to a0358be December 28, 2021 07:05

jorgecarleitao changed the title ~~Replaced RecordBatch by Columns~~ Replaced RecordBatch by Chunk Dec 28, 2021

jorgecarleitao force-pushed the record branch 12 times, most recently from 10b7f4f to 8cffcb1 December 31, 2021 18:14

jorgecarleitao marked this pull request as ready for review December 31, 2021 19:04

jorgecarleitao added 7 commits January 2, 2022 12:41

Removed RecordBatch from avro

ab840d7

Removed RecordBatch from csv

b0c9b9a

Removed RecordBatch from json io

22fed80

Migrated more

2b868e5

Removed RecordBatch

7452336

Borrow -> AsRef

9b4c4fd

Simplified IPC

99ca2c8

jorgecarleitao force-pushed the record branch from 8cffcb1 to ea041be January 2, 2022 12:45

Renamed Columns -> Chunk

743b0da

jorgecarleitao force-pushed the record branch from ea041be to 743b0da January 2, 2022 12:46

jorgecarleitao mentioned this pull request Jan 2, 2022

Ergonomic field and schema creation with Metadata apache/arrow-rs#1091

Closed

jorgecarleitao merged commit 9b54146 into main Jan 3, 2022

jorgecarleitao deleted the record branch January 3, 2022 22:04

sundy-li mentioned this pull request Jan 4, 2022

Added cargo check to benchmarks #730

Merged

jorgecarleitao mentioned this pull request Jan 10, 2022

Discussion: Switch DataFusion to using arrow2? apache/datafusion#1532

Closed

alamb mentioned this pull request Jan 18, 2022

Officially maintained Arrow2 branch apache/datafusion#1556

Merged

nmandery mentioned this pull request Jan 20, 2022

Upgrade to polars 0.19 and arrow2 0.9 nmandery/rout3serv#20

Closed

houqp mentioned this pull request Jan 24, 2022

ARROW2: Implement RecordBatch within Datafusion apache/datafusion#1656

Closed
