Codecov Report
@@ Coverage Diff @@
## main #717 +/- ##
==========================================
+ Coverage 70.25% 70.37% +0.11%
==========================================
Files 312 311 -1
Lines 17016 16920 -96
==========================================
- Hits 11955 11907 -48
+ Misses 5061 5013 -48
Continue to review full report at Codecov.
RecordBatch
by Columns
RecordBatch
by Chunk
Renamed to |
@yjshen, @houqp, @sundy-li could you take a look at this? I envision some pain with this PR in DataFusion, as DataFusion currently passes logical information (
Because this PR requires less information to write, one way to go is to declare in DataFusion
pub struct RecordBatch {
    pub columns: Chunk<Arc<dyn Array>>,
    pub schema: Arc<Schema>,
}
and pass
For reading, likewise, the schema is always known before the first batch is read. Thus, we can just store an |
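To make the suggestion above concrete, here is a minimal sketch of such a wrapper on the DataFusion side. Everything except the two field declarations is an assumption for illustration: `Array`, `Int32Array`, `Schema`, and `Chunk` are simplified stand-ins so the snippet compiles on its own (they are not the crate's real definitions), and `try_new` and `num_rows` are hypothetical helpers.

```rust
use std::sync::Arc;

// Simplified stand-ins (assumptions) so the sketch compiles on its own.
pub trait Array {
    fn len(&self) -> usize;
}

pub struct Int32Array(pub Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Stand-in for the logical schema: only the field names are kept here.
pub struct Schema {
    pub field_names: Vec<String>,
}

/// Stand-in for `Chunk`: a vec of arrays that all share one length.
pub struct Chunk<A: AsRef<dyn Array>> {
    arrays: Vec<A>,
}

impl<A: AsRef<dyn Array>> Chunk<A> {
    /// Panics if the arrays do not all have the same length.
    pub fn new(arrays: Vec<A>) -> Self {
        if let Some(first) = arrays.first() {
            let len = first.as_ref().len();
            assert!(arrays.iter().all(|a| a.as_ref().len() == len));
        }
        Self { arrays }
    }

    pub fn len(&self) -> usize {
        self.arrays.first().map(|a| a.as_ref().len()).unwrap_or(0)
    }

    pub fn columns(&self) -> &[A] {
        &self.arrays
    }
}

/// The wrapper proposed in the comment: physical data plus logical schema.
pub struct RecordBatch {
    pub columns: Chunk<Arc<dyn Array>>,
    pub schema: Arc<Schema>,
}

impl RecordBatch {
    /// Hypothetical constructor: rejects a schema whose field count does not
    /// match the number of columns in the chunk.
    pub fn try_new(
        schema: Arc<Schema>,
        columns: Chunk<Arc<dyn Array>>,
    ) -> Result<Self, String> {
        if schema.field_names.len() != columns.columns().len() {
            return Err(format!(
                "schema has {} fields but chunk has {} columns",
                schema.field_names.len(),
                columns.columns().len()
            ));
        }
        Ok(Self { columns, schema })
    }

    /// Hypothetical helper that keeps the old accessor name on the wrapper.
    pub fn num_rows(&self) -> usize {
        self.columns.len()
    }
}
```

With a wrapper like this, the IO layer would only ever see the `columns` field, while the schema stays on the DataFusion side.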
The new Chunk API LGTM. |
closed #673
This is a major refactor of the crate's IO interfaces; see #673 for details.
This PR:
Replaces RecordBatch by a new struct, Chunk, containing a vec of arrays with the same length. All IO interfaces now use Chunk and behave as follows:
This lets users avoid "leaking" logical information to the physical plane unless the format requires it.
All IO APIs were refactored to read and write Chunk (instead of RecordBatch). This removes much of the boilerplate needed to write a file.
Migration path
RecordBatch -> Chunk<Arc<dyn Array>>
RecordBatch::num_rows() -> Chunk::len()
RecordBatch::columns() -> Chunk::columns()
RecordBatch::column(i) -> Chunk::columns()[i]
RecordBatch::num_columns() -> Chunk::columns().len()
RecordBatch::schema() -> no longer present. Use other APIs (usually metadata) to get the schema.
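The mapping above can be sketched in code roughly as follows. This is an illustration under stated assumptions, not the crate's implementation: `Array`, `Utf8Array`, and `Chunk` are simplified stand-ins defined here so the snippet compiles on its own, the `try_new` error type is a plain `String`, and `describe` is a hypothetical helper. Only the `len()` and `columns()` method names come from the migration list.

```rust
use std::sync::Arc;

// Simplified stand-ins (assumptions) so the sketch compiles on its own.
pub trait Array {
    fn len(&self) -> usize;
}

pub struct Utf8Array(pub Vec<String>);

impl Array for Utf8Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Stand-in for `Chunk`: a vec of arrays with the same length.
pub struct Chunk<A: AsRef<dyn Array>> {
    arrays: Vec<A>,
}

impl<A: AsRef<dyn Array>> Chunk<A> {
    /// Returns an error if the arrays do not all have the same length.
    pub fn try_new(arrays: Vec<A>) -> Result<Self, String> {
        if let Some(first) = arrays.first() {
            let len = first.as_ref().len();
            if arrays.iter().any(|a| a.as_ref().len() != len) {
                return Err("all arrays in a chunk must have the same length".to_string());
            }
        }
        Ok(Self { arrays })
    }

    /// Number of rows, i.e. the common length of the arrays.
    pub fn len(&self) -> usize {
        self.arrays.first().map(|a| a.as_ref().len()).unwrap_or(0)
    }

    pub fn columns(&self) -> &[A] {
        &self.arrays
    }
}

/// Hypothetical helper that exercises each migrated accessor.
/// Returns (number of rows, number of columns, length of column `i`).
pub fn describe(chunk: &Chunk<Arc<dyn Array>>, i: usize) -> (usize, usize, usize) {
    let num_rows = chunk.len(); // was RecordBatch::num_rows()
    let num_columns = chunk.columns().len(); // was RecordBatch::num_columns()
    let column = &chunk.columns()[i]; // was RecordBatch::column(i)
    (num_rows, num_columns, column.len())
}
```

Note that nothing here carries a schema: code that previously called `RecordBatch::schema()` has to get it from elsewhere, as the last entry of the list says.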