-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-4365: [Rust] [Parquet] Implement arrow record reader. #4292
Conversation
@sunchao @nevi-me @andygrove Please help to review this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me from a coding point of view but I'm not familiar enough with parquet format internals to know how correct this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @liurenjie1024, I'm also not very familiar with the parquet code base (planning on spending time on it in the coming weeks). I just have style changes and 1 comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 for the PR. Still reading through it but I'll need more time to think about the overall design and how we can align this with other parts of the project.
Seems this references the C++ version of RecordReader
implementation which allows us to handle nested records with repetition level > 0 (although seems C++ still doesn't handle list type at the moment). In the Rust implementation though, we've already have record/reader.rs
which can handle nested structures and is built on top of the column readers. Thus, I'm wondering if is possible to utilize the existing code (esp. triplet.rs) and implement the Arrow record reader on top of it. Just some food for thought :)
Also cc @sadikovi who originally wrote this part and might be interested.
Also we should change the title of this PR: Implement Arrow record reader. |
@sunchao Thanks for the review. Yes you are right, this is heavily inspired by c++ implementation. Currently c++ implementation can handle List<List<List> type, I'll extend this to handle more cases. Also I took a look at triple.rs. Though it's similar to record reader, they serve different purposes. The main goal of record reader is to help arrow array reader to find the boundaries of nested records and buffer them, while triple.rs is much simpler. |
@liurenjie1024 Ouch. The list collection is actually done in I looked at the code, could you add more documentation around it, otherwise it is quite difficult to understand what is going on? Also would be good to add more tests around different complex schemas including deeply nested nullable lists and maps. Thanks! |
@sadikovi |
@sunchao @sadikovi The following code is from primitive array reader (WIP: read parquet column into arrow primitive array), it demonstrates usage of record reader. impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
fn next_batch(&mut self, batch_size: usize) -> Result<Arc<Array>> {
self.record_reader.reset()?;
let mut records_read = 0usize;
while records_read < batch_size {
let records_to_read = batch_size - records_read;
let records_read_once = self.record_reader.read_records(records_to_read)?;
records_read = records_read + records_read_once;
// Record reader exhausted
if records_read_once < records_to_read {
if let Some(page_reader) = self.pages.next() {
// Read from new page reader
self.record_reader.set_page_reader(page_reader?)?;
} else {
// Page reader also exhausted
break;
}
}
}
// Convert buffer data in record_reader to array
let record_data = self.record_reader.consume_record_data();
let null_bitmap = self.consume_bitmap();
.....
}
} |
Sorry for late review @liurenjie1024 !. I'll take another thorough look on this in the weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 and sorry again for the late review. I guess that it's hard to reuse triplet.rs
in this case.
Does the current record reader only handles the top-level record? i.e., the outer most List
in a List<List<..>>
structure. Curious how this can be used to assemble the inner lists in a multi-level list structure.
Left some comments - mostly minor and about naming/style stuff.
@sunchao Thanks for the review. |
@sunchao @sadikovi |
b77a4ba
to
3969a46
Compare
Nice @liurenjie1024 ! will take a look at this. |
Codecov Report
@@ Coverage Diff @@
## master #4292 +/- ##
=========================================
Coverage ? 82.73%
=========================================
Files ? 87
Lines ? 25430
Branches ? 0
=========================================
Hits ? 21040
Misses ? 4390
Partials ? 0
Continue to review full report at Codecov.
|
@liurenjie1024 thanks again for the POC (sorry for being late as I was on vacation). I think I have a much better understanding on this now. Could you address the comments I had and update this PR? |
@sunchao Sure, will fix it soon |
ed180da
to
85fe336
Compare
ee1ea5d
to
e279a00
Compare
@sunchao Fixed comments. Please take a look when you got a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 ! this overall looks good to me - just left some comments & questions. Also could you check the CI failure?
})?; | ||
|
||
// Fill spaces in column data with default values | ||
let mut data_pos = data_read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current code, if data_read
is 0 you may still enter the loop right? and same thing will happen?
data_buf.swap(level_pos - 1, data_pos - 1); | ||
data_pos -= 1; | ||
} else { | ||
data_buf[level_pos - 1] = T::T::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, you mean some latter processing relies on the fact that we should default value as garbage value?
@sunchao Thanks for the review. Fixed comments. Will take a look at the ci error later. |
Thanks @liurenjie1024 . The PR looks good to me except one minor thing: can you rename |
@sunchao CI passed. I think the c++ build failure is unrelated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
RecordReader reads logical records into memory, this is the prerequisite for ColumnReader. Closes apache#4292 from liurenjie1024/arrow-4365 and squashes the following commits: 874bb23 <Renjie Liu> Fix ci 0c396c7 <Renjie Liu> Fix comments 138fabe <Renjie Liu> Fix one method e279a00 <Renjie Liu> Fix comments 19db497 <Renjie Liu> fix build break e6c4970 <Renjie Liu> Fix style 8295a28 <Renjie Liu> Use backtick instead of triple e1da426 <Renjie Liu> Fix comments 5c8b6b3 <Renjie Liu> Fix format error 1b288a5 <Renjie Liu> Implement record reader. Authored-by: Renjie Liu <liurenjie2008@gmail.com> Signed-off-by: Chao Sun <sunchao@apache.org>
RecordReader reads logical records into memory, this is the prerequisite for ColumnReader.