Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-4365: [Rust] [Parquet] Implement arrow record reader. #4292

Closed
wants to merge 10 commits into from

Conversation

liurenjie1024
Copy link
Contributor

RecordReader reads logical records into memory, this is the prerequisite for ColumnReader.

@liurenjie1024
Copy link
Contributor Author

@sunchao @nevi-me @andygrove Please help to review this.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me from a coding point of view but I'm not familiar enough with parquet format internals to know how correct this is.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @liurenjie1024, I'm also not very familiar with the parquet code base (planning on spending time on it in the coming weeks). I just have style changes and 1 comment.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 for the PR. Still reading through it but I'll need more time to think about the overall design and how we can align this with other parts of the project.

Seems this references the C++ version of RecordReader implementation which allows us to handle nested records with repetition level > 0 (although seems C++ still doesn't handle list type at the moment). In the Rust implementation though, we've already have record/reader.rs which can handle nested structures and is built on top of the column readers. Thus, I'm wondering if is possible to utilize the existing code (esp. triplet.rs) and implement the Arrow record reader on top of it. Just some food for thought :)

Also cc @sadikovi who originally wrote this part and might be interested.

@sunchao
Copy link
Member

sunchao commented May 14, 2019

Also we should change the title of this PR: Implement Arrow record reader.

@liurenjie1024 liurenjie1024 changed the title ARROW-4365: [Rust] [Parquet] Implement record reader. ARROW-4365: [Rust] [Parquet] Implement arrow record reader. May 17, 2019
@liurenjie1024
Copy link
Contributor Author

@sunchao Thanks for the review. Yes you are right, this is heavily inspired by c++ implementation. Currently c++ implementation can handle List<List<List> type, I'll extend this to handle more cases. Also I took a look at triple.rs. Though it's similar to record reader, they serve different purposes. The main goal of record reader is to help arrow array reader to find the boundaries of nested records and buffer them, while triple.rs is much simpler.

@sadikovi
Copy link
Contributor

@liurenjie1024 Ouch.

The list collection is actually done in reader.rs. triplet.rs's goal is producing triplets of spaced values, accounting for definition levels. I am not familiar with arrow reader. Can c++ code handle List<List<List<List<List<...>>>>? It should not matter how many nested levels it has, the algorithm should just work (which might suggest that c++ code might be missing something).

I looked at the code, could you add more documentation around it, otherwise it is quite difficult to understand what is going on? Also would be good to add more tests around different complex schemas including deeply nested nullable lists and maps.

Thanks!

@liurenjie1024
Copy link
Contributor Author

@sadikovi
Currently c++ code can handle any nested level of list, I just didn't express it clearly.
Just took a look at reader.rs. What record_reader.rs does for arrow_reader(which I'll send a PR later after this) is similar to what triplet.rs does for reader.rs. Seems like this name record_reader is misleading, it should be called something like record_splitter.
In fact I have tests for nullable lists, but not maps because currently arrow rust implementation doesn't support map. More nested lists may not help because the main logic is of splitting records are the same.

@liurenjie1024
Copy link
Contributor Author

@sunchao @sadikovi The following code is from primitive array reader (WIP: read parquet column into arrow primitive array), it demonstrates usage of record reader.

impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
    fn next_batch(&mut self, batch_size: usize) -> Result<Arc<Array>> {
        self.record_reader.reset()?;

        let mut records_read = 0usize;
        while records_read < batch_size {
            let records_to_read = batch_size - records_read;

            let records_read_once = self.record_reader.read_records(records_to_read)?;
            records_read = records_read + records_read_once;

            // Record reader exhausted
            if records_read_once < records_to_read {
                if let Some(page_reader) = self.pages.next() {
                    // Read from new page reader
                    self.record_reader.set_page_reader(page_reader?)?;
                } else {
                    // Page reader also exhausted
                    break;
                }
            }
        }

       // Convert buffer data in record_reader to array
       let record_data = self.record_reader.consume_record_data();
       let null_bitmap = self.consume_bitmap();
       .....
    }
}

@sunchao
Copy link
Member

sunchao commented May 24, 2019

Sorry for late review @liurenjie1024 !. I'll take another thorough look on this in the weekend.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 and sorry again for the late review. I guess that it's hard to reuse triplet.rs in this case.

Does the current record reader only handles the top-level record? i.e., the outer most List in a List<List<..>> structure. Curious how this can be used to assemble the inner lists in a multi-level list structure.

Left some comments - mostly minor and about naming/style stuff.

@liurenjie1024
Copy link
Contributor Author

@sunchao Thanks for the review.
It seems that it's a little difficult to understand how this works without a full picture of the implementation. I'll submit a PR for initial version of arrow reader later.

@liurenjie1024
Copy link
Contributor Author

@sunchao @sadikovi
I've completed most part of arrow reader and code is here: https://github.com/liurenjie1024/arrow/tree/arrow-4059/rust/parquet/src/arrow
Hope it will help reviewers to understand what record_reader.rs is doing.
I don't submit the whole reader in a huge PR because I think small PRs is more friendly for reviewing, so please take a look at this.

@sunchao
Copy link
Member

sunchao commented Jun 27, 2019

Nice @liurenjie1024 ! will take a look at this.

@codecov-io
Copy link

codecov-io commented Jul 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@f35dd92). Click here to learn what that means.
The diff coverage is 89.76%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4292   +/-   ##
=========================================
  Coverage          ?   82.73%           
=========================================
  Files             ?       87           
  Lines             ?    25430           
  Branches          ?        0           
=========================================
  Hits              ?    21040           
  Misses            ?     4390           
  Partials          ?        0
Impacted Files Coverage Δ
rust/parquet/src/arrow/record_reader.rs 89.76% <89.76%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f35dd92...874bb23. Read the comment docs.

@sunchao
Copy link
Member

sunchao commented Jul 15, 2019

@liurenjie1024 thanks again for the POC (sorry for being late as I was on vacation). I think I have a much better understanding on this now. Could you address the comments I had and update this PR?

@liurenjie1024
Copy link
Contributor Author

@sunchao Sure, will fix it soon

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@liurenjie1024
Copy link
Contributor Author

@sunchao Fixed comments. Please take a look when you got a chance.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 ! this overall looks good to me - just left some comments & questions. Also could you check the CI failure?

})?;

// Fill spaces in column data with default values
let mut data_pos = data_read;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current code, if data_read is 0 you may still enter the loop right? and same thing will happen?

data_buf.swap(level_pos - 1, data_pos - 1);
data_pos -= 1;
} else {
data_buf[level_pos - 1] = T::T::default();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, you mean some latter processing relies on the fact that we should default value as garbage value?

@liurenjie1024
Copy link
Contributor Author

@sunchao Thanks for the review. Fixed comments. Will take a look at the ci error later.

@sunchao
Copy link
Member

sunchao commented Jul 30, 2019

Thanks @liurenjie1024 . The PR looks good to me except one minor thing: can you rename records_num to num_records in the RecordReader struct? After the CI issue is resolved I think the PR can be merged.

@liurenjie1024
Copy link
Contributor Author

@sunchao CI passed. I think the c++ build failure is unrelated.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sunchao sunchao closed this in 3112d34 Jul 30, 2019
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Aug 11, 2019
RecordReader reads logical records into memory, this is the prerequisite for ColumnReader.

Closes apache#4292 from liurenjie1024/arrow-4365 and squashes the following commits:

874bb23 <Renjie Liu> Fix ci
0c396c7 <Renjie Liu> Fix comments
138fabe <Renjie Liu> Fix one method
e279a00 <Renjie Liu> Fix comments
19db497 <Renjie Liu> fix build break
e6c4970 <Renjie Liu> Fix style
8295a28 <Renjie Liu> Use backtick instead of triple
e1da426 <Renjie Liu> Fix comments
5c8b6b3 <Renjie Liu> Fix format error
1b288a5 <Renjie Liu> Implement record reader.

Authored-by: Renjie Liu <liurenjie2008@gmail.com>
Signed-off-by: Chao Sun <sunchao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants