Add ChunkReader::get_bytes #2478

tustvold · 2022-08-17T11:21:08Z

Which issue does this PR close?

Part of #2463

Rationale for this change

This allows for zero-copy slicing when reading from bytes with an offset index.
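To illustrate what zero-copy slicing means here, the following is a minimal, std-only sketch. `SharedBytes` and the `GetBytes` trait are hypothetical stand-ins invented for this example; only the `get_bytes(start, length)` call shape is taken from the diff in this PR, and the real `ChunkReader` trait in the parquet crate may differ in its return and error types.

```rust
use std::convert::TryFrom;
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical reference-counted byte buffer (not the parquet crate's type).
/// Cloning or slicing it shares the same allocation instead of copying bytes.
#[derive(Clone, Debug)]
struct SharedBytes {
    data: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    /// Takes ownership of the bytes; this conversion may copy once,
    /// but every later slice is copy-free.
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: data.into(), range: 0..len }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

/// Hypothetical trait modelled on the `get_bytes(offset, page_len)` call in the diff.
trait GetBytes {
    fn get_bytes(&self, start: u64, length: usize) -> Result<SharedBytes, String>;
}

impl GetBytes for SharedBytes {
    fn get_bytes(&self, start: u64, length: usize) -> Result<SharedBytes, String> {
        let start = usize::try_from(start).map_err(|e| e.to_string())?;
        let begin = self.range.start.checked_add(start).ok_or("offset overflow")?;
        let end = begin.checked_add(length).ok_or("length overflow")?;
        if end > self.range.end {
            return Err(format!(
                "range {}..{} exceeds buffer end {}",
                begin, end, self.range.end
            ));
        }
        // Only the reference count and the range change; no bytes are copied.
        Ok(SharedBytes { data: Arc::clone(&self.data), range: begin..end })
    }
}
```

With a page offset and length taken from an offset index, a call such as `buffer.get_bytes(offset, page_len)` on this stand-in returns a view into the original allocation, which is the property the rationale refers to.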

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2022-08-17T11:22:37Z

parquet/src/file/serialized_reader.rs

-                            read
-                        ));
-                    }
+                    let buffer = self.reader.get_bytes(front.offset as u64, page_len)?;


We can only do this when we have an offset index, as we need to know the size of the page to read. Whether we could instead eagerly fetch the entire column chunk when there is no offset index is an open question that needs some investigation. Doing so would drastically simplify a lot of the code, since it would eliminate FileSource.

Edit: Updated #1163 (comment)
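The trade-off described in this comment can be sketched as a small decision function. Everything below is hypothetical and simplified for illustration: `PageLoc`, `ReadPlan`, `plan_read` and the `eager_chunk` flag are invented names, and the `WholeChunk` branch models the idea still under investigation, not behaviour this PR implements.

```rust
/// Hypothetical, simplified offset-index entry: where a page starts and how long it is.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PageLoc {
    offset: u64,
    len: usize,
}

/// Hypothetical read strategies for one page of a column chunk.
#[derive(Debug, PartialEq)]
enum ReadPlan {
    /// Page size is known up front, so a single exact ranged read suffices.
    ExactPage { start: u64, len: usize },
    /// Page size is unknown: read sequentially and learn each size from its page header.
    Sequential { start: u64 },
    /// Idea under discussion: fetch the whole column chunk in one read.
    WholeChunk { start: u64, len: usize },
}

/// Chooses a strategy. `page` is `Some` only when an offset index is available.
fn plan_read(
    page: Option<PageLoc>,
    chunk_start: u64,
    chunk_len: usize,
    eager_chunk: bool,
) -> ReadPlan {
    match page {
        Some(p) => ReadPlan::ExactPage { start: p.offset, len: p.len },
        None if eager_chunk => ReadPlan::WholeChunk { start: chunk_start, len: chunk_len },
        None => ReadPlan::Sequential { start: chunk_start },
    }
}
```

Only the `ExactPage` case maps onto a single `get_bytes(offset, page_len)` call like the one in the diff; the other two cases show why the page size has to come from somewhere else when no offset index exists.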

Ted-Jiang · 2022-08-18

Cool! For cases like calling skip_rows multiple times within one page, this should eagerly fetch the entire column into memory.

Ted-Jiang · 2022-08-18

If a page is small enough, say 1 MB, I guess there is no downside to eagerly fetching the entire column. Looking forward to the investigation results! 👍

alamb

Makes sense to me

ursabot · 2022-08-17T16:02:48Z

Benchmark runs are scheduled for baseline = 6d0ea90 and contender = e80656f. e80656f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Add ChunkReader::get_bytes

07fe5a2

tustvold mentioned this pull request Aug 17, 2022

Use Standard Library IO Abstractions in Parquet #1163

Closed

Use get_bytes in parse_metadata

96d47c7

github-actions bot added the parquet label (Changes to the parquet crate) Aug 17, 2022

tustvold added 2 commits August 17, 2022 14:38

Merge remote-tracking branch 'upstream/master' into chunk-reader-get-…

7340273

…bytes

Add get_bytes to ColumnChunkData

c25f1ca

alamb approved these changes Aug 17, 2022

tustvold merged commit e80656f into apache:master Aug 17, 2022
