General virtual columns support + row numbers as a first use-case #8715
Open
vustef wants to merge 50 commits into apache:main from vustef:feature/parquet-virtual-row-numbers
Commits (50):

- f93d36e Add support for file row numbers in Parquet readers (jkylling)
- e485c0b Add Apache license header to row_number.rs (jkylling)
- 2a62009 Run cargo format (jkylling)
- fb5126f Change with_row_number_column to take impl Into<String> (jkylling)
- 5350728 Change Option<String> -> Option<&str> in build_array_reader (jkylling)
- 188f350 Replace ParquetError::RowGroupMetaDataMissingRowNumber with General (jkylling)
- 37a9d83 Split test_create_array_reader test into two (jkylling)
- 41e38fe first_row_number -> first_row_index (jkylling)
- 1a1e6b6 Simplify RowNumberReader with iterators (jkylling)
- bcad87f Merge remote-tracking branch 'origin/main' into feature/parquet-reade… (vustef)
- 89c1fd1 add parquet-testing change from the merge (vustef)
- b0d53d0 Fix test_arrow_reader_all_columns (vustef)
- 094ae81 Fix first_row_number (vustef)
- a5858df Rename to first_row_index consistently, remove Option. (vustef)
- 5e7d9a1 revert parquet-testing update (vustef)
- 54c22c6 Fix baselines in file::metadata::tests::test_memory_size (vustef)
- f05d470 Fix encryption metadata and async tests. Those features and default f… (vustef)
- 11e4f39 RowNumber extension type (vustef)
- d02c977 using supplied_schema works (vustef)
- 6fecc17 Don't modify parsing of parquet schema, virtual columns can only be a… (vustef)
- 1414421 Reworked with_virtual_columns in options (vustef)
- 07eb467 switch to ref to slice; cleanup with_row_number_columns; async tests … (vustef)
- af0e0f9 Bring back optionality to first_row_index, for future consideration w… (vustef)
- 8bccd22 Reexport (vustef)
- 65679ba reexport all within virtual_type (vustef)
- 968d461 pub mod virtual_type skipping experimental schema (vustef)
- 6144967 Switch back to `virtual_type::*` for now; fix warnings on cargo test (vustef)
- 3af3ad7 Fix `projected_fields` assertion in async reader (vustef)
- fad0ea1 common virtual column struct (vustef)
- ca6c7a6 assert that column is virtual (vustef)
- da9245d don't change pub API (vustef)
- 031c6d5 complex_schema rename (vustef)
- 079a78d passing docstring test (vustef)
- f2a4f45 Pass parquet metadata to array reader builder (vustef)
- 3933d8e Add virtual fields outside of the visitor (vustef)
- e5449e1 use parquet.virtual instead of arrow.virtual (vustef)
- a2c55dc more struct based approach to virtual type reuse (vustef)
- 688ce7b Switch to directly implementing ExtensionType for RowNumber, no commo… (vustef)
- 8e7f668 Use FieldRef (vustef)
- 3aeced1 row number virtual_prefix sharing (vustef)
- 31679f1 RowNumber instead of RowNumber::default() (vustef)
- 83a20c6 Default ordinals (vustef)
- f22e9f5 Merge branch 'main' of github.com:apache/arrow-rs into feature/parque… (vustef)
- 3e3b90f merge fixes (vustef)
- 5db9113 Fix example (vustef)
- b00373b cargo fmt (vustef)
- 6651017 fix infinite loop (vustef)
- 5ff1cc9 cargo fmt -p parquet ... (vustef)
- 8da925c Fix clippy too (vustef)
- 40db3d6 fix doctest too (vustef)
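Several commit messages above mention a per-row-group `first_row_index` and a `RowNumberReader` built from iterators. The PR's actual implementation is not shown in this excerpt, so the following is only a minimal sketch of the general idea. It assumes that row numbers are zero-based positions within the file and that each row group records its row count. `RowGroup`, `first_row_indexes`, and `row_numbers` are hypothetical names for illustration, not parquet crate APIs.

```rust
/// Hypothetical stand-in for per-row-group metadata; only the row count matters here.
struct RowGroup {
    num_rows: u64,
}

/// Computes the index of the first row of each row group within the file,
/// as a running sum of the row counts of the preceding row groups.
fn first_row_indexes(groups: &[RowGroup]) -> Vec<u64> {
    groups
        .iter()
        .scan(0u64, |next, g| {
            let first = *next;
            *next += g.num_rows;
            Some(first)
        })
        .collect()
}

/// Yields file-level row numbers for the selected row groups, in the order given.
fn row_numbers<'a>(
    groups: &'a [RowGroup],
    selected: &'a [usize],
) -> impl Iterator<Item = u64> + 'a {
    let firsts = first_row_indexes(groups);
    selected.iter().flat_map(move |&i| {
        let start = firsts[i];
        // Each selected row group contributes a contiguous range of row numbers.
        start..start + groups[i].num_rows
    })
}
```

Under these assumptions, skipping a row group does not shift the numbers of later groups, because each group's starting index depends only on the row counts that precede it in the file.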
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -561,6 +561,7 @@ mod tests {` |
| 561 | 561 | `schema,` |
| 562 | 562 | `ProjectionMask::all(),` |
| 563 | 563 | `file_metadata.key_value_metadata(),` |
| | 564 | `+ &[],` |
| 564 | 565 | `)` |
| 565 | 566 | `.unwrap();` |
| 566 | 567 | |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -27,6 +27,7 @@ use crate::arrow::record_reader::GenericRecordReader;` |
| 27 | 27 | `use crate::arrow::record_reader::buffer::ValuesBuffer;` |
| 28 | 28 | `use crate::column::page::PageIterator;` |
| 29 | 29 | `use crate::column::reader::decoder::ColumnValueDecoder;` |
| | 30 | `+ use crate::file::metadata::ParquetMetaData;` |
| 30 | 31 | `use crate::file::reader::{FilePageIterator, FileReader};` |
| 31 | 32 | |
| 32 | 33 | `mod builder;` |
| | | `@@ -42,12 +43,13 @@ mod map_array;` |
| 42 | 43 | `mod null_array;` |
| 43 | 44 | `mod primitive_array;` |
| 44 | 45 | `mod row_group_cache;` |
| | 46 | `+ mod row_number;` |
| 45 | 47 | `mod struct_array;` |
| 46 | 48 | |
| 47 | 49 | `#[cfg(test)]` |
| 48 | 50 | `mod test_util;` |
| 49 | 51 | |
| 50 | | `` - // Note that this crate is public under the `experimental` feature flag. `` |
| | 52 | `+ use crate::file::metadata::RowGroupMetaData;` |
| 51 | 53 | `pub use builder::{ArrayReaderBuilder, CacheOptions, CacheOptionsBuilder};` |
| 52 | 54 | `pub use byte_array::make_byte_array_reader;` |
| 53 | 55 | `pub use byte_array_dictionary::make_byte_array_dictionary_reader;` |

Review comment by vustef (Contributor, Author) on the removed `// Note that this crate is public ...` line: shouldn't remove this comment

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -139,17 +141,33 @@ pub trait RowGroups {` |
| 139 | 141 | `` /// Returns a [`PageIterator`] for all pages in the specified column chunk `` |
| 140 | 142 | `/// across all row groups in this collection.` |
| 141 | 143 | `fn column_chunks(&self, i: usize) -> Result<Box<dyn PageIterator>>;` |
| | 144 | `+` |
| | 145 | `+ /// Returns an iterator over the row groups in this collection` |
| | 146 | `+ fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMetaData> + '_>;` |
| | 147 | `+` |
| | 148 | `+ /// Returns the parquet metadata` |
| | 149 | `+ fn metadata(&self) -> &ParquetMetaData;` |
| 142 | 150 | `}` |
| 143 | 151 | |
| 144 | 152 | `impl RowGroups for Arc<dyn FileReader> {` |
| 145 | 153 | `fn num_rows(&self) -> usize {` |
| 146 | | `- self.metadata().file_metadata().num_rows() as usize` |
| | 154 | `+ FileReader::metadata(self.as_ref())` |
| | 155 | `+ .file_metadata()` |
| | 156 | `+ .num_rows() as usize` |
| 147 | 157 | `}` |
| 148 | 158 | |
| 149 | 159 | `fn column_chunks(&self, column_index: usize) -> Result<Box<dyn PageIterator>> {` |
| 150 | 160 | `let iterator = FilePageIterator::new(column_index, Arc::clone(self))?;` |
| 151 | 161 | `Ok(Box::new(iterator))` |
| 152 | 162 | `}` |
| | 163 | `+` |
| | 164 | `+ fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMetaData> + '_> {` |
| | 165 | `+ Box::new(FileReader::metadata(self.as_ref()).row_groups().iter())` |
| | 166 | `+ }` |
| | 167 | `+` |
| | 168 | `+ fn metadata(&self) -> &ParquetMetaData {` |
| | 169 | `+ FileReader::metadata(self.as_ref())` |
| | 170 | `+ }` |
| 153 | 171 | `}` |
| 154 | 172 | |
| 155 | 173 | `` /// Uses `record_reader` to read up to `batch_size` records from `pages` `` |
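The `num_rows` change in this hunk replaces `self.metadata()` with `FileReader::metadata(self.as_ref())`. A plausible reason (an inference from the diff, not something stated in this excerpt) is that `RowGroups` now declares its own `metadata` method, so an unqualified call inside `impl RowGroups for Arc<dyn FileReader>` would resolve to the new trait method rather than to the inner reader's. The sketch below reproduces that situation with two hypothetical traits, `Reader` and `Groups`, standing in for `FileReader` and `RowGroups`; none of these names come from the parquet crate.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a file-level reader trait (plays the role of `FileReader`).
trait Reader {
    fn metadata(&self) -> &str;
}

/// Hypothetical stand-in for a collection trait that gains a same-named method
/// (plays the role of `RowGroups`).
trait Groups {
    fn metadata(&self) -> &str;
}

struct InMemoryFile {
    meta: String,
}

impl Reader for InMemoryFile {
    fn metadata(&self) -> &str {
        &self.meta
    }
}

impl Groups for Arc<dyn Reader> {
    fn metadata(&self) -> &str {
        // Writing `self.metadata()` here would select `Groups::metadata` itself,
        // because it matches the `&Arc<dyn Reader>` receiver before auto-deref
        // reaches `dyn Reader`, so the call would recurse without end.
        // Naming the trait explicitly picks the inner reader's method instead.
        Reader::metadata(self.as_ref())
    }
}

/// Small helper that goes through the `Groups` impl above.
fn metadata_via_groups(reader: &Arc<dyn Reader>) -> String {
    Groups::metadata(reader).to_string()
}
```

Calling `metadata_via_groups` on an `Arc<dyn Reader>` wrapping an `InMemoryFile` returns the file's `meta` string, which shows that the qualified call reaches the inner implementation.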
I'm yet to merge latest main, which has push decoder changes...