WIP: add datafusion based parquet reader #312
@@ -264,7 +264,14 @@ mod tests {
};

let mut reader = ParquetSstReader::new(&sst_file_path, &store, &sst_reader_options);
assert_eq!(reader.meta_data().await.unwrap(), &sst_meta);
let sst_meta_readback = {
// size of SstMetaData is not what this file's size, so overwrite it
Add a `FIXME:` prefix.
use super::encoding::{self, ParquetDecoder};
Avoid the relative import path.
self.metadata_size_hint, self.cache_hit, self.cache_miss, self.metrics.bytes_scanned.value()
);
}
}
A newline is missing here.
fixed
common_types/src/time.rs
/// Creates expression like:
/// start <= time && time < end
pub fn df_expr(&self, column_name: impl AsRef<str>) -> Expr {
- pub fn df_expr(&self, column_name: impl AsRef<str>) -> Expr {
+ pub fn to_df_expr(&self, column_name: impl AsRef<str>) -> Expr {
fixed
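For readers following the `start <= time && time < end` doc comment discussed in this thread, here is a minimal sketch of that half-open range check using only the standard library. The `TimeRange` struct, its field names, and `to_predicate_string` are hypothetical stand-ins for illustration; the PR's actual method returns a datafusion `Expr`, which this sketch does not attempt to reproduce.

```rust
/// Hypothetical stand-in for the time range type in this PR.
/// Timestamps are plain `i64` values here (an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeRange {
    inclusive_start: i64,
    exclusive_end: i64,
}

impl TimeRange {
    /// Returns `None` when the start is after the end.
    pub fn new(inclusive_start: i64, exclusive_end: i64) -> Option<Self> {
        if inclusive_start <= exclusive_end {
            Some(Self {
                inclusive_start,
                exclusive_end,
            })
        } else {
            None
        }
    }

    /// Evaluates `start <= time && time < end` directly.
    pub fn contains(&self, time: i64) -> bool {
        self.inclusive_start <= time && time < self.exclusive_end
    }

    /// Renders the same predicate as text for the given column name.
    /// Illustration only: the real method builds an expression tree.
    pub fn to_predicate_string(&self, column_name: impl AsRef<str>) -> String {
        format!(
            "{} <= {col} AND {col} < {}",
            self.inclusive_start,
            self.exclusive_end,
            col = column_name.as_ref()
        )
    }
}
```

The end bound is exclusive, so adjacent ranges such as `[10, 20)` and `[20, 30)` never both match the same timestamp.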
In my local environment, performance regresses when adopting this new reader, so further investigation is required before merging this. Tested SST file: 104,022,899 rows. Old: 11709ms (this excludes the file read, since the file is read out in advance). Related issue: apache/arrow-rs#2916
Which issue does this PR close?
Closes #291
Rationale for this change
As described in #291. This PR also fixes the object store cache not working.
After #14, the parquet reader reads all bytes out, regardless of whether they are already cached.
What changes are included in this PR?
Replace the hand-rolled parquet reader with datafusion's ParquetExec, and add `CachableParquetFileReader` to implement a row-group level cache.
Are there any user-facing changes?
No
How is this change tested?
Using existing unit tests.
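
As a rough illustration of the row-group level cache mentioned under "What changes are included in this PR?", the sketch below caches raw bytes per (file, row group) and counts hits and misses. Every name in it (`RowGroupKey`, `RowGroupCache`, `get_or_load`) is hypothetical, and it is not the PR's `CachableParquetFileReader`; it only shows the general idea of skipping a read when the row group is already cached. The `cache_hit` and `cache_miss` counters echo the fields visible in the diff above, but how the PR updates them is an assumption.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache key: one entry per row group of one file.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RowGroupKey {
    pub file_path: String,
    pub row_group: usize,
}

/// Hypothetical in-memory cache with no eviction, for illustration only.
#[derive(Default)]
pub struct RowGroupCache {
    entries: HashMap<RowGroupKey, Arc<Vec<u8>>>,
    pub cache_hit: u64,
    pub cache_miss: u64,
}

impl RowGroupCache {
    /// Returns the cached bytes for `key`, calling `load` only on a miss.
    pub fn get_or_load<F>(&mut self, key: RowGroupKey, load: F) -> Arc<Vec<u8>>
    where
        F: FnOnce() -> Vec<u8>,
    {
        if let Some(bytes) = self.entries.get(&key) {
            // Hit: hand back a shared pointer without touching the store.
            self.cache_hit += 1;
            return Arc::clone(bytes);
        }
        // Miss: read the bytes once and remember them for later scans.
        self.cache_miss += 1;
        let bytes = Arc::new(load());
        self.entries.insert(key, Arc::clone(&bytes));
        bytes
    }
}
```

A real implementation would also need eviction and asynchronous reads, both of which this sketch leaves out.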