-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate remaining parquet config options into ConfigOptions #3885
Conversation
There are still some tests and code updates required for this PR |
23a6614
to
2addab3
Compare
2addab3
to
8180c3a
Compare
@@ -52,59 +57,70 @@ pub const DEFAULT_PARQUET_EXTENSION: &str = ".parquet"; | |||
/// The Apache Parquet `FileFormat` implementation | |||
#[derive(Debug)] | |||
pub struct ParquetFormat { | |||
enable_pruning: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here was one copy of (some) of the settings
@@ -1142,8 +1146,6 @@ pub struct SessionConfig { | |||
/// Should DataFusion repartition data using the partition keys to execute window functions in | |||
/// parallel using the provided `target_partitions` level | |||
pub repartition_windows: bool, | |||
/// Should DataFusion parquet reader using the predicate to prune data | |||
pub parquet_pruning: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is a second copy of one of the settings
@@ -168,56 +170,31 @@ pub struct ParquetReadOptions<'a> { | |||
pub file_extension: &'a str, | |||
/// Partition Columns | |||
pub table_partition_cols: Vec<String>, | |||
/// Should DataFusion parquet reader use the predicate to prune data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is a third copy, again of some of the settings
@@ -702,8 +702,11 @@ async fn show_all() { | |||
"| datafusion.execution.coalesce_batches | true |", | |||
"| datafusion.execution.coalesce_target_batch_size | 4096 |", | |||
"| datafusion.execution.parquet.enable_page_index | false |", | |||
"| datafusion.execution.parquet.metadata_size_hint | NULL |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaults are now even clearer!
@@ -141,6 +139,13 @@ async fn schema_merge_can_preserve_metadata() { | |||
let table_path = table_dir.to_str().unwrap().to_string(); | |||
|
|||
let ctx = SessionContext::new(); | |||
|
|||
// explicitly disable schema clearing | |||
ctx.config_options() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we will have a nicer API for this: #3908
datafusion/core/src/config.rs
Outdated
bytes of the parquet file optimistically. If not specified, two read are required: \ | ||
One read to fetch the 8-byte parquet footer and \ | ||
another to fetch the metadata length encoded in the footer.", | ||
DataType::Boolean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right -- it should be u64 -- fixed in e5e3be0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really struggling to review this PR as I'm honestly lost trying to understand the design of the config system...
We used to have just RuntimeConfig
/ RuntimeEnv
for process global settings, SessionConfig
/SessionContext
for session local settings i.e. planning, and TaskContext
for physical plan execution settings
I'm not entirely clear on how ConfigOptions
fits into this, what scope it has, etc... It appears to have been introduced by @andygrove in #2754 with the intention articulated on #2756 to use it for SessionContext
Plumbing this down into the physical plans therefore feels a bit like it is coupling together parts of the system that shouldn't be coupled together? I would have expected the physical operators to only be exposed to TaskContext
, with SessionContext
solely used for planning.
I'm not saying the approach in this PR is wrong, but I don't know what correct looks like 😅
} | ||
} | ||
} | ||
|
||
impl<'a> ParquetReadOptions<'a> { | ||
/// Specify parquet_pruning | ||
pub fn parquet_pruning(mut self, parquet_pruning: bool) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this represent a functionality loss, previously you could configure these options on per-datasource basis. Now they are located on ConfigOptions which is session global?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say "kind of" -- previously you could configure the options on a per-datasource basis but depending on exactly what codepath you used and what other options were set your settings might or might not get overridden.
Yes, I agree it is very confusing. Would you find it less confusing if the physical planner got a
I think the TaskContext has So in that view of things your expectation "expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning." does hold In my mind |
Here is my plan: I will write up a ticket explaining the current state of affairs in terms of config (and I will make a diagram) and we can use that to come to consensus on where we want to head. Given the current mess of parquet options (has 3 other places to place config values in addition to the SessonConfig / ConfigOptions I think this PR is an improvement anyways. We'll see if @andygrove has any thoughts |
Marking as a draft until I get a writeup and some consensus |
Here is a writeup of a way to improve the configuration situation somewhat #4349 |
One read to fetch the 8-byte parquet footer and \ | ||
another to fetch the metadata length encoded in the footer.", | ||
DataType::UInt64, | ||
ScalarValue::UInt64(None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to default this something? Back in the day we would default to reading the last 64k to try and capture the entire footer in a single read which seems like a sensible default (especially for object storage where the cost of an additional read is very expensive relative to reading a bit more data in the first read)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64K seems reasonable to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon more thought I would like to propose we do the "consolidate config code" and "change default value" in separate PRs (which I think will be easier to see / review once the configuration is consolidated)
FWIW My plan for this PR is:
|
Updated version in #4427 |
Draft as it builds on #3822Which issue does this PR close?
Closes #3821
Rationale for this change
It turns out options for reading parquet files were able to be set (and possibly) overridden by no less than three different structures! This is confusing, to say the least.
This is especially important to us as we move to use more and more of the parquet code directly in IOx (e.g. https://github.com/influxdata/influxdb_iox/issues/5887 and https://github.com/influxdata/influxdb_iox/issues/5897)
This also helps towards #3887
What changes are included in this PR?
Are there any user-facing changes?
The main change is that now the parquet reader settings are session wide.
Previously, depending on which of the APIs was used to create / register / run parquet, the settings might change if you change the session config or they might have been a snapshot based on when you registered the reader