Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate remaining parquet config options into ConfigOptions #3885

Closed
wants to merge 3 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 18, 2022

Draft as it builds on #3822

Which issue does this PR close?

Closes #3821

Rationale for this change

  1. Make it easier for people to see what config options are available will make it more likely they are used
  2. The more mechanisms that configuration is supplied, the more likely it to confuse people

It turns out options for reading parquet files were able to be set (and possibly) overridden by no less than three different structures! This is confusing, to say the least.

This is especially important to us as we move to use more and more of the parquet code directly in IOx (e.g. https://github.com/influxdata/influxdb_iox/issues/5887 and https://github.com/influxdata/influxdb_iox/issues/5897)

This also helps towards #3887

What changes are included in this PR?

  1. move metadata_size_hint, enable_pruning, and merge_schema_metadata to new config options
  2. Remove redundancy and config overrides

Are there any user-facing changes?

The main change is that now the parquet reader settings are session wide.

Previously, depending on which of the APIs was used to create / register / run parquet, the settings might change if you change the session config or they might have been a snapshot based on when you registered the reader

@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 18, 2022
@alamb
Copy link
Contributor Author

alamb commented Oct 18, 2022

There are still some tests and code updates required for this PR

@alamb alamb marked this pull request as draft October 18, 2022 17:30
@github-actions github-actions bot added the core Core DataFusion crate label Oct 18, 2022
@@ -52,59 +57,70 @@ pub const DEFAULT_PARQUET_EXTENSION: &str = ".parquet";
/// The Apache Parquet `FileFormat` implementation
#[derive(Debug)]
pub struct ParquetFormat {
enable_pruning: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here was one copy of (some) of the settings

@@ -1142,8 +1146,6 @@ pub struct SessionConfig {
/// Should DataFusion repartition data using the partition keys to execute window functions in
/// parallel using the provided `target_partitions` level
pub repartition_windows: bool,
/// Should DataFusion parquet reader using the predicate to prune data
pub parquet_pruning: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is a second copy of one of the settings

@@ -168,56 +170,31 @@ pub struct ParquetReadOptions<'a> {
pub file_extension: &'a str,
/// Partition Columns
pub table_partition_cols: Vec<String>,
/// Should DataFusion parquet reader use the predicate to prune data,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is a third copy, again of some of the settings

@@ -702,8 +702,11 @@ async fn show_all() {
"| datafusion.execution.coalesce_batches | true |",
"| datafusion.execution.coalesce_target_batch_size | 4096 |",
"| datafusion.execution.parquet.enable_page_index | false |",
"| datafusion.execution.parquet.metadata_size_hint | NULL |",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaults are now even clearer!

@@ -141,6 +139,13 @@ async fn schema_merge_can_preserve_metadata() {
let table_path = table_dir.to_str().unwrap().to_string();

let ctx = SessionContext::new();

// explicitly disable schema clearing
ctx.config_options()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we will have a nicer API for this: #3908

@alamb alamb marked this pull request as ready for review October 20, 2022 16:53
bytes of the parquet file optimistically. If not specified, two read are required: \
One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::Boolean,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a boolean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- it should be u64 -- fixed in e5e3be0

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really struggling to review this PR as I'm honestly lost trying to understand the design of the config system...

We used to have just RuntimeConfig/ RuntimeEnv for process global settings, SessionConfig/SessionContext for session local settings i.e. planning, and TaskContext for physical plan execution settings

I'm not entirely clear on how ConfigOptions fits into this, what scope it has, etc... It appears to have been introduced by @andygrove in #2754 with the intention articulated on #2756 to use it for SessionContext

Plumbing this down into the physical plans therefore feels a bit like it is coupling together parts of the system that shouldn't be coupled together? I would have expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning.

I'm not saying the approach in this PR is wrong, but I don't know what correct looks like 😅

}
}
}

impl<'a> ParquetReadOptions<'a> {
/// Specify parquet_pruning
pub fn parquet_pruning(mut self, parquet_pruning: bool) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this represent a functionality loss, previously you could configure these options on per-datasource basis. Now they are located on ConfigOptions which is session global?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "kind of" -- previously you could configure the options on a per-datasource basis but depending on exactly what codepath you used and what other options were set your settings might or might not get overridden.

@alamb
Copy link
Contributor Author

alamb commented Oct 24, 2022

I'm really struggling to review this PR as I'm honestly lost trying to understand the design of the config system...

Yes, I agree it is very confusing.

Would you find it less confusing if the physical planner got a SessionConfig rather than ConfigOptions?

Plumbing this down into the physical plans therefore feels a bit like it is coupling together parts of the system that shouldn't be coupled together? I would have expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning.

I think the SessionContext has a SessionState which has a SessionConfig which has a ConfigOptions 🤯

TaskContext has TaskProperties which is either a SessionConfig or some k=v pairs

So in that view of things your expectation "expected the physical operators to only be exposed to TaskContext, with SessionContext solely used for planning." does hold

In my mind ConfigOptions and SessionConfig serve the same purpose and eventually they will be unified (using the implementation of ConfigOptions.

@alamb
Copy link
Contributor Author

alamb commented Oct 24, 2022

Here is my plan: I will write up a ticket explaining the current state of affairs in terms of config (and I will make a diagram) and we can use that to come to consensus on where we want to head.

Given the current mess of parquet options (has 3 other places to place config values in addition to the SessonConfig / ConfigOptions I think this PR is an improvement anyways.

We'll see if @andygrove has any thoughts

@alamb alamb marked this pull request as draft October 25, 2022 20:06
@alamb
Copy link
Contributor Author

alamb commented Oct 25, 2022

Marking as a draft until I get a writeup and some consensus

@alamb alamb mentioned this pull request Nov 2, 2022
@alamb
Copy link
Contributor Author

alamb commented Nov 23, 2022

Here is a writeup of a way to improve the configuration situation somewhat #4349

One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::UInt64,
ScalarValue::UInt64(None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to default this something? Back in the day we would default to reading the last 64k to try and capture the entire footer in a single read which seems like a sensible default (especially for object storage where the cost of an additional read is very expensive relative to reading a bit more data in the first read)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64K seems reasonable to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon more thought I would like to propose we do the "consolidate config code" and "change default value" in separate PRs (which I think will be easier to see / review once the configuration is consolidated)

@alamb
Copy link
Contributor Author

alamb commented Nov 26, 2022

FWIW My plan for this PR is:

  1. To leave any the current places to configure parquet settings per ParquetExec as overrrides (Option<val>)
  2. Expose remaining options via ConfigOptions (and make them visible, etc)

@alamb
Copy link
Contributor Author

alamb commented Nov 29, 2022

Updated version in #4427

@alamb alamb closed this Nov 29, 2022
@alamb alamb deleted the alamb/consolidate_parquet branch August 8, 2023 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow configuring parquet filter pushdown dynamically
3 participants