-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new configuration item listing_table_ignore_subdirectory
#8565
Conversation
@@ -109,7 +109,7 @@ mod tests { | |||
.read_parquet( | |||
// it was reported that when a path contains // (two consecutive separator) no files were found | |||
// in this test, regardless of parquet_test_data() value, our path now contains a // | |||
format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add regex matching to directories?🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Asura7969 -- this code looks good to me, but I think we need some test coverage (I am not sure the test in this PR covers the changes)
Could you maybe create a sqllogictest test for this? So like write three files:
Like this
table/1.parquet
table/subdir/2.parquet
table/3.parquet
And then verify that with ignore_subdirectory
to true (default) only the rows from 1.parquet
and 3.parquet
appear in the output but when the setting is set to false then the rows from all three files are present?
There is an example of creating and reading parquet files in a sqllogictest in these two files:
https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/parquet.slt (👋 @hiltontj -- we are already using it!)
@@ -109,7 +109,7 @@ mod tests { | |||
.read_parquet( | |||
// it was reported that when a path contains // (two consecutive separator) no files were found | |||
// in this test, regardless of parquet_test_data() value, our path now contains a // | |||
format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this question
@@ -424,6 +433,13 @@ mod tests { | |||
let b = ListingTableUrl::parse("../bar/./foo/../baz").unwrap(); | |||
assert_eq!(a, b); | |||
assert!(a.prefix.as_ref().ends_with("bar/baz")); | |||
|
|||
let url = ListingTableUrl::parse("../foo/*.parquet").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am probably missing something here, but how does this test the new code? I don't see it passing in ignore_subdirectory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's really not obvious here (actually in ListingTableUrl.contains
), I would create a sqllogictest as you suggested
@@ -150,6 +150,7 @@ datafusion.execution.aggregate.scalar_update_factor 10 | |||
datafusion.execution.batch_size 8192 | |||
datafusion.execution.coalesce_batches true | |||
datafusion.execution.collect_statistics false | |||
datafusion.execution.ignore_subdirectory true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a name that gives some context about when the ignore_subdirectory
is actually used?
For example, maybe like listing_table_ignore_subdirectory
. Or maybe even it is time to create a whole new category of configuration for listing tables datafusion.listing_table.ignore_subdirectory
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used listing_table_ignore_subdirectory
--------CsvExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-11-13/timestamps.csv], [WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-12-13/timestamps.csv]]}, has_header=false | ||
|
||
query I | ||
select count(*) from partition_csv_table; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this test is testing -- in both cases the table has 4 rows (aka there is no data in a subdirectory to ignore, right)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reading a partitioned table, listing_table_ignore_subdirectory
is always equal to false, even if set to true,But this test seems a bit redundant, i will clean up
12 | ||
|
||
query I | ||
select count(*) from t2_with_subdirectory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is cool to see the different rows but I don't understand the need for all the different tables and explain plans
I think we can get coverage by simply creating the equivalent of t2_with_subdirectory
and showing that it returns 12 rows when
set datafusion.execution.listing_table_ignore_subdirectory = false;
And 9 when
set datafusion.execution.listing_table_ignore_subdirectory = true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion, I understand
# Conflicts: # datafusion/sqllogictest/test_files/parquet.slt
/// When scanning file paths, whether to ignore subdirectory files, | ||
/// ignored by default (true), when reading a partitioned table, | ||
/// `listing_table_ignore_subdirectory` is always equal to false, even if set to true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the description information, do you agree? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is trying to say 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## read partition table
CREATE EXTERNAL TABLE csv_with_timestamps (
name VARCHAR,
ts TIMESTAMP,
c_date DATE,
)
STORED AS CSV
PARTITIONED BY (c_date)
LOCATION '../core/tests/data/partitioned_table';
set datafusion.execution.listing_table_ignore_subdirectory = true;
select count(*) from partition_tbale; ## return 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it -- thank you -- I will propose a clarification in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Asura7969 -- this looks great
@@ -276,6 +276,39 @@ LIMIT 10; | |||
0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC")) | |||
0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC")) | |||
|
|||
# Test config listing_table_ignore_subdirectory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
ignore_subdirectory
listing_table_ignore_subdirectory
Which issue does this PR close?
Closes #8524.
Rationale for this change
Consistent behavior with duckdb and hive
What changes are included in this PR?
add config
listing_table_ignore_subdirectory
When scanning file paths, whether to ignore subdirectory files, ignored by default (true)
NOTE: When reading a partitioned table,
listing_table_ignore_subdirectory
is always equal to false, even if set to trueAre these changes tested?
add sqllogictest
Are there any user-facing changes?
ListingTableUrl::contains
method add parameters