Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select count(*) from '*.parquet' scans all parquet file recursively including subdirectory #8524

Closed
zhangxffff opened this issue Dec 13, 2023 · 3 comments · Fixed by #8565
Closed
Labels
bug Something isn't working

Comments

@zhangxffff
Copy link
Contributor

Describe the bug

I find select count(*) from '*.parquet' not only scan the parquet file in current directory, but it also recursively scan all the parquet file in subdirectory. I wonder is this behavior by design or a bug.

To Reproduce

I tried with three parquet file, and two of them are in subdir.

$$ tree
.
├── subdir
│   ├── file1.parquet
│   └── file2.parquet
└── users.parquet

2 directories, 3 files

users.parquet has 2 record, file1.parquet has 1 record and file2.parquet has 1 record.
select count(*) from '*.parquet' get 4

$$ datafusion-cli -c "select count(*) from '*.parquet'"
DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 4        |
+----------+
1 row in set. Query took 0.058 seconds.

$$ datafusion-cli -c "select count(*) from 'users.parquet'"
DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 2        |
+----------+
1 row in set. Query took 0.002 seconds.

$$ datafusion-cli -c "select count(*) from 'subdir/*.parquet'"
DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 2        |
+----------+
1 row in set. Query took 0.002 seconds.

Expected behavior

I try same query in duckdb, and duckdb only scan parquet file in current directory

$$ duckdb -c "select count(*) from '*.parquet'"
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            2 │
└──────────────┘

Additional context

No response

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

I think the reason this happens is that the ListingTable does ObjectStore::list which finds files in all subdirectories

I wonder is this behavior by design or a bug.

As I understand it, DataFusion is trying to model the behavior of "Hive PartitionedTables" -- so to answer this question I think we need to research what Hive does in this case

@zhangxffff
Copy link
Contributor Author

I think the reason this happens is that the ListingTable does ObjectStore::list which finds files in all subdirectories

I wonder is this behavior by design or a bug.

As I understand it, DataFusion is trying to model the behavior of "Hive PartitionedTables" -- so to answer this question I think we need to research what Hive does in this case

I tried with hive external table stores as parquet, it seems that hive external table also do not scan parquet file in subdirectory
image

as show in this picture, when location is hdfs:///user/hive/warehouse/zxf_test/, there is no data in external table, when localtion is hdfs:///user/hive/warehouse/zxf_test/subdir, external table has two records from two parquet file.
image

I also tried partitioned external table.
image

After create table, there is no data.
image

After specify location of partition pt1, we can get data from hdfs:///user/hive/warehouse/zxf_test_pt/pt1
image

After also specfy location of partition pt2, we can get data from both hdfs:///user/hive/warehouse/zxf_test_pt/pt1 and hdfs:///user/hive/warehouse/zxf_test_pt/pt2
image

If I copy a subdirectory with parquet file into hive partition directory, hive report a java.io.IOException:java.io.IOException: Not a file
image
image

So it seems that hive also do not scan parquet file in the subdirectoy. for hive partitioned table, user should specify the directory of each partition, and there should not contains any subdirectory.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

Makes sense to me. THank you for the research @zhangxffff

If we wish to change this behavior, perhaps we can add a configuration parameter to have the old or new behavior (defaulting to the new behavior)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants