-
Couldn't load subscription status.
- Fork 1.7k
Description
Describe the bug
I accidentally stumbled over a bit of a foot gun while playing around with apache/datafusion-benchmarks. Working on a TPC-H dataset with a scale factor of 25 and 5 partitions I ended inadvertently discovering that there's a subtle difference on statistics collection based on whether a Parquet table was created via SessionContext::register_parquet or with CREATE EXTERNAL TABLE statements.
With this dataset, using register_parquet results in a query running for ~5s on c5.4xlarge EC2 instance against s3 while never using more than 2GiB of RAM. If I switched to using CREATE EXTERNAL TABLE SQL statements, the same query (query 18 from the TPC-H queries) would OOM the process (the c5.4xlarge has 32GiB of RAM for reference).
I managed to chase this down to the fact that when using SessionContext::register_parquet, the ListingOptions::collect_stat flag is turned on. For CREATE EXTERNAL TABLE statements, it's pulled from the datafusion.execution.collect_statistics config entry which is false by default.
This felt awkward enough that I thought to file an issue for discussion, but I'm not entirely certain on what the outcome might be other than flipping the default datafusion.execution.collect_statistics to true or updating the register_parquet method to check the session configuration. Neither of which seem like super great options given the ages of those APIs.
To Reproduce
Register the same table twice, once using SessionContext::register_parquet and once using a CREATE EXTERNAL TABLE SQL statement. Printing out the TableProvider (via SessionConetxt::table_provider) shows the collect_stat option is true for the registered table, and false for the external table.
Expected behavior
I'd expect the configurations to be the same.
Additional context
No response