Skip to content

Mismatched configuration between SessionContext::register_parquet and CREATE EXTERNAL TABLE #15908

@davisp

Description

@davisp

Describe the bug

I accidentally stumbled over a bit of a foot gun while playing around with apache/datafusion-benchmarks. Working on a TPC-H dataset with a scale factor of 25 and 5 partitions I ended inadvertently discovering that there's a subtle difference on statistics collection based on whether a Parquet table was created via SessionContext::register_parquet or with CREATE EXTERNAL TABLE statements.

With this dataset, using register_parquet results in a query running for ~5s on c5.4xlarge EC2 instance against s3 while never using more than 2GiB of RAM. If I switched to using CREATE EXTERNAL TABLE SQL statements, the same query (query 18 from the TPC-H queries) would OOM the process (the c5.4xlarge has 32GiB of RAM for reference).

I managed to chase this down to the fact that when using SessionContext::register_parquet, the ListingOptions::collect_stat flag is turned on. For CREATE EXTERNAL TABLE statements, it's pulled from the datafusion.execution.collect_statistics config entry which is false by default.

This felt awkward enough that I thought to file an issue for discussion, but I'm not entirely certain on what the outcome might be other than flipping the default datafusion.execution.collect_statistics to true or updating the register_parquet method to check the session configuration. Neither of which seem like super great options given the ages of those APIs.

To Reproduce

Register the same table twice, once using SessionContext::register_parquet and once using a CREATE EXTERNAL TABLE SQL statement. Printing out the TableProvider (via SessionConetxt::table_provider) shows the collect_stat option is true for the registered table, and false for the external table.

Expected behavior

I'd expect the configurations to be the same.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions