-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
There's an edge case in the present max_distinct_count algorithm, whereby there can be an attempt to subtract a potentially larger number of total nulls from a inexact smaller number of total rows to get the distinct values
https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L957-L959
This leads to a panic with attempt to subtract with overflow.
To Reproduce
Extract the three parquet files from files.zip needed for the repro. These were generated using DuckDB with SF=0.01 for TPC-DS benchamrks. The example below is a minimal repro for an issue observed for query 24 from that benchmark.
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
let file_format = ParquetFormat::default().with_enable_pruning(Some(true));
let listing_options = ListingOptions::new(Arc::new(file_format))
.with_file_extension(FileType::PARQUET.get_ext());
ctx.register_listing_table(
"store",
&format!("/path/to/store.parquet"),
listing_options.clone(),
None,
None,
)
.await?;
ctx.register_listing_table(
"store_sales",
&format!("/path/to/store_sales.parquet"),
listing_options.clone(),
None,
None,
)
.await?;
ctx.register_listing_table(
"customer",
&format!("/path/to/customer.parquet"),
listing_options,
None,
None,
)
.await?;
let df = ctx
.sql(
"SELECT c_last_name,
c_first_name,
s_store_name,
s_state
FROM store_sales,
store,
customer
WHERE ss_customer_sk = c_customer_sk
AND ss_store_sk = s_store_sk
AND s_market_id=8",
)
.await?;
// print the results
df.show().await?;
Ok(())
}The above code panics with:
thread 'main' panicked at datafusion/physical-plan/src/joins/utils.rs:958:40:
attempt to subtract with overflowNote that you can get a repro with the cli by appending DATAFUSION_EXECUTION_COLLECT_STATISTICS=true DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 to cargo run
Expected behavior
The example shouldn't panic, but instead return an empty result.
Additional context
As for the question how does this situation even occur in the first place, from my brief investigation I'm seeing that:
- The
FilterExecfor the filtering on thestoretable predicate returnsInexact(0)as the number of rows for it's output statistics, since the predicate refutes all the input rows (in the case ofstoreabove there's only a single row withs_market_idequals to 2). - When joining
storeandstore_salesthe join cardinality estimate is 0 due to the above filtering, but the column statistic are nonetheless merged as is (meaning an exact null count for thestore_salescolumns is inherited)
https://github.com/apache/arrow-datafusion/blob/bee7136a04c60a2c06caa630cf1b72f32f7dc574/datafusion/physical-plan/src/joins/utils.rs#L848-L859 - Finally during merging with
customerthe statistics from step 2 enters into play, and when it reachesmax_distinct_countit hits thenum_rowsbeingInexact(0)butstats.null_countbeing exact and greater than zero edge case.