-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
While working on
@ianthetechie provided the following query which executes quite slowly:
CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';
CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
WITH categories_arr AS (
SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 500
)
SELECT COUNT(*)
FROM places p
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));While the regression in #18070 was fixed, there is a lot of room to improve this query's performance still
To reproduce, download slow_array_has.zip and run:
datafusion-cli -f repro.sql60% of the overall query time is spent in array_has as can be seen by this quick profile
Describe the solution you'd like
Make array_has go faster
Describe alternatives you've considered
@jayzhan211 has some ideas here in
Additional context
No response
joroKr21 and suibianwanwank
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request