Skip to content

Improve performance of array_has #18181

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

While working on

@ianthetechie provided the following query which executes quite slowly:

CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';

CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';

WITH categories_arr AS (
    SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 500
)
SELECT COUNT(*)
    FROM places p
    WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));

While the regression in #18070 was fixed, there is a lot of room to improve this query's performance still

To reproduce, download slow_array_has.zip and run:

datafusion-cli -f repro.sql

60% of the overall query time is spent in array_has as can be seen by this quick profile

Image

Describe the solution you'd like

Make array_has go faster

Describe alternatives you've considered

@jayzhan211 has some ideas here in

Additional context

No response

Metadata

Metadata

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions