I've attached two parquet files. Both files contain a single column with 131072 rows, generated from Arrow with a single record batch. The fsb16.parquet file contains a column of type FixedSizeBinary(16); the ints.parquet file contains a column of type Int64.
If I do an inner join, the query returns really quickly:
❯ create external table t0 stored as parquet location 'ints.parquet';
❯ select * from t0 inner join t0 as t1 on t0.ints = t1.ints;
+--------+--------+
...[snip]...
+--------+--------+
131072 rows in set. Query took 0.530 seconds.
But if I do the same with the FixedSizeBinary(16) file, it takes a very long time and runs up a huge working set (I'm seeing 170GB+ on my computer). In much of my testing it runs out of memory and dies, but when it does finish it takes ~6 minutes (compared to 0.5s with the Int64 column):
❯ create external table t0 stored as parquet location 'fsb16.parquet';
❯ select * from t0 inner join t0 as t1 on t0.journey_id = t1.journey_id;
+----------------------------------+----------------------------------+
...[snip]...
+----------------------------------+----------------------------------+
358946 rows in set. Query took 356.370 seconds.
Also, I think the results are wrong: the result set should only have 131072 rows, not 358946.
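As a sanity check on the expected row count: an inner self-join on a key column returns, for each distinct key, the square of the number of times that key appears. The sketch below is illustrative only and uses the Python standard library. It assumes the 131072 key values are unique, which the report implies but does not state; the helper name `self_join_row_count` and the generated 16-byte keys are hypothetical stand-ins, not the contents of fsb16.parquet.

```python
from collections import Counter


def self_join_row_count(keys):
    """Rows produced by `t INNER JOIN t ON key = key` for one key column."""
    counts = Counter(keys)
    # A key that appears c times matches itself c * c times in a self-join.
    return sum(c * c for c in counts.values())


# Hypothetical stand-in for the FixedSizeBinary(16) column: unique 16-byte keys.
unique_keys = [i.to_bytes(16, "big") for i in range(131072)]
print(self_join_row_count(unique_keys))  # 131072

# With a single duplicated key the count grows past the number of input rows.
keys_with_dupes = unique_keys[:-1] + [unique_keys[0]]
print(self_join_row_count(keys_with_dupes))  # 131074
```

If the real column does contain repeated values, a count above 131072 would be expected, so comparing the number of distinct keys with the number of rows in the file is a quick way to tell a data issue from a join bug.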
One thing I should mention is that I am testing with this patch applied to Arrow because otherwise it's significantly slower in the FixedSizeBinary(16) case: apache/arrow-rs#3793
maxburke changed the title from "Performance / correctness issues with inner joins on columns of type FixedSizeBinary(16)" to "Performance issue with inner joins on columns of type FixedSizeBinary(16)" on Mar 3, 2023
Here is the plan for the int64 query:
And the FixedSizeBinary(16) query plan:
fsb16.parquet.gz
ints.parquet.gz