Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

mrshu · 2023-12-04T13:45:36Z

We've ran into this in our production, so I thought it might be interesting for everyone else to see as well.

Let us assume the following materialized view:

{{ config(
    materialized='table',
    table_type='iceberg',
    format='parquet',
    partitioned_by=['bucket(user_id, 5)'],
    table_properties={
     'optimize_rewrite_delete_file_threshold': '2'
     }
) }}

select 'A'          as user_id,
       'pi'         as name,
       'active'     as status,
       17.89        as cost,
       1            as quantity,
       100000000    as quantity_big,
       current_date as my_date

Assume further that the materialization of the above model fails with TOO_MANY_OPEN_PARTITIONS (improbable with such a simple query, but please bear with me).

As per the current version of the code, the following code will get executed:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/incremental/helpers.sql#L71-L73

The batch_incremental_insert function will call the get_partition_batches function in turn:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/incremental/helpers.sql#L25-L27

Which will then do a specific select distinct with the contents of partitioned_by as the parameter:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/helpers/get_partition_batches.sql#L7-L13

In our specific situation this leads to a query on the order of

select distinct bucket(user_id, 5) from (

select 'A'          as user_id,
       'pi'         as name,
       'active'     as status,
       17.89        as cost,
       1            as quantity,
       100000000    as quantity_big,
       current_date as my_date
) order by bucket(user_id, 5);

This query, however, will necessarily fail, as the bucket function is not defined in Athena/Presto(/Trino) and (to the best of my knowledge) actually originates from Iceberg/Spark (https://iceberg.apache.org/docs/latest/spark-ddl/#partitioned-by). The error will be on the order of (here is to hoping this will help others find this via a search engine in the future):

FUNCTION_NOT_FOUND: line 2:21: Function 'bucket' not registered

I am not sure what a proper fix of something like this would be but as it currently is, when one runs into TOO_MANY_OPEN_PARTITIONS with a model that uses the bucket function in its partitions, there is no way of getting it to work right now. Note that this was introduced in #360, but the queries that would cause this did not "stand a chance" anyway, so this certainly does not mean that there was a bug introduced in #360 -- on the contrary, it certainly help a large class of queries to pass despite the TOO_MANY_OPEN_PARTITIONS issue.

I hope this might help some fellow dbt-athena user in the past. If anyone has any ideas as to what would a fix look like, please do not hesitate to suggest them -- I would be very eager to learn about them.

Thanks!

The text was updated successfully, but these errors were encountered:

svdimchenko · 2023-12-08T16:22:29Z

@mrshu thanks for pointing that out. This case is really not covered in dbt-athena. This can be fixed on stage where the query to get partitions list is generated. Can you help us with contributing such fix ?

svdimchenko · 2023-12-13T10:06:57Z

@mrshu @nicor88 seems we can use the following algorithm to reach correct partitions batches construction: https://iceberg.apache.org/spec/#partition-transforms. Will try to implement it with my colleague @sanromeo

mrshu · 2023-12-13T14:25:05Z

Hi @svdimchenko, sorry for my late reaction.

I would be very happy to contribute a fix for this, but I am not really sure how to go about it. Since you are trying to implement it, please feel free to keep me in the loop -- I would be happy to help with testing it.

Thanks!

sanromeo · 2023-12-28T11:57:06Z

Hi @mrshu!
I added fix and implementation for the issue you mentioned above, please feel free to test it and answer if everything is fine or something happens wrong ☺️

svdimchenko self-assigned this Dec 13, 2023

svdimchenko added enhancement New feature or request iceberg This issue concerns Iceberg integration labels Dec 13, 2023

sanromeo mentioned this issue Dec 28, 2023

feat: Implement support bucket function for more than 100 partitions #549

Merged

9 tasks

svdimchenko closed this as completed in #549 Jan 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

mrshu commented Dec 4, 2023 •

edited

Loading

svdimchenko commented Dec 8, 2023

svdimchenko commented Dec 13, 2023

mrshu commented Dec 13, 2023

sanromeo commented Dec 28, 2023

Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

Comments

mrshu commented Dec 4, 2023 • edited Loading

svdimchenko commented Dec 8, 2023

svdimchenko commented Dec 13, 2023

mrshu commented Dec 13, 2023

sanromeo commented Dec 28, 2023

mrshu commented Dec 4, 2023 •

edited

Loading