Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions #529

Closed
mrshu opened this issue Dec 4, 2023 · 4 comments · Fixed by #549
Closed
Assignees
Labels
enhancement New feature or request iceberg This issue concerns Iceberg integration

Comments

@mrshu
Copy link

mrshu commented Dec 4, 2023

We've ran into this in our production, so I thought it might be interesting for everyone else to see as well.

Let us assume the following materialized view:

{{ config(
    materialized='table',
    table_type='iceberg',
    format='parquet',
    partitioned_by=['bucket(user_id, 5)'],
    table_properties={
     'optimize_rewrite_delete_file_threshold': '2'
     }
) }}

select 'A'          as user_id,
       'pi'         as name,
       'active'     as status,
       17.89        as cost,
       1            as quantity,
       100000000    as quantity_big,
       current_date as my_date

Assume further that the materialization of the above model fails with TOO_MANY_OPEN_PARTITIONS (improbable with such a simple query, but please bear with me).

As per the current version of the code, the following code will get executed:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/incremental/helpers.sql#L71-L73

The batch_incremental_insert function will call the get_partition_batches function in turn:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/incremental/helpers.sql#L25-L27

Which will then do a specific select distinct with the contents of partitioned_by as the parameter:

https://github.com/dbt-athena/dbt-athena/blob/bae9fce151317b235eba2d497d9e16a1aaa7975c/dbt/include/athena/macros/materializations/models/helpers/get_partition_batches.sql#L7-L13

In our specific situation this leads to a query on the order of

select distinct bucket(user_id, 5) from (

select 'A'          as user_id,
       'pi'         as name,
       'active'     as status,
       17.89        as cost,
       1            as quantity,
       100000000    as quantity_big,
       current_date as my_date
) order by bucket(user_id, 5);

This query, however, will necessarily fail, as the bucket function is not defined in Athena/Presto(/Trino) and (to the best of my knowledge) actually originates from Iceberg/Spark (https://iceberg.apache.org/docs/latest/spark-ddl/#partitioned-by). The error will be on the order of (here is to hoping this will help others find this via a search engine in the future):

FUNCTION_NOT_FOUND: line 2:21: Function 'bucket' not registered

I am not sure what a proper fix of something like this would be but as it currently is, when one runs into TOO_MANY_OPEN_PARTITIONS with a model that uses the bucket function in its partitions, there is no way of getting it to work right now. Note that this was introduced in #360, but the queries that would cause this did not "stand a chance" anyway, so this certainly does not mean that there was a bug introduced in #360 -- on the contrary, it certainly help a large class of queries to pass despite the TOO_MANY_OPEN_PARTITIONS issue.

I hope this might help some fellow dbt-athena user in the past. If anyone has any ideas as to what would a fix look like, please do not hesitate to suggest them -- I would be very eager to learn about them.

Thanks!

@svdimchenko
Copy link
Contributor

@mrshu thanks for pointing that out. This case is really not covered in dbt-athena. This can be fixed on stage where the query to get partitions list is generated. Can you help us with contributing such fix ?

@svdimchenko
Copy link
Contributor

@mrshu @nicor88 seems we can use the following algorithm to reach correct partitions batches construction: https://iceberg.apache.org/spec/#partition-transforms. Will try to implement it with my colleague @sanromeo

@svdimchenko svdimchenko self-assigned this Dec 13, 2023
@svdimchenko svdimchenko added enhancement New feature or request iceberg This issue concerns Iceberg integration labels Dec 13, 2023
@mrshu
Copy link
Author

mrshu commented Dec 13, 2023

Hi @svdimchenko, sorry for my late reaction.

I would be very happy to contribute a fix for this, but I am not really sure how to go about it. Since you are trying to implement it, please feel free to keep me in the loop -- I would be happy to help with testing it.

Thanks!

@sanromeo
Copy link
Contributor

Hi @mrshu!
I added fix and implementation for the issue you mentioned above, please feel free to test it and answer if everything is fine or something happens wrong ☺️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request iceberg This issue concerns Iceberg integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants