Partition Limitations #87
It's an Athena limitation, and it's value-based, so a table with more than 100 distinct partition values hits it. The workaround from AWS is here: https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html. There they create the table structure with a CTAS query that only writes 31 partitions (the month of January), then run a second query (an INSERT INTO) that writes the data for February, and so on. Basically they chunk the data into queries that each handle fewer than 100 days (partition values). I personally use an Athena-compatible version of https://github.com/dbt-labs/dbt-utils#insert_by_period which can chunk a dbt model into multiple smaller queries; it's essentially what the AWS workaround does. But I should check if and how that macro can be open-sourced... Do you know if this partition limit only applies to Hive tables and not to Iceberg tables, @nicor88?
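For reference, the AWS pattern from that page boils down to something like the following. This is only a schematic sketch with hypothetical table and column names, not the doc's exact example:

```sql
-- Step 1: CTAS creates the table and writes the first chunk (< 100 partition values)
CREATE TABLE analytics.events_partitioned
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date']
) AS
SELECT id, payload, event_date
FROM analytics.events_raw
WHERE event_date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31';

-- Steps 2..N: each subsequent INSERT INTO adds another chunk of < 100 partition values
INSERT INTO analytics.events_partitioned
SELECT id, payload, event_date
FROM analytics.events_raw
WHERE event_date BETWEEN DATE '2023-02-01' AND DATE '2023-02-28';
```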
It's a limitation of prestodb/trino. I've seen it in Iceberg too. The workaround is to create backfilling logic as you suggested @jessedobbelaere. I was not aware of insert_by_period, but that's what I did too.
Is there a plan to fix this default behavior in the future? The readme says "currently" in parentheses after mentioning the 100-partition limit, which gave me some hope. It seems like we could add some pagination logic to the incremental_insert macro.
The table materialization is tricky because we use CTAS. For incremental, we could consider special pagination by a specific column, but this gets tricky when more than one partition column is used (see the example below).
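To make the multi-column case concrete (the model and column names here are made up): with a config like the one below, 10 distinct country codes times 30 days already produces 300 partition values in a single insert, so paginating on one column alone is not enough.

```sql
{{ config(
    materialized='incremental',
    partitioned_by=['country_code', 'event_date']
) }}

SELECT
    count(*) AS events,
    country_code,
    event_date            -- partition columns last, as Athena CTAS/INSERT expects
FROM {{ ref('stg_events') }}
GROUP BY country_code, event_date
```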
Would it be possible to use Glue as the backend catalog instead of Athena? I'm sure that'd be a lot of work to refactor, but Glue seems to be quite a bit more flexible, especially when it comes to partitioning and bigger data sets.
@ricklamps not sure what you mean here. Under the hood Glue is already used as the catalog; this limitation is on Athena itself, which is used as the data processor. If the 100-partition-count limitation is a really big deal for you, you could consider using dbt-glue/dbt-spark. Also note that the limitation is due to the fact that Athena uses the default

When I have to deal with bigger datasets, for example partitioned by date, I have backfilling logic in my model that allows me to call the model with start_date/end_date parameters. I then invoke my model multiple times with different inputs until my table is fully backfilled. Of course this works when my model is incremental, not when it uses a table materialization.

I will talk with @jessedobbelaere about how we can support https://github.com/dbt-labs/dbt-utils#insert_by_period in a streamlined way to simplify these use cases; we could for example consider using a https://docs.getdbt.com/reference/commands/run-operation to invoke a custom macro.
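As an illustration of that backfilling pattern (a sketch only; the source, column, and variable names are made up), the model bounds its scan with vars so that each invocation stays under 100 partition values:

```sql
{{ config(materialized='incremental', partitioned_by=['event_date']) }}

SELECT id, payload, event_date
FROM {{ source('raw', 'events') }}
WHERE event_date >= DATE '{{ var("start_date") }}'
  AND event_date <  DATE '{{ var("end_date") }}'
```

Each run then covers a window of fewer than 100 days, e.g. `dbt run -s my_model --vars '{"start_date": "2023-01-01", "end_date": "2023-03-31"}'`, repeated with shifted dates until the table is fully backfilled.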
If anyone is considering making something like
I'm adding a comment containing
Actually the problem becomes critical once we start using more than one partition key column. I've also noticed that Athena does not always fail at exactly 100 partitions (I mostly use Iceberg tables). Currently I'm using my custom materialisation logic, which splits the query into batches by default (once partitioning is enabled). CTAS and merge operations also behave differently with respect to the partition limit. All this leads me to the idea of implementing the following query run logic:
Once this logic is applied to the table and incremental strategies, we would not need to implement an insert_by_period one. WDYT?
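Purely as a sketch of how such batching could look (this is not the custom materialisation logic mentioned above; the macro name, the 90-value batch size, and the assumption of string/date-typed partition columns are all illustrative):

```sql
{% macro insert_in_partition_batches(target_relation, model_sql, partition_columns, batch_size=90) %}
    {# Find every distinct combination of partition values the model would produce #}
    {% set combos = run_query(
        'SELECT DISTINCT ' ~ (partition_columns | join(', ')) ~ ' FROM (' ~ model_sql ~ ')'
    ) %}

    {# Insert in batches so that no single query writes more than ~100 partitions #}
    {% for batch in combos.rows | batch(batch_size) %}
        {% set predicate %}
            {%- for row in batch -%}
                ({%- for col in partition_columns -%}
                    {# NB: quoting assumes string/date-typed partition columns #}
                    {{ col }} = '{{ row[col] }}'{{ ' AND ' if not loop.last }}
                {%- endfor -%}){{ ' OR ' if not loop.last }}
            {%- endfor -%}
        {% endset %}
        {% do run_query(
            'INSERT INTO ' ~ target_relation ~ ' SELECT * FROM (' ~ model_sql ~ ') WHERE ' ~ predicate
        ) %}
    {% endfor %}
{% endmacro %}
```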
A solution for this is needed; how we implement it, I don't know.
Quick question, wasn't sure where else to post this. The readme mentions that the plugin only supports the creation of 100 partitions. Is that 100 partition values (i.e. partitioning by date, you can only create 100 days' worth of partitions) or 100 partition fields (i.e. I can partition by 100 fields, field1 -> field100)?