
Partition Limitations #87

Closed
ricklamps opened this issue Dec 8, 2022 · 10 comments · Fixed by #360

Comments

@ricklamps

Quick question, wasn't sure where else to post this. The readme mentions that the plugin only supports the creation of 100 partitions. Is that 100 partition values (e.g. partitioning by date, you can only create 100 days' worth of partitions) or 100 partition fields (e.g. I can partition by 100 fields, field1 -> field100)?

@jessedobbelaere
Contributor

jessedobbelaere commented Dec 8, 2022

It's an Athena limitation, and it's value-based, so if you have a shipment_date partition, you can only have 100 unique values per query: 2022-01-01, 2022-01-02, 2022-01-03, ...

The workaround from AWS is here: https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html where they create the table structure with a CTAS query that writes only 31 partitions (the month of January), then run a second query (an insert) that writes the data for February, and so on. Basically they are chunking the data into queries that each handle fewer than 100 days (partition values)...
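
For illustration, a minimal sketch of that pattern in Athena SQL (table and column names are hypothetical; only the chunking idea follows the linked doc):

```sql
-- Step 1: CTAS creates the table and writes the first chunk (January = 31 partitions).
CREATE TABLE orders_partitioned
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['shipment_date']
) AS
SELECT order_id, amount, shipment_date
FROM orders_raw
WHERE shipment_date BETWEEN DATE '2022-01-01' AND DATE '2022-01-31';

-- Step 2: each later INSERT INTO adds the next chunk, again staying under 100 partitions.
INSERT INTO orders_partitioned
SELECT order_id, amount, shipment_date
FROM orders_raw
WHERE shipment_date BETWEEN DATE '2022-02-01' AND DATE '2022-02-28';
```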

I personally use an Athena-compatible version of https://github.com/dbt-labs/dbt-utils#insert_by_period which can chunk a dbt model into multiple smaller queries. It's essentially what the AWS workaround is doing. But I should check if and how that macro can be open-sourced...
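
For reference, the insert_by_period materialization is configured roughly like this (the column and dates are illustrative, and the upstream macro targets Redshift, so an Athena-compatible port like the one mentioned above is required):

```sql
{{
    config(
        materialized='insert_by_period',
        period='month',
        timestamp_field='shipment_date',
        start_date='2022-01-01',
        stop_date='2023-01-01'
    )
}}

select order_id, amount, shipment_date
from {{ source('raw', 'shipments') }}
-- the materialization substitutes one period's date range here for each query it runs
where __PERIOD_FILTER__
```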

Do you know if this partition limit is only for HIVE tables, and it's not a limit on Iceberg tables @nicor88 ?

@nicor88
Contributor

nicor88 commented Dec 8, 2022

It's a limitation of prestodb/trino; I've seen it on Iceberg tables too. The workaround is to create backfilling logic as you suggested @jessedobbelaere. I was not aware of insert_by_period, but this is what I did too.

@ricklamps
Author

Is there a plan to fix this default behavior in the future? The readme says "currently" in parentheses after mentioning the 100-partition limit, which gave me some hope. It seems like we could add some pagination logic in the incremental_insert macro.

@nicor88
Contributor

nicor88 commented Dec 8, 2022

With the table materialization it's tricky because we use CTAS. For incremental models we could consider adding pagination over a specific column, but that gets tricky when more than one partition column is used.

@ricklamps
Author

Would it be possible to use Glue as the backend catalog instead of Athena? I'm sure that'd be a lot of work to refactor, but Glue seems to be quite a bit more flexible, especially when it comes to partitioning and bigger data sets.

@nicor88
Contributor

nicor88 commented Dec 8, 2022

@ricklamps not sure what you mean here. Under the hood Glue is already used as the catalog; this limitation is on Athena itself, which is used as the data processor. If the 100-partition limit is a really big deal for you, you could consider using dbt-glue/dbt-spark.

Also note that the limitation comes from Athena using the default hive.max-partitions-per-writers, which is 100. (cc @jessedobbelaere)

When I have to deal with bigger datasets, for example partitioned by date, I have backfilling logic in my model that allows me to call the model with parameters: start_date/end_date. I then invoke the model multiple times with different inputs until the table is fully backfilled. Of course, this only works when the model is incremental, not when using a table materialization.
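
A minimal sketch of that kind of model, assuming dbt-athena's partitioned_by config and hypothetical names; each run passes a different window via --vars, e.g. dbt run -s my_model --vars '{start_date: 2022-01-01, end_date: 2022-02-01}':

```sql
{{ config(materialized='incremental', partitioned_by=['shipment_date']) }}

select order_id, amount, shipment_date
from {{ source('raw', 'orders') }}
-- keep each invocation's window under 100 distinct shipment_date values
where shipment_date >= date '{{ var("start_date") }}'
  and shipment_date <  date '{{ var("end_date") }}'
```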

I will talk with @jessedobbelaere about how we can support https://github.com/dbt-labs/dbt-utils#insert_by_period in a streamlined way to simplify these use cases. We could, for example, consider using https://docs.getdbt.com/reference/commands/run-operation to invoke a custom macro.
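
A hedged sketch of what such a run-operation macro could look like (the macro name, tables, and column are all hypothetical):

```sql
{% macro backfill_by_month(months) %}
    {# invoked via: dbt run-operation backfill_by_month --args '{months: ["2022-01", "2022-02"]}' #}
    {% for month in months %}
        {% set insert_sql %}
            insert into analytics.orders_partitioned
            select order_id, amount, shipment_date
            from raw.orders
            where date_format(shipment_date, '%Y-%m') = '{{ month }}'
        {% endset %}
        {% do run_query(insert_sql) %}
        {% do log("Backfilled " ~ month, info=True) %}
    {% endfor %}
{% endmacro %}
```

Each monthly insert writes at most 31 partition values, so every statement stays under the 100-partition limit.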

@owenprough-sift

If anyone is considering making something like dbt_utils.insert_by_period work for Athena, you may be interested in the discussion here: dbt-labs/dbt-core#4174

@rabidaudio

I'm adding a comment containing HIVE_TOO_MANY_OPEN_PARTITIONS to this thread so that it comes up in a search for the associated error message.

@svdimchenko
Contributor

Actually the problem becomes critical once we start using more than one partition key column. I've also noticed that Athena does not always fail at exactly 100 partitions (mostly I use Iceberg tables).

Currently I'm using my own custom materialisation logic, which splits the query into batches by default (once partitioning is enabled).

CTAS and merge operations also behave differently with respect to the partition limit. All this leads me to the idea of implementing the following query run logic:

  1. Try to execute the query.
  2. If the query fails due to the 100-partition limit, split it into batches of 100 partitions and run them consecutively.

Once this logic is applied to the table and incremental strategies, we would not need to implement an insert_by_period one. WDYT?
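
To illustrate the batching step, a simplified sketch of a macro that always batches (names are hypothetical; the try-first part of step 1 is not shown, since catching the partition error from Jinja is not straightforward):

```sql
{% macro insert_in_partition_batches(target, source, partition_column, batch_size=100) %}
    {# fetch the distinct partition values, then insert them in groups of at most 100 #}
    {% set values = run_query('select distinct ' ~ partition_column ~ ' from ' ~ source).columns[0].values() %}
    {% for batch in values | batch(batch_size) %}
        {% set insert_sql %}
            insert into {{ target }}
            select * from {{ source }}
            -- cast keeps the comparison valid regardless of the partition column's type
            where cast({{ partition_column }} as varchar) in ('{{ batch | join("', '") }}')
        {% endset %}
        {% do run_query(insert_sql) %}
    {% endfor %}
{% endmacro %}
```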

@CommonCrisis
Contributor

A solution for this is needed - how we implement it, I don't know.
If someone needs support - just let me know!
