
Partition Limitations #87

Closed
ricklamps opened this issue Dec 8, 2022 · 10 comments · Fixed by #360

Comments

@ricklamps

Quick question, wasn't sure where else to post this. The readme mentions that the plugin only supports the creation of 100 partitions. Is that 100 partition values (e.g. partitioning by date, you can only create 100 days' worth of partitions) or 100 partition fields (e.g. I can partition by 100 fields, field1 -> field100)?

@jessedobbelaere
Contributor

jessedobbelaere commented Dec 8, 2022

It's an Athena limitation, and it's value-based, so if you have a shipment_date partition, you can only have 100 unique values per query: 2022-01-01, 2022-01-02, 2022-01-03, ...

The workaround from AWS is here: https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html where they create the table structure with a CTAS query that writes only 31 partitions (the month of January), then run a second query (an insert) that writes the data for February, and so on. Basically they are chunking the data into queries that each handle fewer than 100 days (partition values)...
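
For illustration, a minimal sketch of that pattern in Athena SQL (table and column names are hypothetical; only the chunking idea follows the linked doc):

```sql
-- Step 1: CTAS creates the table and writes the first chunk (January = 31 partitions).
CREATE TABLE orders_partitioned
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['shipment_date']
) AS
SELECT order_id, amount, shipment_date
FROM orders_raw
WHERE shipment_date BETWEEN DATE '2022-01-01' AND DATE '2022-01-31';

-- Step 2: each later INSERT INTO adds the next chunk, again staying under 100 partitions.
INSERT INTO orders_partitioned
SELECT order_id, amount, shipment_date
FROM orders_raw
WHERE shipment_date BETWEEN DATE '2022-02-01' AND DATE '2022-02-28';
```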

I personally use an Athena-compatible version of https://github.com/dbt-labs/dbt-utils#insert_by_period which can chunk a dbt model into multiple smaller queries. It's essentially what the AWS workaround is doing. But I should check if and how that macro can be open-sourced...
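
For reference, the insert_by_period materialization is configured roughly like this (the column and dates are illustrative, and the upstream macro targets Redshift, so an Athena-compatible port like the one mentioned above is required):

```sql
{{
    config(
        materialized='insert_by_period',
        period='month',
        timestamp_field='shipment_date',
        start_date='2022-01-01',
        stop_date='2023-01-01'
    )
}}

select order_id, amount, shipment_date
from {{ source('raw', 'shipments') }}
-- the materialization substitutes one period's date range here for each query it runs
where __PERIOD_FILTER__
```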

Do you know if this partition limit is only for HIVE tables, and it's not a limit on Iceberg tables @nicor88 ?

@nicor88
Contributor

nicor88 commented Dec 8, 2022

It's a limitation of prestodb/trino; I've seen it on Iceberg tables too. The workaround is to create backfilling logic as you suggested @jessedobbelaere. I was not aware of insert_by_period, but this is what I did too.

@ricklamps
Author

Is there a plan to fix this default behavior in the future? The readme says "currently" in parentheses after mentioning the 100-partition limit, which gave me some hope. It seems like we could add some pagination logic in the incremental_insert macro.

@nicor88
Contributor

nicor88 commented Dec 8, 2022

With the table materialization it's tricky because we use CTAS. For incremental models we could consider adding pagination over a specific column, but that gets tricky when more than one partition column is used.

@ricklamps
Author

Would it be possible to use Glue as the backend catalog instead of Athena? I'm sure that'd be a lot of work to refactor, but Glue seems to be quite a bit more flexible, especially when it comes to partitioning and bigger data sets.

@nicor88
Contributor

nicor88 commented Dec 8, 2022

@ricklamps not sure what you mean here. Under the hood Glue is already used as the catalog; this limitation is on Athena itself, which is used as the data processor. If the 100-partition limit is a really big deal for you, you could consider using dbt-glue/dbt-spark.

Also note that the limitation comes from Athena using the default hive.max-partitions-per-writers, which is 100. (cc @jessedobbelaere)

When I have to deal with bigger datasets, for example partitioned by date, I have backfilling logic in my model that allows me to call the model with parameters: start_date/end_date. I then invoke the model multiple times with different inputs until the table is fully backfilled. Of course, this only works when the model is incremental, not when using a table materialization.
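
A minimal sketch of that kind of model, assuming dbt-athena's partitioned_by config and hypothetical names; each run passes a different window via --vars, e.g. dbt run -s my_model --vars '{start_date: 2022-01-01, end_date: 2022-02-01}':

```sql
{{ config(materialized='incremental', partitioned_by=['shipment_date']) }}

select order_id, amount, shipment_date
from {{ source('raw', 'orders') }}
-- keep each invocation's window under 100 distinct shipment_date values
where shipment_date >= date '{{ var("start_date") }}'
  and shipment_date <  date '{{ var("end_date") }}'
```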

I will talk with @jessedobbelaere about how we can support https://github.com/dbt-labs/dbt-utils#insert_by_period in a streamlined way to simplify these use cases. We could, for example, consider using https://docs.getdbt.com/reference/commands/run-operation to invoke a custom macro.
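
A hedged sketch of what such a run-operation macro could look like (the macro name, tables, and column are all hypothetical):

```sql
{% macro backfill_by_month(months) %}
    {# invoked via: dbt run-operation backfill_by_month --args '{months: ["2022-01", "2022-02"]}' #}
    {% for month in months %}
        {% set insert_sql %}
            insert into analytics.orders_partitioned
            select order_id, amount, shipment_date
            from raw.orders
            where date_format(shipment_date, '%Y-%m') = '{{ month }}'
        {% endset %}
        {% do run_query(insert_sql) %}
        {% do log("Backfilled " ~ month, info=True) %}
    {% endfor %}
{% endmacro %}
```

Each monthly insert writes at most 31 partition values, so every statement stays under the 100-partition limit.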

@owenprough-sift

If anyone is considering making something like dbt_utils.insert_by_period work for Athena, you may be interested in the discussion here: dbt-labs/dbt-core#4174

@rabidaudio

I'm adding a comment containing HIVE_TOO_MANY_OPEN_PARTITIONS to this thread so that it comes up in a search for the associated error message.

@svdimchenko
Contributor

Actually the problem becomes critical once we start using more than one partition key column. I've also noticed that Athena does not always fail at exactly 100 partitions (mostly I use Iceberg tables).

Currently I'm using my own custom materialisation logic, which splits the query into batches by default (once partitioning is enabled).

CTAS and merge operations also behave differently with respect to the partition limit. All this leads me to the idea of implementing the following query run logic:

  1. Try to execute the query.
  2. If the query fails due to the 100-partition limit, split it into batches of 100 partitions and run them consecutively.

Once this logic is applied to the table and incremental strategies, we would not need to implement an insert_by_period one. WDYT?
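
To illustrate the batching step, a simplified sketch of a macro that always batches (names are hypothetical; the try-first part of step 1 is not shown, since catching the partition error from Jinja is not straightforward):

```sql
{% macro insert_in_partition_batches(target, source, partition_column, batch_size=100) %}
    {# fetch the distinct partition values, then insert them in groups of at most 100 #}
    {% set values = run_query('select distinct ' ~ partition_column ~ ' from ' ~ source).columns[0].values() %}
    {% for batch in values | batch(batch_size) %}
        {% set insert_sql %}
            insert into {{ target }}
            select * from {{ source }}
            -- cast keeps the comparison valid regardless of the partition column's type
            where cast({{ partition_column }} as varchar) in ('{{ batch | join("', '") }}')
        {% endset %}
        {% do run_query(insert_sql) %}
    {% endfor %}
{% endmacro %}
```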

@CommonCrisis
Contributor

A solution for this is needed - how we implement it, I don't know.
If someone needs support - just let me know!
