Replies: 6 comments 3 replies
-
For the identity columns you are matching on, are they part of the collected stats columns?
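To make that question concrete: Delta only collects file-level statistics on a leading prefix of columns (controlled by the `delta.dataSkippingNumIndexedCols` table property, 32 by default), and data skipping during MERGE relies on those stats. A minimal sketch of how one might check and adjust this, assuming a hypothetical table name `my_db.my_table`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Show the table's current properties; by default Delta collects file-level
# statistics only on the first delta.dataSkippingNumIndexedCols columns (32).
spark.sql("SHOW TBLPROPERTIES my_db.my_table").show(truncate=False)

# If the identity columns used in the MERGE condition fall outside that prefix,
# one option is to raise the limit (or move those columns earlier in the schema).
# Note: this only affects stats collected for newly written files.
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")
```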
-
My guess is that the total size of the "current" table is much smaller than the 140 TB. Given the table is being replaced in 25-40% chunks every day or so, there is going to be a large number of prior files that need to be cleaned up with […]. You used to be able to fetch the […]
-
You should consider using Z-ordering for your Delta table. This clusters the data files by the columns you most often filter or join on, so Spark can skip files it does not need to read, which can help speed up your MERGE operation.
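A minimal sketch of what that might look like, assuming hypothetical table and column names (in practice you would Z-order by the identity columns used in the MERGE condition):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# SQL form: rewrite the table's files so rows with similar (id_col1, id_col2)
# values land in the same files, improving data skipping during MERGE.
spark.sql("OPTIMIZE my_db.my_table ZORDER BY (id_col1, id_col2)")

# Equivalent DeltaTable API form.
DeltaTable.forName(spark, "my_db.my_table").optimize().executeZOrderBy("id_col1", "id_col2")
```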
-
Given the constraints with Z-ordering, I suggest two options:
1. Batch processing: break the MERGE operation into smaller batches and update the data incrementally. This should help reduce the overall load and runtime of any single run.
2. Staging table approach: write the transformed changes to a staging table first, then merge those changes into the main table. This minimizes the impact on the main table during the update process.
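As a rough sketch of the batching idea, here is one way to run the MERGE per partition value instead of in one shot. The table paths, column names, and per-partition loop are placeholder assumptions, not the poster's actual schema:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/path/to/target_table")
updates = spark.read.format("delta").load("/path/to/staged_updates")

# Process one partition value (or a small group of values) at a time.
partition_values = [r["part_col"] for r in updates.select("part_col").distinct().collect()]

for part in partition_values:
    batch = updates.filter(updates.part_col == part)
    (
        target.alias("t")
        .merge(
            batch.alias("s"),
            # Keep the partition predicate in the condition so pruning stays effective.
            f"t.part_col = '{part}' AND t.id = s.id",
        )
        .whenMatchedUpdate(set={"value": "s.value", "updated_at": "s.updated_at"})
        .execute()
    )
```

Each iteration is a separate commit, so a single run touches far fewer files at a time, at the cost of more transactions overall.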
-
I found that liquid clustering + deletion vectors almost does the trick: you can't optimize the new data without applying deletion vectors.
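For anyone trying that combination, a minimal sketch with strong caveats: liquid clustering is incompatible with hive-style partitioning, the `CLUSTER BY` syntax needs a newer Delta release than the 3.0.0 mentioned in the question, and the table/column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the table with liquid clustering on the MERGE key columns instead of
# hive-style partitioning, and turn on deletion vectors.
spark.sql("""
    CREATE TABLE my_db.my_clustered_table (
        id BIGINT,
        date_part DATE,
        value STRING
    )
    USING delta
    CLUSTER BY (id, date_part)
    TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# OPTIMIZE incrementally clusters newly written files; it also rewrites files
# that carry deletion vectors, which is the "applying deletion vectors" step
# mentioned above.
spark.sql("OPTIMIZE my_db.my_clustered_table")
```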
-
How often do you run […]? Lastly, have you tried using multi-part checkpoints for the table (https://docs.delta.io/latest/optimizations-oss.html#multi-part-checkpointing)? It could be that each change (given the size) is taking too long to buffer, since the default behavior is to write all changes into a single parquet file. If the checkpoint files are very large, then this could be blocking the ability to exit the REWRITE stage.
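If it helps, a minimal sketch of enabling multi-part checkpoints via the table property described in that doc; the table name and part size below are placeholder values to tune:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Split each checkpoint across multiple parquet files once it exceeds the given
# number of actions, instead of writing one large checkpoint file.
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.checkpoint.partSize' = '1000000')
""")
```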
-
via Himol Shah (deltalake-questions)
Hello,
I am working on setting up a Spark job which performs a MERGE (
deltaTable.merge.whenMatched.updateExpr()
) operation on a partitioned Delta table. The compute is 25 i3.8xl instances on Databricks, and I'm working with Delta 3.0.0 on Spark 3.5.2.
For every run, I expect to update roughly 500 million rows spread across 35k files. Currently, the majority of the runtime is consumed by the REWRITING stage of the merge, i.e. the REWRITE stage keeps running for > 24 hours. I already have the Deletion Vectors feature enabled, but considering the volume processed in every run, DVs are not of much help. I've already baked in partition pruning and am using as many predicates as I can in the MERGE condition. I cannot use replaceWhere since I'm not looking to rewrite entire partitions.
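For reference, the MERGE is shaped roughly like this. Paths, columns, and update expressions below are placeholders rather than the real schema, and this is the Python form of the call above:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/path/to/partitioned_table")
updates = spark.read.format("delta").load("/path/to/daily_updates")

(
    target.alias("t")
    .merge(
        updates.alias("s"),
        # Partition predicate first so pruning can kick in, then the identity match.
        "t.date_part = s.date_part AND t.id = s.id",
    )
    .whenMatchedUpdate(set={"value": "s.value", "updated_at": "s.updated_at"})
    .execute()
)
```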
Does anyone have any suggestions on how I can optimize the job, or what other approaches I can look into? Any help will be highly appreciated.
Thank you.