Tips and Tricks for Upserting Massive Delta Tables #3775

newfront · 2024-10-16T21:50:30Z

newfront
Oct 16, 2024
Maintainer

via Himol Shah (deltalake-questions)

Hello,
I am working on a setting up a spark job which performs MERGE (deltaTable.merge.whenMatched.updateExpr()) operation on a partitioned Delta table. The compute is 25 i3.8xl instances on Databricks. I'm working with Delta 3.0.0 on Spark 3.5.2.

For every run, I'm expecting to update roughly 500 million rows spread in 35k files. Currently, majority of the runtime is consumed by the REWRITING stage of merge i.e. the REWRITE stage keeps going on for > 24 hours. I already have Deletion Vector feature on but considering the volume which is processed in every run, DVs are not of extensive help. I've already baked in partition pruning and am using as many predicates as I can in the MERGE condition. I cannot use replaceWhere since I'm not looking to rewrite entire partitions.

Does anyone have any suggestions on how can I optimize the job or what other approaches I can look into? Any help will be highly appreciated.
Thank you.

newfront · 2024-10-16T21:50:44Z

newfront
Oct 16, 2024
Maintainer Author

for the identity columns you are matching on, are they part of the collected stats columns? delta.dataSkippingNumIndexedCols . Given the size of the job (across 35k files) you are probably needing to open all files which is where the higher cost is probably coming into play.

2 replies

newfront Oct 16, 2024
Maintainer Author

but also updating 500 million rows can be costly. Do you have a percentage of rows updated per file (assuming the 35k files have N average rows per file) then you could see if you are routinely replacing more than X% of each file for this job. If you are essentially replacing more than 50% of each file across all files, then it might make sense to think about ways to isolate the merge job to target a smaller percentage of the table and than fan-out the job to reduce the time to completion.

This is common if you have date based partitions (or clusterBy with date) on the table. How large is the table (TBs or GBs)?

newfront Oct 16, 2024
Maintainer Author

via Himol Shah

Thanks a lot for responding on this
@Scott Haines

No. The identity columns I'm matching up on are not the part of collected stats columns.
The table is partitioned on date and hour. Table is roughly 140 TBs in size with 180k files in all. I'll have to perform some calculations to get how many records per file am I updating but I am pretty sure it should be less than 50%. 25-30% possibly.
For reducing the run time, I am thinking to spin multiple parallel jobs (one job per partition or x partitions). But I'm also worried about the heavy cost incurred by the rewrites. I currently have deletion vectors setting on for the table. But a Databricks SME said that it won't be helpful if I'm updating more than 10% records in a file - which is kinda what I need to do.

newfront · 2024-10-18T18:45:39Z

newfront
Oct 18, 2024
Maintainer Author

My guess is that the total size of the "current" table is much smaller than the 140TBs. Given the table is being replaced in 25-40% chunks every day or so, there is going to be a large number of prior files that need to be cleaned up with VACUUM. Do you need to go back and use time travel on the table? Also - what is the size in TBs or GBs of the current Snapshot?

You used to be able to fetch the snapshot directly from the DeltaLog object, but the method for DeltaLog.snapshot is deprecated, while getSnapshotAtInit is how you get the CapturedSnapshot which contains the reference to the Snapshot. Using the Snapshot reference

import org.apache.spark.sql.delta.{DeltaLog}
val deltaLog = DeltaLog.forName(spark, "catalog.tableName")
val currentSnapshot = deltaLog.getSnapshotAtInit()
val tableSizeInBytes = currentSnapshot.snapshot.deltaFileSizeInBytes()

0 replies

Pshak-20000 · 2024-10-21T08:57:22Z

Pshak-20000
Oct 21, 2024

You should consider using Z-ordering for your Delta table. This means arranging your data based on the columns you often search or filter by. By doing this, Spark can find the data it needs much faster, which can help speed up your MERGE operation.

1 reply

tinder-himolshah Oct 22, 2024

I cannot Z-order the Delta table because of business reasons. Moreover, Z-order will trigger rewriting the entire table. So that in itself will be a very expensive and time consuming operation. And then if I run MERGE statement, the table will lose Z-order. So overall, I'll just be splitting load of one operation into two.

Pshak-20000 · 2024-10-23T05:16:43Z

Pshak-20000
Oct 23, 2024

Given the constraints with Z-ordering, I suggest two options:

Batch Processing: You could break the MERGE operation into smaller batches to update data incrementally. This should help reduce the overall load and runtime.

Staging Table Approach: You could create a staging table for transformations and then merge those changes into the main table. This would minimize the impact on the main table during the update process.

0 replies

orthoxerox · 2024-10-25T07:02:42Z

orthoxerox
Oct 25, 2024

I found that liquid clustering + deletion vectors almost does the trick: you can't optimize the new data without applying deletion vectors.

0 replies

newfront · 2024-10-25T16:29:45Z

newfront
Oct 25, 2024
Maintainer Author

How often do you run REORG to apply the deletion vector (soft deletes) changes to the table (create new snapshot)?
How often do you run VACUUM to clean up the total size of the table? (to clean up the probably massive number of change files?)

Lastly, have you tried using multi-part checkpoints for the table? (https://docs.delta.io/latest/optimizations-oss.html#multi-part-checkpointing) it could be that each change (given the size) is taking too long to buffer since the default behavior is to write all changes into a single parquet file. If the checkpoint files are very large, then this could be blocking the ability to exit the REWRITE stage.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tips and Tricks for Upserting Massive Delta Tables #3775

{{title}}

Replies: 6 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Tips and Tricks for Upserting Massive Delta Tables #3775

newfront Oct 16, 2024 Maintainer

Replies: 6 comments · 3 replies

newfront Oct 16, 2024 Maintainer Author

newfront Oct 16, 2024 Maintainer Author

newfront Oct 16, 2024 Maintainer Author

newfront Oct 18, 2024 Maintainer Author

Pshak-20000 Oct 21, 2024

tinder-himolshah Oct 22, 2024

Pshak-20000 Oct 23, 2024

orthoxerox Oct 25, 2024

newfront Oct 25, 2024 Maintainer Author

newfront
Oct 16, 2024
Maintainer

Replies: 6 comments 3 replies

newfront
Oct 16, 2024
Maintainer Author

newfront Oct 16, 2024
Maintainer Author

newfront Oct 16, 2024
Maintainer Author

newfront
Oct 18, 2024
Maintainer Author

Pshak-20000
Oct 21, 2024

Pshak-20000
Oct 23, 2024

orthoxerox
Oct 25, 2024

newfront
Oct 25, 2024
Maintainer Author