[Feature Request] Enable Clone of Delta Lake tables #1387

Open · 1 of 3 tasks
dennyglee opened this issue Sep 20, 2022 · 8 comments
Labels: enhancement (New feature or request)

Comments

@dennyglee (Contributor)

Feature request

Enable Clone of Delta Lake tables

Overview

Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy the data from the source table, while shallow clones do not.

Motivation

From business continuity and disaster recovery to streamlining DevOps, cloning Delta Lake tables enables a wide range of scenarios.

Further details

The context for this functionality can be found at https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html
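
For concreteness, the syntax described at that link looks roughly like the following (a sketch only; the table names are hypothetical, and this syntax was not available in OSS Delta when this issue was filed):

%python

# Sketch of the Databricks CLONE syntax from the doc above, invoked via
# spark.sql; source_table / target_table are hypothetical names.
spark.sql("""
    CREATE TABLE IF NOT EXISTS target_table
    DEEP CLONE source_table VERSION AS OF 5
""")

# Shallow variant: copies table metadata but not the underlying data files.
spark.sql("CREATE TABLE IF NOT EXISTS target_shallow SHALLOW CLONE source_table")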

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@dennyglee added the enhancement label on Sep 20, 2022
@p2bauer commented Sep 20, 2022

Great that this is getting visibility, thank you @dennyglee. I think deep clone functionality specifically would be the most useful for some critical DRP scenarios.

That said, does this feature request encompass the work to port the existing functionality from the core Databricks offering to OSS, or rather a new implementation from scratch?

@dennyglee (Contributor, Author)

I think so, @p2bauer. I think there is still an open debate on which one makes more sense (port or design from scratch). Any particular thoughts on the approach?

@oakesk commented Oct 16, 2022

It would be great to have the deep clone, as @p2bauer suggests, for DRP scenarios; in particular, incremental clone/synchronization after the initial clone 👍

@armckinney commented May 30, 2023

Hello, I see on the roadmap (#1307) that shallow clones were added in 2.3 - are there still plans to add deep clones?



edit: removed alternative question.

I believe for the time being we are going to utilize something like:

%python

# Read the source table as of a given timestamp, then write a full copy.
clone = (spark.read.format("delta")
    .option("timestampAsOf", clone_timestamp.isoformat())
    .load(delta_table_path))

clone.write.format("delta").mode("errorifexists").save(clone_table_path)
  • Looks like the cloned table has similar log data, except that the operation is WRITE instead of CLONE (see the check below).
  • Using this format instead of the CREATE TABLE ... syntax allows us to avoid enabling Hive.
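
For what it's worth, one way to check that first bullet is via the table history API, assuming the delta-spark Python bindings are available (clone_table_path is the same hypothetical path as above):

%python

# Inspect the newest log entry of the cloned table.
from delta.tables import DeltaTable

history = DeltaTable.forPath(spark, clone_table_path).history(1)
history.select("version", "operation").show()
# Expect operation = WRITE rather than CLONE for the copy made above.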

@sezruby
Copy link
Contributor

sezruby commented Jun 23, 2023

What about:

  1. Get the list of files for the latest version
  2. Copy all the files, using the same directory structure (e.g. /path/to/table/A=1/a.parquet should be copied to /path/to/backuptable/A=1/a.parquet)
  3. Copy the /path/to/table/_delta_log dir to /path/to/backuptable/_delta_log

This is a manual alternative to DEEP CLONE for now (a rough sketch of the steps is below).

It's not a complete solution; for example, we don't need to copy the entire _delta_log directory. However, implementing this version would bring a lot more convenience for DRP.
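
A minimal sketch of those steps, assuming a local filesystem and the hypothetical paths above (cloud storage such as S3 or ADLS would need its own filesystem APIs instead of shutil):

%python

import shutil
from pathlib import Path

def manual_deep_copy(source: str, target: str) -> None:
    src, dst = Path(source), Path(target)
    # Steps 1-2: copy the data files, preserving the partition layout.
    # For simplicity this copies every data file rather than only the
    # latest version's file list, consistent with copying the whole log.
    for f in src.rglob("*.parquet"):
        if "_delta_log" in f.parts:
            continue  # files inside the log dir are handled below
        rel = f.relative_to(src)
        (dst / rel).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dst / rel)
    # Step 3: copy the transaction log wholesale.
    shutil.copytree(src / "_delta_log", dst / "_delta_log", dirs_exist_ok=True)

manual_deep_copy("/path/to/table", "/path/to/backuptable")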

@armckinney

Interesting take.
This type of approach will certainly be useful for us in the future, I think. We are currently utilizing a 'DeltaStorageFormat' interface for our ingestion pipelines and have been implementing our own features on top of Delta in this manner. I believe the next one coming up for us will be custom retention policies - i.e. the ability to define which versions to keep after a VACUUM process.

An aside for Databricks to consider implementing in Delta (currently our org just doesn't have the manpower to contribute to the project in any meaningful way), and they might drop hints at DAIS 2023 this week:

I think for most organizations this is typical, as older data generally becomes stale and is only necessary to keep for CYA and auditing reasons. Thus, we would be looking to implement a fall-off policy, keeping only versions like: one version per year for the past 7 years, one per month for the last year, one per week for the last 3 months, and one per day for the last 30 days (sketched below).
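
A hypothetical sketch of that fall-off policy (the helper and tiers are illustrative, not a Delta API): bucket each version by age tier and keep the newest version per bucket.

%python

from datetime import datetime, timedelta

def versions_to_keep(versions: dict[int, datetime], now: datetime) -> set[int]:
    """versions maps a Delta version number to its commit timestamp."""
    tiers = [  # (look-back horizon, bucket width), finest tier first
        (timedelta(days=30), timedelta(days=1)),         # daily for 30 days
        (timedelta(days=90), timedelta(weeks=1)),        # weekly for 3 months
        (timedelta(days=365), timedelta(days=30)),       # monthly for a year
        (timedelta(days=7 * 365), timedelta(days=365)),  # yearly for 7 years
    ]
    buckets: dict[tuple[int, int], int] = {}
    for version, ts in versions.items():
        age = now - ts
        for i, (horizon, width) in enumerate(tiers):
            if age <= horizon:
                slot = (i, age // width)  # timedelta // timedelta -> int
                buckets[slot] = max(buckets.get(slot, version), version)
                break  # the finest matching tier wins
    return set(buckets.values())

Versions not returned by the helper would then be candidates for removal during the VACUUM-style process.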

@IoTier commented Nov 21, 2023

Hi @dennyglee, any idea when Deep Clone is going to be available for OSS Delta tables?

@dishkakrauch

Any news?
