[Feature Request] Enable Clone of Delta Lake tables #1387

Open · 1 of 3 tasks
dennyglee opened this issue Sep 20, 2022 · 8 comments
Labels: enhancement (New feature or request)

Comments

@dennyglee (Contributor)

Feature request

Enable Clone of Delta Lake tables

Overview

Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy the data from the source table, while shallow clones do not.

Motivation

From business continuity and disaster recovery to streamlining DevOps, cloning Delta Lake tables enables a wide range of scenarios.

Further details

The context for this functionality can be found at https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html
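
For concreteness, the syntax described at that link looks roughly like the following (a sketch only; the table names are hypothetical, and this syntax was not available in OSS Delta when this issue was filed):

%python

# Sketch of the Databricks CLONE syntax from the doc above, invoked via
# spark.sql; source_table / target_table are hypothetical names.
spark.sql("""
    CREATE TABLE IF NOT EXISTS target_table
    DEEP CLONE source_table VERSION AS OF 5
""")

# Shallow variant: copies table metadata but not the underlying data files.
spark.sql("CREATE TABLE IF NOT EXISTS target_shallow SHALLOW CLONE source_table")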

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@dennyglee added the enhancement label on Sep 20, 2022
@p2bauer commented Sep 20, 2022

Great that this is getting visibility, thank you @dennyglee. I think deep clone functionality specifically would be the most useful for some critical DRP scenarios.

That said, does this feature request encompass the work to port the existing functionality from the core Databricks offering to OSS, or rather a new implementation from scratch?

@dennyglee (Contributor, Author)

I think so, @p2bauer. I think there is still an open debate on which one makes more sense (port or design from scratch). Any particular thoughts on the approach?

@oakesk commented Oct 16, 2022

It would be great to have the deep clone, as @p2bauer suggests, for DRP scenarios; in particular, incremental clone/synchronization after the initial clone 👍

@armckinney commented May 30, 2023

Hello, I see on the roadmap (#1307) that shallow clones were added in 2.3 - are there still plans to add deep clones?



edit: removed alternative question.

I believe for the time being we are going to utilize something like:

%python

# Read the source table as of a given timestamp, then write a full copy.
clone = (spark.read.format("delta")
    .option("timestampAsOf", clone_timestamp.isoformat())
    .load(delta_table_path))

clone.write.format("delta").mode("errorifexists").save(clone_table_path)
  • Looks like the cloned table has similar log data, except that the operation is WRITE instead of CLONE (see the check below).
  • Using this format instead of the CREATE TABLE ... syntax allows us to avoid enabling Hive.
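
For what it's worth, one way to check that first bullet is via the table history API, assuming the delta-spark Python bindings are available (clone_table_path is the same hypothetical path as above):

%python

# Inspect the newest log entry of the cloned table.
from delta.tables import DeltaTable

history = DeltaTable.forPath(spark, clone_table_path).history(1)
history.select("version", "operation").show()
# Expect operation = WRITE rather than CLONE for the copy made above.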

@sezruby
Copy link
Contributor

sezruby commented Jun 23, 2023

What about:

  1. Get the list of files for the latest version
  2. Copy all the files, using the same directory structure (e.g. /path/to/table/A=1/a.parquet should be copied to /path/to/backuptable/A=1/a.parquet)
  3. Copy the /path/to/table/_delta_log dir to /path/to/backuptable/_delta_log

This is a manual alternative to DEEP CLONE for now (a rough sketch of the steps is below).

It's not a complete solution; for example, we don't need to copy the entire _delta_log directory. However, implementing this version would bring a lot more convenience for DRP.
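
A minimal sketch of those steps, assuming a local filesystem and the hypothetical paths above (cloud storage such as S3 or ADLS would need its own filesystem APIs instead of shutil):

%python

import shutil
from pathlib import Path

def manual_deep_copy(source: str, target: str) -> None:
    src, dst = Path(source), Path(target)
    # Steps 1-2: copy the data files, preserving the partition layout.
    # For simplicity this copies every data file rather than only the
    # latest version's file list, consistent with copying the whole log.
    for f in src.rglob("*.parquet"):
        if "_delta_log" in f.parts:
            continue  # files inside the log dir are handled below
        rel = f.relative_to(src)
        (dst / rel).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dst / rel)
    # Step 3: copy the transaction log wholesale.
    shutil.copytree(src / "_delta_log", dst / "_delta_log", dirs_exist_ok=True)

manual_deep_copy("/path/to/table", "/path/to/backuptable")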

@armckinney

Interesting take.
This type of approach will certainly be useful for us in the future, I think. We are currently utilizing a 'DeltaStorageFormat' interface for our ingestion pipelines and have been implementing our own features on top of Delta in this manner. I believe the next one coming up for us will be custom retention policies - i.e. the ability to define which versions to keep after a VACUUM process.

An aside for Databricks to consider implementing in Delta (currently our org just doesn't have the manpower to contribute to the project in any meaningful way), and they might drop hints at DAIS 2023 this week:

I think for most organizations this is typical, as older data generally becomes stale and is only necessary to keep for CYA and auditing reasons. Thus, we would be looking to implement a fall-off policy, keeping only versions like: one version per year for the past 7 years, one per month for the last year, one per week for the last 3 months, and one per day for the last 30 days (sketched below).
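
A hypothetical sketch of that fall-off policy (the helper and tiers are illustrative, not a Delta API): bucket each version by age tier and keep the newest version per bucket.

%python

from datetime import datetime, timedelta

def versions_to_keep(versions: dict[int, datetime], now: datetime) -> set[int]:
    """versions maps a Delta version number to its commit timestamp."""
    tiers = [  # (look-back horizon, bucket width), finest tier first
        (timedelta(days=30), timedelta(days=1)),         # daily for 30 days
        (timedelta(days=90), timedelta(weeks=1)),        # weekly for 3 months
        (timedelta(days=365), timedelta(days=30)),       # monthly for a year
        (timedelta(days=7 * 365), timedelta(days=365)),  # yearly for 7 years
    ]
    buckets: dict[tuple[int, int], int] = {}
    for version, ts in versions.items():
        age = now - ts
        for i, (horizon, width) in enumerate(tiers):
            if age <= horizon:
                slot = (i, age // width)  # timedelta // timedelta -> int
                buckets[slot] = max(buckets.get(slot, version), version)
                break  # the finest matching tier wins
    return set(buckets.values())

Versions not returned by the helper would then be candidates for removal during the VACUUM-style process.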

@IoTier commented Nov 21, 2023

Hi @dennyglee, any idea when Deep Clone is going to be available for OSS Delta tables?

@dishkakrauch

Any news?
