Allow using UnityCatalogTable in DataFrame.write_deltalake #3336

Closed
kevinzwang opened this issue Nov 19, 2024 · 6 comments
Labels
data-catalogs (Related to support for Data Catalogs), delta-lake, enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@kevinzwang
Member

Is your feature request related to a problem?

You can create a Unity Catalog table in Daft using daft.unity_catalog.UnityCatalog.load_table(tbl). At the moment, you can only use that table with daft.read_deltalake.

Describe the solution you'd like

We should also be able to similarly use it for DataFrame.write_deltalake.

Describe alternatives you've considered

We can extract the table URI and IO config from the Unity Catalog table and pass those into write_deltalake manually, but that is not preferred.
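
For illustration, the manual workaround described above might look like the following minimal sketch. It assumes the loaded table object exposes `table_uri` and `io_config` attributes and that `write_deltalake` accepts `mode` and `io_config` keywords; the helper name `write_via_table_uri` is hypothetical and not part of Daft.

```python
def write_via_table_uri(df, unity_table, mode="append"):
    """Write `df` to the Delta table behind `unity_table` by hand.

    Assumptions (not confirmed by this thread): `unity_table` has
    `table_uri` and `io_config` attributes, and `df.write_deltalake`
    accepts `mode` and `io_config` keyword arguments.
    """
    return df.write_deltalake(
        unity_table.table_uri,            # storage location of the Delta table
        mode=mode,                        # "append" or "overwrite"
        io_config=unity_table.io_config,  # credentials resolved by the catalog
    )
```

The caller has to know about both attributes and remember to forward them, which is the inconvenience this issue asks to remove.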

Additional Context

No response

Would you like to implement a fix?

No

kevinzwang added the enhancement, good first issue, data-catalogs, delta-lake, and needs triage labels on Nov 19, 2024
kevinzwang changed the title from "Allow using UnityCatalogTable in DataFrame.read_deltalake" to "Allow using UnityCatalogTable in DataFrame.write_deltalake" on Nov 20, 2024
@anilmenon14
Contributor

anilmenon14 commented Nov 29, 2024

Hey @kevinzwang, I am planning to work on this issue over the next couple of days. I believe it can be implemented with at least two different API designs, and I would like your opinion, from Eventual's perspective, on which approach is better.

Option A: a unity.write_table() call performs the write to the table:

from daft.unity_catalog import UnityCatalog
unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN)

# Performs the write on execution of the below command
unity.write_table(
    table="some_uc_catalog.some_schema.some_table",
    df=myDaftDF,
    storage_path="abfss:/some_path.......",
    write_mode="overwrite"  # "append" is an option as well
)

Option B: DataFrame.write_deltalake accepts a UnityCatalogTable object that carries additional information about the write mode (overwrite/append):

from daft.unity_catalog import UnityCatalog
unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN)

unity_table_to_write = unity.write_table(
    table_name="some_uc_catalog.some_schema.some_table",
    storage_path="abfss:/some_path.......",
    write_mode="overwrite"  # "append" is an option as well
)

# Performs write on execution of the below
myDaftDF.write_deltalake(unity_table_to_write)

Interested to hear your thoughts.

@kevinzwang
Member Author

Hi @anilmenon14, thanks for offering to take this on! DataFrame.write_deltalake already has a mode parameter in which you can specify your write mode (we already support overwrite and append). Would it make sense to use that?

I was thinking the API would look something like this:

from daft.unity_catalog import UnityCatalog
unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN)
table = unity.load_table("tbl_name")

df.write_deltalake(table, mode="overwrite")
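
One way a writer could support the call shape proposed above is to normalize its `table` argument before writing. The sketch below is only an illustration of that dispatch, not Daft's implementation: `UnityTableStandIn` is a stand-in class with assumed fields, and `resolve_write_target` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple, Union


@dataclass(frozen=True)
class UnityTableStandIn:
    """Stand-in for a catalog table object (assumed fields, not Daft's class)."""

    table_uri: str
    io_config: Optional[Any] = None


def resolve_write_target(
    table: Union[str, UnityTableStandIn],
    io_config: Optional[Any] = None,
) -> Tuple[str, Optional[Any]]:
    """Return the (uri, io_config) pair a Delta Lake writer would use."""
    if isinstance(table, str):
        # A plain URI: use whatever io_config the caller supplied.
        return table, io_config
    if isinstance(table, UnityTableStandIn):
        # An explicitly passed io_config takes priority over the table's own.
        chosen = io_config if io_config is not None else table.io_config
        return table.table_uri, chosen
    raise TypeError(f"unsupported table type: {type(table).__name__}")
```

With this shape, the write mode stays on the existing `mode` parameter and the table object only carries location and credentials.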

@anilmenon14
Contributor

Thanks for the direction, @kevinzwang. This is indeed intuitive; I had not looked into using the mode parameter, which clearly fits well.
If we continue to use unity.load_table, we will have to adapt it to accept a storage location path for a table that does not exist yet (i.e. a brand-new table being created by Daft), since the current Unity Catalog API supports only external tables. It would also mean calling self._client.tables.create(**params) instead of self._client.tables.retrieve(table_name) when creating a new table, where one of the params is the table schema.
I am looking into whether we can avoid declaring the schema at this stage and instead have it 'evolve' during df.write_deltalake, at which point we have access to the schema and can pass it down.
I'll work on the pull request and once it is ready, will tag you on it and reference this issue from it 👍
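
The retrieve-or-create idea in the comment above can be sketched as follows. This is a toy illustration under stated assumptions: `InMemoryTables` stands in for the catalog client's `tables` resource, `TableNotFound` is a hypothetical error type, and the `create` parameter names (`name`, `storage_location`, `columns`) are illustrative rather than taken from the Unity Catalog API.

```python
class TableNotFound(LookupError):
    """Hypothetical error raised when a table name is unknown."""


class InMemoryTables:
    """Toy stand-in for a catalog client's `tables` resource."""

    def __init__(self):
        self._tables = {}

    def retrieve(self, full_name):
        try:
            return self._tables[full_name]
        except KeyError:
            raise TableNotFound(full_name) from None

    def create(self, **params):
        self._tables[params["name"]] = params
        return params


def load_or_create_table(tables, full_name, storage_path=None):
    """Return an existing table, or register a new external one."""
    try:
        return tables.retrieve(full_name)
    except TableNotFound:
        if storage_path is None:
            raise ValueError(
                f"table {full_name!r} does not exist; "
                "a storage_path is required to create it"
            ) from None
        # Columns are left empty on the assumption that the schema can be
        # supplied later, at write time, when the DataFrame is available.
        return tables.create(
            name=full_name, storage_location=storage_path, columns=[]
        )
```

Whether a real catalog accepts a table registered without columns is exactly the open question raised in the comment, so the empty `columns` list here is a placeholder, not a claim about the API.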

@kevinzwang
Member Author

Great! Looking forward to the PR

@anilmenon14
Contributor

I have just opened a PR adding Daft support for Unity Catalog table writes: #3522

@kevinzwang
Member Author

Thank you @anilmenon14, will be sure to give it a review soon!
