
Azure storage account not properly set for Unity Catalog #3142

Closed
kevinzwang opened this issue Oct 28, 2024 · 15 comments · Fixed by #3165
Labels: bug, data-catalogs, needs triage

Comments

@kevinzwang (Member)

Describe the bug

See conversation here: #2903 (comment)

To Reproduce

>>> import daft
>>> from daft.unity_catalog import UnityCatalog
>>> unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN_AZURE)
>>> unity_table_ext = unity.load_table("some_catalog.some_schema.some_table")
>>> df_ext = daft.read_deltalake(unity_table_ext)
>>> df_ext.show()

DaftCoreException: DaftError::External Generic AzureBlob error: Azure Storage Account not set and is required.
 Set either `AzureConfig.storage_account` or the `AZURE_STORAGE_ACCOUNT` environment variable.

Expected behavior

Daft sets the Azure storage account automatically and shows the dataframe properly.

Component(s)

Other

Additional context

No response

@djouallah (Contributor)

Same error here.

@anilmenon14 (Contributor)

Thank you for logging this issue, @kevinzwang.

Since the code block below works, I wanted to ask your thoughts on whether this logic needs to be included in this part of the load_table module for Unity Catalog. Happy to contribute if you are looking for contributions to fix this issue.

import re

import daft
from daft.io import IOConfig
from daft.unity_catalog import UnityCatalog

unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN_AZURE)
unity_table_ext = unity.load_table("some_catalog.some_schema.some_table")

# Gather the storage account name from the table URI
regex_match_az_storage_acc = re.search(r"@([^\.]+)\.", unity_table_ext.table_uri)
if regex_match_az_storage_acc:
    storage_account_parsed = regex_match_az_storage_acc.group(1)
else:
    raise ValueError(f"{unity_table_ext.table_uri} does not appear to be a valid Azure Storage URI")

io_config = IOConfig(azure=unity_table_ext.io_config.azure.replace(storage_account=storage_account_parsed))
df_ext = daft.read_deltalake(unity_table_ext.table_uri, io_config=io_config)
df_ext.show()  # This works and the DataFrame is materialized successfully

@kevinzwang (Member, Author)

Hi @anilmenon14, thanks for the help! What would a table URI look like for Azure, and which part would be the storage account?

@anilmenon14 (Contributor)

Hi @kevinzwang, for ADLS Gen2 it looks like the example below:

abfss://<some_container>@<storage_account>.dfs.core.windows.net/

where <storage_account> is the part we are interested in extracting.

One thing to consider is that Azure storage could also be ADLS Gen1 (the adl:// protocol) or Blob Storage (wasbs://). However, I did some checking, and we should be safe to assume abfss:// URIs for Databricks Unity Catalog, since Unity Catalog has ADLS Gen2 as a prerequisite.
Source: documentation from a Databricks-published blog
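
As an aside, the same extraction can be done without a regex. Here is a minimal sketch (the container and account names are made up for illustration):

from urllib.parse import urlparse

def storage_account_from_abfss(table_uri: str) -> str:
    parsed = urlparse(table_uri)
    # netloc looks like "<container>@<storage_account>.dfs.core.windows.net"
    if parsed.scheme != "abfss" or "@" not in parsed.netloc:
        raise ValueError(f"{table_uri} does not appear to be an ADLS Gen2 URI")
    host = parsed.netloc.split("@", 1)[1]  # "<storage_account>.dfs.core.windows.net"
    return host.split(".", 1)[0]           # "<storage_account>"

print(storage_account_from_abfss("abfss://mycontainer@myaccount.dfs.core.windows.net/some/table"))
# prints: myaccount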

@kevinzwang (Member, Author)

Gotcha. Thank you @anilmenon14 for the information! In that case, it looks like something we would probably want to fix on the Rust side in the Azure code, so that automatic storage account detection can be enabled for any Azure read, not just Unity. Feel free to take a stab at that; otherwise I will work on it tomorrow.
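
For illustration, once such inference is in place, a plain Azure read like the sketch below would work without AzureConfig.storage_account being set explicitly (the path and account names are placeholders):

import daft

# Hypothetical usage: with inference, Daft derives "myaccount" from the
# URI itself instead of requiring AzureConfig.storage_account to be set.
df = daft.read_parquet("abfss://mycontainer@myaccount.dfs.core.windows.net/data/*.parquet")
df.show()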

@anilmenon14 (Contributor)

Thanks @kevinzwang. I suppose you mean that it is best to have the changes applied in src/daft-io/src/azure_blob.rs, around this section, to handle this.
I am a complete beginner in Rust, so I haven't figured out how to get that done and would be interested to see and learn.
Since table_uri is an attribute of daft.unity_catalog.unity_catalog.UnityCatalogTable and not part of the IOConfig, I am curious to learn how that can be handled on the Rust side.

@kevinzwang (Member, Author) commented Nov 1, 2024

No worries @anilmenon14, I'll get a PR out for this. If you could test it out when I'm done with it, that would already be very helpful!

@anilmenon14 (Contributor)

Absolutely, @kevinzwang. Happy to help get the testing done.

@kevinzwang (Member, Author)

@djouallah @g-kannan The fix for the Unity Azure storage account issue has been merged into our main branch and will be in our next release! Feel free to reopen the issue if you encounter a problem with the fix.

@g-kannan

Hi Kevin,

Sorry, I'm facing the same error in 0.3.13 as well. Anything to change in the code?

Code:

from daft.io import AzureConfig, IOConfig

io_config = IOConfig(azure=AzureConfig(storage_account="storageaccount"))
df = daft.read_deltalake(table=unity_table_ext, io_config=io_config)

Error:
DaftCoreException: DaftError::External Generic AzureBlob error: Azure Storage Account not set and is required.
Set either AzureConfig.storage_account or the AZURE_STORAGE_ACCOUNT environment variable.

@g-kannan commented Nov 16, 2024

Hi Kevin,

Noted this in the latest release.


@anilmenon14 (Contributor)

Hey @g-kannan,

@kevinzwang included a new feature in 0.3.10 that makes the API much simpler and more intuitive for Azure users from that version onwards.
The PR where the feature was included is [FEAT] Infer Azure storage account from uri.

import os

import daft
from daft.unity_catalog import UnityCatalog

# Prerequisite: set the below env vars in your runtime
DATABRICKS_HOST = os.environ.get("DATABRICKS_HOST")
PAT_TOKEN = os.environ.get("PAT_TOKEN")

unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN)
unity_table_ext = unity.load_table("some_catalog_in_uc.some_schema_in_uc.some_table_in_uc")
df_ext = daft.read_deltalake(unity_table_ext)
df_ext.show()

I tried the above on 0.3.13 and it works well. Let me know what you see.

@kevinzwang (Member, Author)

Hi @g-kannan, as @anilmenon14 said, Daft should now automatically retrieve the correct storage account from your UC table. Also, if you pass in a Unity table that has credentials attached to it, read_deltalake will ignore the io_config parameter.

However, passing in an io_config should not break the existing behavior. I have a few suspicions about the cause. Could you print out unity_table_ext.table_uri? Daft can derive the storage account from a URL of the form abfss://<some_container>@<storage_account>.dfs.core.windows.net/, so I'm curious to see what yours looks like.
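
That check is just the following (assuming the unity_table_ext object from the earlier snippets):

# Print the table URI so the storage account portion can be inspected.
print(unity_table_ext.table_uri)
# Expected form: abfss://<some_container>@<storage_account>.dfs.core.windows.net/...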

@g-kannan

Hi Anil/Kevin, thanks for this. I think it was an environment issue with my Gitpod workspace. I deleted it and retried in a new workspace, where it worked as it should. The table_uri also shows as mentioned above.


@kevinzwang (Member, Author)

Great! Glad to hear it.
