
read_deltalake on Unity Catalog Table from Databricks has invalid region configuration #2903

Closed
lukaskratoch opened this issue Sep 24, 2024 · 17 comments

@lukaskratoch

lukaskratoch commented Sep 24, 2024

I am trying to read a table stored in Unity Catalog (external data access enabled) in Databricks and I am getting "OSError: Generic S3 error: Received redirect without LOCATION, this normally indicates an incorrectly configured region", even though the region is explicitly defined in io_config:

import daft
from daft.unity_catalog import UnityCatalog
from daft.io import IOConfig, S3Config
from dotenv import dotenv_values

env_cfg = dotenv_values()

unity = UnityCatalog(
    endpoint=env_cfg.get('DBX_ENDPOINT'),
    token=env_cfg.get('DBX_TOKEN'),
)

print(unity.list_catalogs())  # See all available catalogs, works OK
print(unity.list_schemas('test_catalog'))  # See available schemas in a given catalog, works OK
print(unity.list_tables('test_catalog.test_schema'))  # See available tables in a given schema, works OK

cfg = unity.load_table('test_catalog.test_schema.test_table')  # works OK
io_config = IOConfig(s3=S3Config(region_name='eu-central-1'))
cfg_df = daft.read_deltalake(cfg, io_config=io_config)  # this is where the error happens

And I am getting this output

[...catalogs...]
[...schemas...]
[...tables...]

With this error

failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })), connection: Unknown } }) }))
failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })), connection: Unknown } }) }))
S3 Credentials not provided or found when making client for us-east-1! Reverting to Anonymous mode. the credential provider was not enabled
[2024-09-23T13:10:00Z WARN aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })), connection: Unknown } }) }))
[2024-09-23T13:10:00Z WARN aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: io error: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })), connection: Unknown } }) }))
Traceback (most recent call last):
File "poc_uc_daft.py", line 29, in
cfg_df = daft.read_deltalake(cfg, io_config=io_config)
File "/c/Users/xxx/Projects/xxx/venv/lib/python3.8/site-packages/daft/api_annotations.py", line 39, in _wrap
return timed_func(*args, **kwargs)
File "/c/Users/xxx/Projects/xxx/venv/lib/python3.8/site-packages/daft/analytics.py", line 228, in tracked_fn
result = fn(*args, **kwargs)
File "/c/Users/xxx/Projects/xxx/venv/lib/python3.8/site-packages/daft/io/_deltalake.py", line 74, in read_deltalake
delta_lake_operator = DeltaLakeScanOperator(table_uri, storage_config=storage_config)
File "/c/Users/xxx/Projects/xxx/venv/lib/python3.8/site-packages/daft/delta_lake/delta_lake_scan.py", line 63, in init
self._table = DeltaTable(
File "/c/Users/xxx/Projects/xxx/venv/lib/python3.8/site-packages/deltalake/table.py", line 380, in init
self._table = RawDeltaTable(
OSError: Generic S3 error: Received redirect without LOCATION, this normally indicates an incorrectly configured region

Desktop (please complete the following information):

  • Windows 11

Am I doing something wrong, or is this a bug? Is there a workaround?
Could it be related to this two-day-old issue? #2879

@samster25
Member

👁️ @kevinzwang

@jaychia
Contributor

jaychia commented Sep 24, 2024

Yeah the delta-rs SDK is very dumb about regions -- we have to provide it with exactly the right region otherwise it will freak out.

Reading the code, we take the IOConfig (and hence the region) provided by Unity Catalog:

        # Override the storage_config with the one provided by Unity catalog
        table_io_config = table.io_config
        if table_io_config is not None:
            storage_config = StorageConfig.native(NativeStorageConfig(multithreaded_io, table_io_config))

My guess is that Unity doesn't give us the right region, or likely doesn't give us a region at all. We might need to corner case this 😬
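
A rough sketch of the kind of fallback we might need (hypothetical; it assumes S3Config exposes a readable region_name and the replace(...) helper used elsewhere in this thread):

from daft.io import IOConfig, S3Config

def with_region_fallback(table_io_config, fallback_region):
    # Nothing vended by Unity at all: build a config from the caller's region.
    if table_io_config is None or table_io_config.s3 is None:
        return IOConfig(s3=S3Config(region_name=fallback_region))
    # Config vended but without a region: fill in the caller's region.
    if table_io_config.s3.region_name is None:
        return IOConfig(s3=table_io_config.s3.replace(region_name=fallback_region))
    # Unity already told us the region: trust it.
    return table_io_config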

@lukaskratoch
Author

Option B: it's not giving any region.

  1. I checked it and my Databricks / Unity Catalog instance is definitely on "eu-central-1" as stated in the code above.
  2. I printed cfg.io_config (the Unity Catalog table's io_config) and it's None (screenshot omitted).

@kevinzwang
Member

It looks like we currently ignore the given io_config if you pass in a unity catalog table, and we do not construct an S3Config with a region in our UnityCatalog.load_table.

We'd like to get a fix for this, but in the meantime @lukaskratoch what you can do is add the region to the table's io config:

cfg = unity.load_table('test_catalog.test_schema.test_table') 
cfg.io_config = IOConfig(s3=cfg.io_config.s3.replace(region_name='eu-central-1'))
cfg_df = daft.read_deltalake(cfg) # should work now

@jaychia added the bug label on Sep 25, 2024
@lukaskratoch
Author

Initially I was getting this error (screenshot omitted):

UnityCatalogTable has to be set to frozen=False (I just overwrote it in local libraries)

After that I am getting this error:
DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 30s }), connection: Unknown } })
which I guess refers to this issue in delta-rs

@kevinzwang
Member

@lukaskratoch what version of deltalake are you using?

As for a solution to the io config thing without having to modify the library, you could perhaps extract the table_uri and the io_config and pass those in manually.

cfg = unity.load_table('test_catalog.test_schema.test_table') 
io_config = IOConfig(s3=cfg.io_config.s3.replace(region_name='eu-central-1'))
daft.read_deltalake(cfg.table_uri, io_config=io_config)
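
An alternative sketch that avoids both mutating the table and patching the library, assuming UnityCatalogTable is a frozen dataclass (the attribute names and the frozen behavior are inferred from this thread): dataclasses.replace builds a modified copy instead of assigning to the frozen instance.

import dataclasses

from daft.io import IOConfig

cfg = unity.load_table('test_catalog.test_schema.test_table')
# Build a new UnityCatalogTable with the region filled in, leaving the original untouched.
patched = dataclasses.replace(
    cfg,
    io_config=IOConfig(s3=cfg.io_config.s3.replace(region_name='eu-central-1')),
)
cfg_df = daft.read_deltalake(patched)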

@anilmenon14
Contributor

@kevinzwang, your recommended solution did the trick for me. I had run into the same issue as @lukaskratoch and found this while searching to see whether an issue had already been logged.

@kevinzwang ,
A question I had, which I couldn't figure out yet: is there any way that Daft's APIs could figure out the region of the table URI?
If it is not possible, I think the Databricks SDK WorkspaceClient might be a way to get the region of the metastore and pass it down to daft.read_deltalake.

This is the piece of code that works for me; it uses the Databricks SDK WorkspaceClient to retrieve the metastore region.

import daft # Using version 0.3.4
from daft.unity_catalog import UnityCatalog
from daft.io import IOConfig, S3Config
from databricks.sdk import WorkspaceClient # Using version  0.33.0
import os

DATABRICKS_HOST = os.environ.get('DATABRICKS_HOST') 
PAT_TOKEN_AWS = os.environ.get('PAT_TOKEN_AWS') #i.e, the PAT token stored in environment variables 

w = WorkspaceClient(host=DATABRICKS_HOST, token=PAT_TOKEN_AWS)
metastore_summary = w.metastores.summary()

unity = UnityCatalog(endpoint=DATABRICKS_HOST,token=PAT_TOKEN_AWS)

# Read an external table 
unity_table_ext = unity.load_table("main_catalog.cust_another_schema.sample_external_access_table") # This is an external table
io_config = IOConfig(s3=unity_table_ext.io_config.s3.replace(region_name=metastore_summary.region))
df_ext = daft.read_deltalake(unity_table_ext.table_uri,io_config=io_config)
df_ext.show()

@lukaskratoch
Author

@kevinzwang

@lukaskratoch what version of deltalake are you using?

Previously deltalake==0.19.1; today I updated to deltalake==0.20.1 and I am still getting the same connection timeout "DispatchFailure" error.

@kevinzwang
Member

@lukaskratoch @anilmenon14 Thank you for the information! I am taking a look into these issues and hope to have an update for you soon

@jordandakota

jordandakota commented Oct 8, 2024

Adding an Azure perspective here: this is what I had to do to be able to read from the table. It seems like credential vending just isn't working for Azure, since it's still trying S3.

from daft.unity_catalog import UnityCatalog
import daft
from deltalake.table import DeltaTable
from daft.io import IOConfig, AzureConfig

azure_config = AzureConfig(
    storage_account="myaccount",
    tenant_id="my_tenant",
    client_id="my_client",
    client_secret="my_secret",
)

io_config = IOConfig(azure=azure_config)

unity = UnityCatalog(
    endpoint="https://<workspace-address>.azuredatabricks.net",
    token="my_token",
)

unity_table = unity.load_table("catalog.schema.table")
df = daft.read_deltalake(unity_table.table_uri, io_config=io_config)

df.show()

It still gives me the output of:

failed to load region from IMDS err=failed to load IMDS session token: dispatch 
failure: io error: error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051): tcp connect error: 
A socket operation was attempted to an unreachable network. (os error 10051): A 
socket operation was attempted to an unreachable network. (os error 10051) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 10051, kind: NetworkUnreachable, message: "A socket operation was attempted to an unreachable network." })), connection: Unknown } }) 
}))
failed to load region from IMDS err=failed to load IMDS session token: dispatch 
failure: io error: error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051): tcp connect error: 
A socket operation was attempted to an unreachable network. (os error 10051): A 
socket operation was attempted to an unreachable network. (os error 10051) (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 10051, kind: NetworkUnreachable, message: "A socket operation was attempted to an unreachable network." })), connection: Unknown } }) 
}))
S3 Credentials not provided or found when making client for us-east-1! Reverting to Anonymous mode. the credential provider was not enabled

Nothing I tried with the S3 config part seemed to do anything; I had to pass in my own credentials entirely. Even when I added a custom s3_config and set it to southcentralus, it would still tell me us-east-1.

Note: passing in my own Azure credentials does work and I can read the table. I just can't get my token to do credential vending.

@anilmenon14
Contributor

Hi @jordandakota ,
I have been running into the same problem on Azure storage and logged a new issue: #3024.
I believe I know where the issue is and will raise a PR this week with a solution to try to get Azure support.

@jaychia
Contributor

jaychia commented Oct 9, 2024

Yes, when we first built this integration the Unity Catalog only vended S3 credentials 😛

I think Unity has made some progress since then, but we probably do need to upgrade the Python SDK to get the new updated API spec.

WRT regions, we will have to play around with the API a little to figure out what Databricks' implementation of Unity is returning to us.

kevinzwang added a commit that referenced this issue Oct 23, 2024
We do this already when we read from S3, but delta-rs does not, so their
metadata reads fail. This is especially an issue for unity catalog
tables, where the region is not specified anywhere.

Tested locally, it works but I'm unsure how I would test this in unit
tests since this is sort of unity+aws behavior specific

Fix for #2903
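
For reference, a minimal sketch of the general technique (not necessarily how the Daft fix is implemented): resolve the bucket's region from the x-amz-bucket-region header via boto3 and use it to fill in the vended config.

import boto3
from botocore.exceptions import ClientError

def resolve_bucket_region(bucket: str) -> str:
    # HeadBucket reports the bucket's region via the x-amz-bucket-region response header.
    client = boto3.client("s3")
    try:
        resp = client.head_bucket(Bucket=bucket)
    except ClientError as err:
        # Even a 301/403 response still carries the bucket's region in its headers.
        resp = err.response
    return resp["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"]

# Hypothetical usage with a Unity-vended config:
# region = resolve_bucket_region("my-bucket")
# io_config = IOConfig(s3=cfg.io_config.s3.replace(region_name=region))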
@kevinzwang
Member

kevinzwang commented Oct 23, 2024

v0.3.9 has just been released which should fix the S3 region issue! I am closing this issue but @lukaskratoch @anilmenon14 please try it out and if you still run into issues, feel free to reopen this issue.

@jordandakota your Azure issue was also addressed in the latest release. If you have trouble with it, please create a new issue to report it

@anilmenon14
Contributor

anilmenon14 commented Oct 24, 2024

@kevinzwang, thanks for the update on the S3 region issue fix. I can confirm it works fine, but with a small issue: it appears botocore and boto3 are not part of the dependencies and hence are not installed by default. I installed them separately in my virtual environment and got past the issue.
I can confirm this pattern now works for AWS, without having to pass down the region name, for AWS regions outside us-east-1:

unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN_AWS)
unity_table_mngd = unity.load_table("some_catalog.some_schema.some_table")  # This table's storage is on eu-west-1
df_mngd = daft.read_deltalake(unity_table_mngd)
df_mngd.show()

As for the Azure issue, this appears to be fixed too, but it needs a bit of a workaround, which I hope we can avoid in the future.

What does not work for Azure:

unity = UnityCatalog(endpoint=DATABRICKS_HOST,token=PAT_TOKEN_AZURE)
unity_table_ext = unity.load_table("some_catalog.some_schema.some_table") 
df_ext = daft.read_deltalake(unity_table_ext)
df_ext.show()

Error:

DaftCoreException: DaftError::External Generic AzureBlob error: Azure Storage Account not set and is required.
 Set either `AzureConfig.storage_account` or the `AZURE_STORAGE_ACCOUNT` environment variable.

What works for Azure now:

import re

from daft.io import IOConfig

unity = UnityCatalog(endpoint=DATABRICKS_HOST, token=PAT_TOKEN_AZURE)
unity_table_ext = unity.load_table("some_catalog.some_schema.some_table")

regex_match_az_storage_acc = re.search(r'@([^\.]+)\.', unity_table_ext.table_uri)  # Gather the storage account name from the table URI
if regex_match_az_storage_acc:
    storage_account_parsed = regex_match_az_storage_acc.group(1)
else:
    raise ValueError("{} does not appear to be a valid Azure Storage URI".format(unity_table_ext.table_uri))

io_config = IOConfig(azure=unity_table_ext.io_config.azure.replace(storage_account=storage_account_parsed))
df_ext = daft.read_deltalake(unity_table_ext.table_uri, io_config=io_config)
df_ext.show()

@jordandakota, when you have a chance, could you also test whether you see this behavior in Azure?
@kevinzwang, I haven't had a chance to review the internals of df.show() to inspect this behavior in Azure; we could try to solve it in a newly logged issue, if you agree.

@jordandakota

I did test and arrived at the same workaround, just slightly differently. Reporting in now as confirmed.

@g-kannan

Hi @kevinzwang, as @anilmenon14 mentioned, I too got the same error for Azure. Replacing the storage account name solved the issue for me. Please take note or create a new issue, as this seems to be a workaround.

Works in Azure:
io_config = IOConfig(azure=unity_table_ext.io_config.azure.replace(storage_account='storage_account'))
df = daft.read_deltalake(table=unity_table_ext.table_uri,io_config=io_config)
df.show()

@kevinzwang
Member

@anilmenon14 @g-kannan will take a look at these issues, thank you for reporting them. Tracking it in a new issue #3142
