feat: make namespace related api more friendly for distributed engines #5547
jackye1995 wants to merge 2 commits into lance-format:main
Conversation
Code Review: No P0 issues identified.
Code Review: This PR refactors how storage options and credential providers are handled by introducing a unified `StorageOptionsAccessor`, unifying previously separate handling. The architecture change looks solid overall. The main concern is the blocking Hash impl, which should be addressed before merge.
Same as lance-format#5547, but that one is too hard to rebase at the moment.

For easier integration with distributed engines, we want a way to get `dataset.latest_storage_options()` by combining the initial static storage options with the dynamic storage options provider. The current implementation pushes this logic down into the AWS layer. This PR lifts it up into a component, `StorageOptionsAccessor`, so that we can invoke it at the dataset level and also wrap it as a CredentialsProvider for AWS. `ObjectStoreParams` now holds a `StorageOptionsAccessor` instead of just the `storage_options` map. The user API is unchanged: users still pass in storage options and a storage options provider; `StorageOptionsAccessor` is used only internally.

We also remove a few features that turned out to be unnecessary in the credentials vending code path:

1. `ignore_namespace_storage_options`: it never turned out to be possible to set this to true (I was overthinking at the beginning), so the related code is removed.
2. `s3_credentials_refresh_offset_seconds`: we overloaded this setting to configure the credentials refresh lead time for the storage options provider-based AWS credentials provider. That no longer fits the framework, since the feature will apply to GCP and Azure as well. Note that I removed all the related code in Python and Java. Although `s3_credentials_refresh_offset` has existed for a long time, there was never really a way to set it in Python and Java; I added it for the storage options provider only, and it is no longer needed.

Note that I recently added server-side credentials vending support in DirectoryNamespace for all three clouds, so we can now provide a generic solution to vend credentials and test everything end to end for GCP and Azure. This PR still keeps the feature AWS-only; a follow-up PR will add official GCP and Azure support.

--------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
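A minimal sketch of the accessor concept described above, in Python for illustration (the real component is Rust; all names except `latest_storage_options` are assumed): static options are merged with whatever the dynamic provider currently returns, so refreshed credentials override the initial ones.

```python
from typing import Callable, Dict, Optional

class StorageOptionsAccessor:
    """Illustrative stand-in for the Rust StorageOptionsAccessor component."""

    def __init__(
        self,
        static_options: Dict[str, str],
        provider: Optional[Callable[[], Dict[str, str]]] = None,
    ):
        self.static_options = static_options
        self.provider = provider

    def latest_storage_options(self) -> Dict[str, str]:
        # Provider values (e.g. freshly vended credentials) override
        # the static options captured when the dataset was opened.
        merged = dict(self.static_options)
        if self.provider is not None:
            merged.update(self.provider())
        return merged

# Example: a provider simulating vended, refreshable credentials.
accessor = StorageOptionsAccessor(
    {"region": "us-east-1", "access_key_id": "STATIC"},
    provider=lambda: {"access_key_id": "REFRESHED", "session_token": "tok-1"},
)
print(accessor.latest_storage_options()["access_key_id"])  # REFRESHED
```

Because the merge lives in one component, the same object can back `dataset.latest_storage_options()` and be wrapped as an AWS CredentialsProvider, rather than duplicating the logic per cloud.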
In the distributed engines situation, we have the following problem with vended credentials: we pass in the namespace and table ID to get the location and allow dynamic credentials refresh. The table is then cached and used to serve multiple queries.

When executing on another worker (e.g. Spark, LanceDB Enterprise, etc.), we basically have to fetch the credentials again, because we don't know which credentials are currently in use, and the credentials could already have been refreshed and differ from the initial input.
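The staleness problem above can be illustrated with a toy sketch (all names are assumptions, not Lance APIs): credentials captured when the table was opened may already be rotated by the time a worker uses them, while the dataset's latest options still work.

```python
class VendingServer:
    """Toy credential vendor that rotates its valid token."""

    def __init__(self):
        self.current_token = "token-v1"

    def rotate(self):
        self.current_token = "token-v2"

    def validate(self, token: str) -> bool:
        return token == self.current_token

server = VendingServer()
# Storage options captured when the table was first resolved on the driver.
initial_options = {"session_token": server.current_token}

# Credentials are refreshed while the cached table sits idle.
server.rotate()

# A worker shipped the *initial* options is rejected ...
print(server.validate(initial_options["session_token"]))  # False
# ... while a worker given the *latest* options succeeds.
latest_options = {"session_token": server.current_token}
print(server.validate(latest_options["session_token"]))  # True
```

This is why the worker needs access to the dataset's current storage options rather than the ones the driver started with.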
This PR adds an API, `dataset.current_storage_options()`, to get the latest storage options in use, so that they can serve as the initial storage options on the worker node. This ensures we only make a single call to `namespace_client.describe_table`. Note that the engine should configure the credentials refresh lead time to be long enough that a single credential is sufficient for the work in most cases.

Another problem is that when distributing to a worker, we already know the location of the table and the storage options to use, so we should just pass those in and use them. But today the API is an either-or: the user either passes in a uri, or passes in namespace + table ID and it reloads the uri and storage options. We added documentation and updated the API so that if the user passes in namespace + table_id, we run the automated workflow to get the uri and storage options and set the provider as usual, but also give the caller the option to set each component manually to match various caching conditions.
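The intended driver/worker handoff can be sketched as follows (a hypothetical outline, not the actual Lance API; `describe_table`'s response shape and all helper names are assumptions): the driver resolves the table once, then ships the uri and latest storage options so workers skip the namespace round trip.

```python
def open_on_driver(namespace_client, table_id: str):
    # Single describe_table call resolves location and initial credentials.
    info = namespace_client.describe_table(table_id)
    return info["location"], info["storage_options"]

def open_on_worker(uri: str, storage_options: dict) -> dict:
    # Worker uses the pre-resolved values directly; no extra
    # call to namespace_client.describe_table is needed here.
    return {"uri": uri, "storage_options": storage_options}

# Example with a stub namespace client.
class StubNamespaceClient:
    def describe_table(self, table_id: str) -> dict:
        return {
            "location": f"s3://bucket/{table_id}",
            "storage_options": {"session_token": "vended"},
        }

uri, opts = open_on_driver(StubNamespaceClient(), "ns.tbl")
dataset = open_on_worker(uri, opts)
print(dataset["uri"])  # s3://bucket/ns.tbl
```

The either-or relaxation described above corresponds to letting callers choose between the `open_on_driver`-style automated resolution and supplying each component manually, depending on what their engine has already cached.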