feat: make namespace related api more friendly for distributed engines#5547

Closed
jackye1995 wants to merge 2 commits into lance-format:main from jackye1995:expose-current-storage-options

Conversation


@jackye1995 jackye1995 commented Dec 18, 2025

In the distributed engine situation, we have the following problem with vended credentials: we pass in the namespace and table ID to get the location and allow dynamic credential refresh. The table is then cached and used to serve multiple queries.

When executing on another worker (e.g. Spark, LanceDB Enterprise, etc.), we essentially have to fetch the credentials again, because we do not know which credentials are currently in use; they may already have been refreshed and differ from the initial input.

This PR adds an API, dataset.current_storage_options(), to get the latest storage options in use, so they can serve as the initial storage options on the worker node. This ensures we make only a single call to namespace_client.describe_table. Note that the engine should configure the credential refresh lead time to be long enough that a single credential suffices for the work in most cases.

Another problem is that when distributing to a worker, we already know the table location and the storage options to use, so we should simply pass them in. Today, however, the API is an either-or: the user passes either a uri or a namespace + table ID, and the latter reloads the uri and storage options. We added documentation and updated the API so that if the user passes namespace + table_id, we run the automated workflow to get the uri and storage options and set the provider as usual, but the caller can also set each component manually to match various caching setups.
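The driver/worker handoff described above can be sketched as follows. This is an illustrative simulation, not the actual Lance API: FakeNamespaceClient, FakeDataset, and load_dataset are hypothetical stand-ins that show why resolving the table once on the driver and shipping uri + current_storage_options() to workers avoids extra describe_table calls.

```python
class FakeNamespaceClient:
    """Stands in for a namespace service that vends credentials (hypothetical)."""
    def __init__(self):
        self.describe_table_calls = 0

    def describe_table(self, namespace, table_id):
        # One round trip resolves both the location and the vended credentials.
        self.describe_table_calls += 1
        return {
            "location": f"s3://bucket/{namespace}/{table_id}",
            "storage_options": {"aws_session_token": "tok-1"},
        }

class FakeDataset:
    """Minimal stand-in for a dataset with a storage-options snapshot."""
    def __init__(self, uri, storage_options):
        self.uri = uri
        self._storage_options = dict(storage_options)

    def current_storage_options(self):
        # In the real API this returns the latest merged options
        # (static + provider); here it is just the stored snapshot.
        return dict(self._storage_options)

def load_dataset(client, namespace, table_id):
    # Driver-side automated workflow: namespace + table_id resolve to
    # uri and initial storage options in a single call.
    info = client.describe_table(namespace, table_id)
    return FakeDataset(info["location"], info["storage_options"])

# Driver: resolve once, then ship uri + options to each worker task.
client = FakeNamespaceClient()
ds = load_dataset(client, "ns", "tbl")
task = {"uri": ds.uri, "storage_options": ds.current_storage_options()}

# Worker: open directly from the shipped uri/options -- no extra
# describe_table round trip is needed.
worker_ds = FakeDataset(task["uri"], task["storage_options"])
assert client.describe_table_calls == 1
```

The key property is the final assertion: no matter how many workers the task fans out to, the namespace service is consulted exactly once on the driver.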

@jackye1995 jackye1995 marked this pull request as draft December 18, 2025 23:09
@github-actions github-actions bot added enhancement New feature or request python java labels Dec 18, 2025
@jackye1995 jackye1995 changed the title feat: expose current_storage_options feat: make namespace related api more friendly for distributed engines Dec 18, 2025

codecov bot commented Dec 18, 2025

@jackye1995 jackye1995 force-pushed the expose-current-storage-options branch from bdff78f to 8dc1662 Compare December 19, 2025 08:07
@github-actions

Code Review

This PR introduces the StorageOptionsAccessor abstraction to unify storage options and provider management for distributed engine scenarios. The refactoring consolidates storage_options and storage_options_provider into a single accessor pattern with caching and automatic credential refresh.

P0 Issues

None identified.

P1 Issues

  1. Potential race condition in credential refresh retry loop (storage_options.rs)

    In get_provider_options_with_version(), when the write lock is busy, the code returns None and the caller retries with a 10ms sleep. However, there's no maximum retry limit, which could lead to indefinite spinning in pathological cases:

    loop {
        match self.do_get_provider_options().await? {
            Some((opts, version)) => return Ok((opts, version)),
            None => {
                // Write lock was busy; retry after a short sleep, with no upper bound.
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
        }
    }

    Consider adding a maximum retry count or exponential backoff with a timeout.
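One way to bound the loop is to cap attempts and back off exponentially. The sketch below is illustrative only (the real code is async Rust; get_with_retry and its parameters are hypothetical names), but it shows the shape of the fix:

```python
import time

def get_with_retry(fetch, max_attempts=8, base_delay=0.01):
    """Call fetch() until it returns a non-None value, doubling the
    sleep between attempts; give up after max_attempts tries."""
    delay = base_delay
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)
        delay *= 2  # 10ms, 20ms, 40ms, ... instead of a fixed 10ms forever
    raise TimeoutError("provider options still unavailable after retries")
```

With a cap of 8 attempts and a 10ms base delay, the worst-case wait is bounded at roughly 2.5 seconds instead of spinning indefinitely.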

  2. Test coverage for StorageOptionsAccessor

    The new StorageOptionsAccessor struct (roughly 300 lines of new code) appears to lack dedicated unit tests. Given that this is a critical component for credential management across Python, Java, and Rust, I'd recommend adding tests covering:

    • Caching behavior (expiration, refresh before expiry)
    • Merging of initial options with provider options
    • Version tracking on refresh
    • Edge cases (empty options, no provider, etc.)
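The behaviors listed above can be pinned down concretely. Below is a hedged Python sketch of the accessor semantics under test (merge with provider precedence, expiry-driven refresh, version bump) — a simplified model, not the actual Rust StorageOptionsAccessor; the constructor parameters and clock injection are assumptions made for testability:

```python
import time

class StorageOptionsAccessor:
    """Simplified model: static initial options merged with cached,
    TTL-refreshed options from an optional provider callable."""
    def __init__(self, initial_options, provider=None, ttl=300.0,
                 clock=time.monotonic):
        self._initial = dict(initial_options)
        self._provider = provider      # callable returning a dict, or None
        self._ttl = ttl
        self._clock = clock            # injectable clock for deterministic tests
        self._cached = None
        self._expires_at = 0.0
        self.version = 0

    def current_storage_options(self):
        if self._provider is None:
            return dict(self._initial)  # no provider: just the static options
        now = self._clock()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._provider()
            self._expires_at = now + self._ttl
            self.version += 1           # version bumps on every refresh
        merged = dict(self._initial)
        merged.update(self._cached)     # provider options take precedence
        return merged
```

A test can then drive the fake clock past the TTL and assert that the provider is called exactly once per expiry window, that provider keys shadow initial keys, and that version increments only on refresh.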

Observations

  • The API surface is well-designed with clear precedence rules (provider options override initial options).
  • The migration from storage_options + storage_options_provider to unified storage_options_accessor is consistently applied across Python, Java, and Rust bindings.
  • Documentation and examples in the Java/Python bindings are thorough.

@github-actions

Code Review

This PR refactors how storage options and credential providers are handled by introducing a unified StorageOptionsAccessor that manages both static options and dynamic providers with automatic caching and refresh.

Summary

The changes unify the previous separate storage_options and storage_options_provider into a single StorageOptionsAccessor that handles option merging, caching, and credential refresh. This is a significant refactoring that improves the API for distributed engines.

P1 Issues

  1. Blocking Hash implementation - StorageOptionsAccessor::hash() uses blocking_read() on a tokio RwLock. Hash is a sync trait, so if hash() is invoked from an async context (likely, given the codebase's async-first architecture), this could deadlock or panic. Consider either:

    • Using std::sync::RwLock for initial_options if it's rarely written to
    • Adding documentation warning about calling this from async contexts
    • Adding #[track_caller] or panic guards for async context detection
  2. Test assertion on concurrent refresh - In test_accessor_concurrent_refresh, the assertion call_count >= 1 is very weak for a concurrency test. With 20 concurrent tasks hitting an expired cache, you should also validate that the provider is not called 20 times (which would indicate the caching isn't working). Consider adding an upper bound like call_count <= 3 to catch a regression where the cache stops working.
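To illustrate why a tight upper bound is achievable: with double-checked locking, N concurrent readers hitting an empty/expired cache trigger exactly one provider call, so the test can assert equality rather than just a lower bound. The sketch below is a hypothetical Python model of that pattern, not the PR's actual test:

```python
import threading

class CachedProvider:
    """Single-flight cached provider via double-checked locking."""
    def __init__(self, provider):
        self._provider = provider
        self._lock = threading.Lock()
        self._cached = None
        self.call_count = 0

    def get(self):
        if self._cached is not None:     # fast path: no lock once populated
            return self._cached
        with self._lock:
            if self._cached is None:     # re-check under the lock
                self.call_count += 1
                self._cached = self._provider()
            return self._cached

cp = CachedProvider(lambda: {"token": "tok"})
threads = [threading.Thread(target=cp.get) for _ in range(20)]
for th in threads:
    th.start()
for th in threads:
    th.join()
# Exactly one refresh despite 20 concurrent readers.
assert cp.call_count == 1
```

Because only the first thread to acquire the lock performs the refresh, and every other thread re-checks the cache under the same lock, call_count == 1 is deterministic here — which is the kind of guarantee the concurrency test should pin down.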

Minor Observations

  • The consolidation of storage options and provider into StorageOptionsAccessor is a good API improvement.
  • Good test coverage for the new accessor with various scenarios including expiration and concurrent access.
  • Documentation in Python and Java APIs is comprehensive with clear examples.

The architecture change looks solid overall. The main concern is the blocking Hash impl which should be addressed before merge.

@jackye1995 jackye1995 closed this Jan 16, 2026
jackye1995 added a commit that referenced this pull request Jan 16, 2026
Same as #5547, but that one is
too hard to rebase at this moment.

For easier integration with distributed engines, we want a way to get
`dataset.latest_storage_options()` through the initial static storage
options + the dynamic storage options provider.

However, the current implementation pushes this logic down to the AWS layer. This PR
lifts it up into a component, `StorageOptionsAccessor`, so that we can
invoke it at the dataset level and also simply wrap it as a
CredentialsProvider for AWS. `ObjectStoreParams` now holds a
`StorageOptionsAccessor` instead of just the `storage_options` map.

For the user API, we continue to let users input storage options and a storage
options provider; `StorageOptionsAccessor` is used only internally.

We also remove a few features that turned out to be unnecessary in
the credentials vending code path:
1. ignore_namespace_storage_options: it never turns out to be possible
to set this to true; I was overthinking at the beginning, so the
related code is removed.
2. s3_credentials_refresh_offset_seconds: we overloaded this setting to
control the credential refresh lead time for the storage options
provider-based AWS credentials provider. That no longer fits the
framework, since the feature will apply to GCP and Azure as well. Note
that I removed all the related code in Python and Java. Although
s3_credentials_refresh_offset has existed for a long time, there was
never really a way to set it in Python and Java; I added it for the
storage options provider only, and it is no longer needed.

Note that I recently added credentials vending server-side support in
DirectoryNamespace for all three clouds, so we can now provide a generic
solution to vend credentials and test everything end to end for GCP and
Azure. This PR still keeps the feature AWS-only; we will add GCP and
Azure support officially in another PR.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 20, 2026
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 20, 2026
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026