@jackye1995 jackye1995 commented Jan 16, 2026

Same as #5547, but that one is too hard to rebase at this moment.

For easier integration with distributed engines, we want dataset.latest_storage_options() to resolve from the initial static storage options plus the dynamic storage options provider.

However, the current implementation pushes this logic down into the AWS-specific code. This PR lifts the logic up into a component, StorageOptionsAccessor, so that we can invoke it at the dataset level and also wrap it as a CredentialsProvider for AWS. ObjectStoreParams now holds a StorageOptionsAccessor instead of just the storage_options map.

For the user API, callers continue to pass storage options and a storage options provider; StorageOptionsAccessor is used only internally.
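
The layering described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the actual Lance code: the `OptionsProvider` trait, the `SketchAccessor` type, the `FixedProvider` helper, and the rule that provider values win on key conflicts are all assumptions made for this example. Only the constructor names echo the ones mentioned in this PR.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, synchronous stand-in for a storage options provider.
/// The real trait may differ (for example, it may be async).
pub trait OptionsProvider: Send + Sync {
    /// Returns fresh options, for example newly vended credentials.
    fn fetch(&self) -> HashMap<String, String>;
}

/// Simplified accessor: static initial options plus an optional dynamic provider.
pub struct SketchAccessor {
    initial: HashMap<String, String>,
    provider: Option<Arc<dyn OptionsProvider>>,
}

impl SketchAccessor {
    /// Accessor that always returns the options it was given.
    pub fn with_static_options(initial: HashMap<String, String>) -> Self {
        Self { initial, provider: None }
    }

    /// Accessor that overlays provider output on top of the initial options.
    pub fn with_initial_and_provider(
        initial: HashMap<String, String>,
        provider: Arc<dyn OptionsProvider>,
    ) -> Self {
        Self { initial, provider: Some(provider) }
    }

    /// Returns the initial options overlaid with the provider's latest values.
    /// Provider values win on key conflicts (an assumption of this sketch).
    pub fn latest(&self) -> HashMap<String, String> {
        let mut merged = self.initial.clone();
        if let Some(provider) = &self.provider {
            merged.extend(provider.fetch());
        }
        merged
    }
}

/// Demo provider that always returns the same map.
pub struct FixedProvider(pub HashMap<String, String>);

impl OptionsProvider for FixedProvider {
    fn fetch(&self) -> HashMap<String, String> {
        self.0.clone()
    }
}
```

Because the accessor only deals in string maps, the same value could be called at the dataset level or wrapped by a cloud-specific credentials provider. Caching and expiry handling are left out here.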

We also remove a few features that turned out to be unnecessary in the credentials vending code path:

  1. ignore_namespace_storage_options: in practice this can never be set to true. I was overthinking at the beginning, so this PR removes the related code.
  2. s3_credentials_refresh_offset_seconds: we overloaded this option to set the credentials refresh lead time for the storage-options-provider-based AWS credentials provider. It no longer fits the framework, since the refresh lead time will apply to GCP and Azure as well. Note that I removed all the related code in Python and Java. Although s3_credentials_refresh_offset has existed for a long time, there was never really a way to set it in Python or Java. I added it for the storage options provider only, and it is no longer needed.
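
To illustrate why a refresh lead time is not AWS-specific, here is a small sketch of the decision it drives. The function name `needs_refresh` and its parameters are hypothetical and are not taken from the Lance codebase; the only assumption is that vended credentials carry an expiry timestamp.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when credentials expiring at `expires_at` should be refreshed,
/// that is, when `now` is within `lead_time` of the expiry or already past it.
pub fn needs_refresh(now: SystemTime, expires_at: SystemTime, lead_time: Duration) -> bool {
    match expires_at.checked_sub(lead_time) {
        // Refresh as soon as we reach the start of the lead-time window.
        Some(refresh_at) => now >= refresh_at,
        // The window starts before the representable range: refresh right away.
        None => true,
    }
}
```

Nothing in this check refers to a particular cloud, which is consistent with moving the setting out of an `s3_`-prefixed parameter.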

Note that I recently added server-side credentials vending support in DirectoryNamespace for all three clouds, so we can now provide a generic solution to vend credentials and test everything end to end for GCP and Azure. However, this PR still keeps the feature limited to AWS; we will do another PR to add GCP and Azure support officially.

@github-actions github-actions bot added the enhancement, python, and java labels Jan 16, 2026
@github-actions

Code Review

P0: Compilation Error in Tests

The tests in rust/lance-io/src/object_store/providers/aws.rs will not compile. The function signatures for from_provider and from_provider_with_initial have changed to remove the Duration parameter, but the test calls still include it:

New signatures:

pub fn from_provider(provider: Arc<dyn StorageOptionsProvider>) -> Self
pub fn from_provider_with_initial(
    provider: Arc<dyn StorageOptionsProvider>,
    initial_options: HashMap<String, String>,
) -> Self

Test calls still have extra Duration argument:

// These calls have 2 or 3 args, but should have 1 or 2
DynamicStorageOptionsCredentialProvider::from_provider(
    mock.clone(),
    Duration::from_secs(300), // <- Should be removed
);

DynamicStorageOptionsCredentialProvider::from_provider_with_initial(
    mock.clone(),
    Duration::from_secs(300), // <- Should be removed
    initial_options,
);

Similarly, StorageOptionsAccessor::with_provider is called with a Duration in documentation/tests but the actual signature only takes one argument.

Please verify this builds and tests pass before merging.
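
For reference, the corrected call shapes can be checked against stubs. The sketch below is not the real `lance-io` code: the trait body is omitted, the struct fields and `MockProvider` are hypothetical, and only the two constructor signatures are copied from the ones quoted above.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the real provider trait; its methods are omitted in this sketch.
pub trait StorageOptionsProvider: Send + Sync {}

/// Stub with the same constructor shapes as the signatures quoted above.
pub struct DynamicStorageOptionsCredentialProvider {
    pub provider: Arc<dyn StorageOptionsProvider>,
    pub initial_options: Option<HashMap<String, String>>,
}

impl DynamicStorageOptionsCredentialProvider {
    pub fn from_provider(provider: Arc<dyn StorageOptionsProvider>) -> Self {
        Self { provider, initial_options: None }
    }

    pub fn from_provider_with_initial(
        provider: Arc<dyn StorageOptionsProvider>,
        initial_options: HashMap<String, String>,
    ) -> Self {
        Self { provider, initial_options: Some(initial_options) }
    }
}

/// Hypothetical mock, standing in for the `mock` value used in the tests.
pub struct MockProvider;
impl StorageOptionsProvider for MockProvider {}

/// The two test calls with the extra `Duration` argument dropped.
pub fn corrected_calls(
    mock: Arc<MockProvider>,
    initial_options: HashMap<String, String>,
) -> (
    DynamicStorageOptionsCredentialProvider,
    DynamicStorageOptionsCredentialProvider,
) {
    let plain = DynamicStorageOptionsCredentialProvider::from_provider(mock.clone());
    let seeded = DynamicStorageOptionsCredentialProvider::from_provider_with_initial(
        mock.clone(),
        initial_options,
    );
    (plain, seeded)
}
```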


P1: Breaking API Change Without Migration Path

The PR removes s3_credentials_refresh_offset_seconds from all public APIs (Python, Java, Rust) without a clear migration path for users who were relying on this parameter. While the new approach of using refresh_offset_millis in storage options is cleaner, existing callers will break without any deprecation period (the parameter is removed outright rather than deprecated with a warning).

Consider adding a migration note to the release notes or emitting a warning if users try to use the old parameter name.
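
One possible shape for such a warning is sketched below. It assumes the old name arrives as a key in a string options map, which may not match how the bindings actually exposed the parameter. The function name `migrate_refresh_offset` and the seconds-to-milliseconds translation are hypothetical; only the two key names come from this review.

```rust
use std::collections::HashMap;

const OLD_KEY: &str = "s3_credentials_refresh_offset_seconds";
const NEW_KEY: &str = "refresh_offset_millis";

/// Removes the old key if present, translates it to the new key when possible,
/// and returns a warning message for the caller to log.
pub fn migrate_refresh_offset(options: &mut HashMap<String, String>) -> Option<String> {
    // No old key means nothing to migrate and nothing to warn about.
    let old_value = options.remove(OLD_KEY)?;
    // Only translate when the new key is absent and the old value parses.
    if !options.contains_key(NEW_KEY) {
        if let Ok(seconds) = old_value.parse::<u64>() {
            options.insert(NEW_KEY.to_string(), seconds.saturating_mul(1000).to_string());
        }
    }
    Some(format!(
        "`{}` is no longer supported; set `{}` in storage options instead",
        OLD_KEY, NEW_KEY
    ))
}
```

Returning the message rather than logging it directly keeps the sketch independent of any particular logging crate.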


Minor: Test Coverage

The new StorageOptionsAccessor has good unit test coverage. The concurrent access test is particularly valuable.

@jackye1995 jackye1995 changed the title from "feat: introduce storage options accessor" to "refactor!: introduce storage options accessor" Jan 16, 2026
jackye1995 and others added 3 commits January 15, 2026 23:53
- Fix s3_test.rs to use storage_options_accessor instead of storage_options
- Fix formatting in python/src/file.rs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
///
/// The returned accessor will always return the provided options.
/// This is useful when credentials don't expire or are managed externally.
pub fn static_options(options: HashMap<String, String>) -> Self {

does with_static_options give more uniformity re: naming?

Renamed StorageOptionsAccessor::static_options() to with_static_options()
for naming consistency with with_provider() and with_initial_and_provider().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@jackye1995 jackye1995 merged commit 5454242 into lance-format:main Jan 16, 2026
26 of 28 checks passed
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 20, 2026
@jackye1995 jackye1995 mentioned this pull request Jan 20, 2026
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 20, 2026
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

3 participants