Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow ObjectStoreProvider to return None (return Result<Option> rather than Result) #3595

Closed
wants to merge 1 commit into from

Conversation

yahoNanJing
Copy link
Contributor

Which issue does this PR close?

Closes #3594.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 23, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #3595 (e855faf) into master (abcb39d) will decrease coverage by 0.00%.
The diff coverage is 20.00%.

@@            Coverage Diff             @@
##           master    #3595      +/-   ##
==========================================
- Coverage   86.06%   86.05%   -0.01%     
==========================================
  Files         300      300              
  Lines       56328    56333       +5     
==========================================
  Hits        48479    48479              
- Misses       7849     7854       +5     
Impacted Files Coverage Δ
datafusion/core/src/datasource/object_store.rs 81.52% <20.00%> (-3.54%) ⬇️
datafusion/common/src/scalar.rs 85.25% <0.00%> (-0.07%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 77.42% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@tustvold
Copy link
Contributor

Why is this necessary, if returning None will just get converted into an error anyway?

@yahoNanJing
Copy link
Contributor Author

Why is this necessary, if returning None will just get converted into an error anyway?

I don't think convert None into an error is a good idea. Otherwise, why don't you just convert catch the error in the provider and convert error into None.

One example usage is apache/datafusion-ballista#260. I think for some usage, returning None is reasonable.

@tustvold
Copy link
Contributor

tustvold commented Sep 23, 2022

convert error into None

Because you want to provide context on what the error was? As far as I see it the provider has more context about why an object store couldn't be found and will produce strictly better errors than ObjectStoreRegistry can in response to Ok(None)

Returning Ok(None) will just get converted into a less useful error by ObjectStoreRegistry than the one the ObjectStoreProvider could have produced. ObjectStoreRegistry can only say "no store found" on receiving Ok(None), the ObjectStoreProvider can say " invalid scheme" or "invalid bucket" or "missing credentials", " needs to be compiled with hdfs feature enabled", etc... Basically I don't see when returning Ok(None) would result in better UX than returning the actual error that prevented an object store being found?

@yahoNanJing
Copy link
Contributor Author

yahoNanJing commented Sep 23, 2022

Returning Ok(None) will just get converted into a less useful error by ObjectStoreRegistry than the one the ObjectStoreProvider could have produced.

The ObjectStoreProvider is a new added way to find an object store. In the future, there may be other ways besides ObjectStoreProvider to find an object store. Then we can add similar logic as follows:

// If no store found, then try to find and register one through XXXXX.
        if store.is_none() {
            store = self.register_through_XXXXX(url)?;
        }

Therefore, I don't think it's a good idea for the provider to throw an error directly if there's no object store find. And we should not mix the None and error in ObjectStoreProvider. It's the ObjectStoreRegistry's duty to decide whether convert None to error.

@tustvold
Copy link
Contributor

tustvold commented Sep 23, 2022

A result is just an option with context on why it is empty? I think I'm missing something, what you describe would be easier without another layer of optionality?

On error you could easily try a different approach, and if that also fails return the union of the errors?

Perhaps we could make this change as and when we add this new discovery mechanism, if it proves necessary. FWIW I'm sceptical that composition within ObjectStoreProvider wouldn't be sufficient, as opposed to adding more options to ObjectStoreRegistry

@yahoNanJing
Copy link
Contributor Author

Could you take a look of apache/datafusion-ballista#260? It's one example usage of the ObjectStoreProvider. I just don't think it's a good idea to mix error and none in the ObjectStoreProvider

@tustvold
Copy link
Contributor

tustvold commented Sep 23, 2022

I already left a comment on it - apache/datafusion-ballista#260 (comment) 😅

It would be superior imo if it matched the scheme and returned an error saying "need to compile with hdfs feature" if a hdfs url was requested but the feature was not enabled at compile time. Similar approaches could be taken for S3, GCP, etc...

@yahoNanJing
Copy link
Contributor Author

It would be superior imo if it matched the scheme and returned an error saying "need to compile with hdfs feature" if a hdfs url was requested but the feature was not enabled at compile time. Similar approaches could be taken for S3, GCP, etc...

For a standalone system, we should know which object stores to be used before running it. Therefore, it's OK to make the decision at the compiling phase by enabling different features. If we want to support multiple object stores, like S3, HDFS, we can just enable multiple features there. And let the ObjectStoreProvider parse and decide which one to return.

@tustvold
Copy link
Contributor

tustvold commented Sep 23, 2022

An error is just an option with more context, if you want to discard that context and use it as an option, you can use Result::ok. I'm still not seeing a use-case for returning Ok(None), the case above would benefit from returning a meaningful error to explain why it couldn't create a store if it couldn't, regardless of how it determines what store to use? Perhaps I'm being thick, but I really don't understand what you are trying to achieve with this change

@alamb
Copy link
Contributor

alamb commented Sep 23, 2022

What if we could encode the same information of Ok(None) as an DataFusionError variant

So if we had

    fn get_by_url(&self, url: &Url) -> Result<dyn ObjectStore> {
      ...
    }

Then the code in ballista might look like:

       // If no store found, then try to find and register one through XXXXX.
        match store {
          Err(DataFusionError::StoreNotFound(_)) => {
            store = self.register_through_XXXXX(url)?;
          }
          _ => return store;

Or something similar

The benefit of this, as @tustvold mentions, is that the reason for the store not being found (like support wasn't compiled in) could be included

That being said, I don't personally have a preference between Result<Option<_>> and if this unblocks work upstream it seems reasonable to me

@tustvold
Copy link
Contributor

tustvold commented Sep 23, 2022

A structured error would be a good solution if we need to respond differently to the not found error 👍, we may be able to keep it even simpler though if all error variants are to be handled in the same manner, which I suspect is the case

@yahoNanJing
Copy link
Contributor Author

Thanks @alamb. The structured error is better than throw a general error. However, I still don't think it should be an error when no object store found through the ObjectStoreProvider. It's the ObjectStoreRegistry's duty to decide when to throw an error and when to continue in case of that we add new ways to find and register object stores in the future beside ObjectStoreProvider.

@tustvold
Copy link
Contributor

tustvold commented Sep 26, 2022

I still don't think it should be an error when no object store found through the ObjectStoreProvider.

What do you mean by not found? I think this is key, I was under the perhaps mistaken impression it was any condition that meant the ObjectStoreProvider was unable to provide an ObjectStore, be it invalid URL, scheme, environment, etc..., basically any error. What is the special "not found" semantic and why is it special?

@yahoNanJing
Copy link
Contributor Author

Hi @tustvold, thanks for point it out. I think the main bifurcation points is the role of the ObjectStoreProvider. The initial purpose of introducing ObjectStoreProvider is to make it act as an supplementary way to find an object store besides the manual registration way. However, if we hope to make ObjectStoreProvider able to find all of the object stores used, then I will agree with your suggestion.

@tustvold
Copy link
Contributor

supplementary way to find an object store besides the manual registration way

Aah, I hadn't understood this. I was viewing ObjectStoreProvider as the way to lazily register object stores on-demand. In particular I had viewed the manual registration as serving the "integrated DBMS" use-case, such as IOx, where buckets, etc... are known up-front, and the ObjectStoreProvider serving the ad-hoc use-case where users might add/remove buckets using DLL statements. So if an ObjectStore isn't already registered, and the call to ObjectStoreProvider fails, that is the error that will be surfaced to the user. Does that make sense?

@yahoNanJing
Copy link
Contributor Author

It also makes sense.

If only we make an agreement of the role of ObjectStoreProvider, I'm OK with the related implementation. Then maybe it's better for us to add some comments for the ObjectStoreProvider

@tustvold
Copy link
Contributor

Sounds good to me, I would be happy to do this if you would like? Sorry for the somewhat protracted discussion, it would appear we were both operating with differing sets of assumptions 😅

@yahoNanJing
Copy link
Contributor Author

Assumption is important😅. It's my fault for not clearing my assumption in the beginning. I'm OK with your assumption, if you can refine the current comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow ObjectStoreProvider to return None (return Result<Option> rather than Result)
5 participants