-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move DynamicFileCatalog
back to core
#10745
Conversation
datafusion/core/Cargo.toml
Outdated
aws-config = "0.101.0" | ||
aws-credential-types = "0.101.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrade aws package to fix the audit failed message:
Crate: rustls
Version: 0.20.9
error: 1 vulnerability found!
Title: `rustls::ConnectionCommon::complete_io` could fall into an infinite loop based on network input
Date: 2024-04-19
ID: RUSTSEC-2024-0336
URL: https://rustsec.org/advisories/RUSTSEC-2024-0336
Severity: 7.5 (high)
Solution: Upgrade to >=0.23.5 OR >=0.22.4, <0.23.0 OR >=0.21.[11](https://github.com/apache/datafusion/actions/runs/9321626823/job/25661037574?pr=10745#step:4:12), <0.22.0
Dependency tree:
rustls 0.20.9
├── tokio-rustls 0.23.4
│ └── hyper-rustls 0.23.2
│ └── aws-smithy-client 0.55.3
│ ├── aws-types 0.55.3
│ │ ├── aws-sig-auth 0.55.3
│ │ │ ├── aws-sdk-sts 0.28.0
│ │ │ │ └── aws-config 0.55.3
│ │ │ │ └── datafusion 38.0.0
│ │ │ │ ├── datafusion-wasmtest 38.0.0
│ │ │ │ ├── datafusion-substrait 38.0.0
│ │ │ │ ├── datafusion-sqllogictest 38.0.0
│ │ │ │ ├── datafusion-proto 38.0.0
│ │ │ │ │ └── datafusion-benchmarks 38.0.0
│ │ │ │ ├── datafusion-examples 38.0.0
│ │ │ │ ├── datafusion-docs-tests 38.0.0
│ │ │ │ └── datafusion-benchmarks 38.0.0
│ │ │ └── aws-sdk-sso 0.28.0
│ │ │ └── aws-config 0.55.3
│ │ ├── aws-sdk-sts 0.28.0
│ │ ├── aws-sdk-sso 0.28.0
│ │ ├── aws-http 0.55.3
│ │ │ ├── aws-sdk-sts 0.28.0
│ │ │ ├── aws-sdk-sso 0.28.0
│ │ │ └── aws-config 0.55.3
│ │ ├── aws-endpoint 0.55.3
│ │ │ ├── aws-sdk-sts 0.28.0
│ │ │ └── aws-sdk-sso 0.28.0
│ │ └── aws-config 0.55.3
│ ├── aws-sdk-sts 0.28.0
│ ├── aws-sdk-sso 0.28.0
│ └── aws-config 0.55.3
├── hyper-rustls 0.[23](https://github.com/apache/datafusion/actions/runs/9321626823/job/25661037574?pr=10745#step:4:24).2
└── aws-smithy-client 0.55.3
I'm not sure which version is better. Just pick up the latest version for version 0.xxx.xx
.
8c76b92
to
a3fb390
Compare
Hi @alamb, I encountered a dependency issue with After some research, I found that some people encountered a similar error message, and someone mentioned it could be an issue with Tokio's limited support for WASM. I found that the failed crate, However, this also means that this feature will have limited support for WASM. Do you have any ideas about this issue? Thanks. |
0417f89
to
9811979
Compare
e492aa3
to
4654397
Compare
To solve the WASM building issue, I disabled the
After some research, it seems to be a known issue with Xcode 15. Refer to: I'm not sure if it's related to the I have tried the following:
None of these solutions have worked. |
090558d
to
efa9bac
Compare
aws-config = "0.101.0" | ||
aws-credential-types = "0.101.0" | ||
# the default features will cause libunwind error when testing on macos paltform. | ||
object_store = { version = "0.9.1", default-features = false, features = ["aws", "gcp", "http"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't know which default feature causes the error on macOS. I just used trial and error many times, and eventually, it worked. I noticed that we disabled the default features in the root Cargo.toml by #8095, but I can't find the reason. Could someone explain this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be problematic -- I don't think we can add these dependencies to the core datafusion crate without massively increasing the dependency footprint.
The idea is that the core of datafusion can be used from many places (including wasm) that don't want the (very large) additional dependency of the aws sdk. Having the dependencies in the DataFusion CLI is fine as it is the standalone app (rather than a library)
I have checked that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @goldmedal -- it is very impressive you got this to build 🙏
I think in general, I don't think we should add these new dependencies into the core, I think we should split out the features of the dynamic file catalog (which knows how to treat strings as table names and look them up as tables) and the registering / interpreting of URLs
I took a look around in the code, and it seems like one challenge is that the catalog traits (like CatalogProvider
are still in the core crate and thus any new catalog implementation would have to be there.
I think it may be time to break out the catalog code into its own crate and enforce the boundaries a bit more. I wrote up this idea here: #10782
Let me know what you think
aws-config = "0.101.0" | ||
aws-credential-types = "0.101.0" | ||
# the default features will cause libunwind error when testing on macos paltform. | ||
object_store = { version = "0.9.1", default-features = false, features = ["aws", "gcp", "http"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be problematic -- I don't think we can add these dependencies to the core datafusion crate without massively increasing the dependency footprint.
The idea is that the core of datafusion can be used from many places (including wasm) that don't want the (very large) additional dependency of the aws sdk. Having the dependencies in the DataFusion CLI is fine as it is the standalone app (rather than a library)
Converting to draft as we work on separating out the catalog a bit more |
Hi @alamb, is there any update on this topic? Or maybe I can help by splitting out some traits into the new crate |
Hi @goldmedal -- sorry for the lack of update. From my perspective, it is not a good idea to add any more dependencies (even optional) to the core datafusion crate. After thinking about this some more, I think there is a difference between:
Putting the first in the datafusion core makes sense. I think the second comes with many dependencies and thus doesn't make sense. Thus, what do you think about somehow splitting the two pieces of functionalty apart? Maybe via a trait like struct DynamicTableProvider {
...
/// A callback function that is
url_lookup: Arc<dyn UrlLookup>
}
/// Trait for looking up the correct object store instance based on url
pub trait UrlLookup{
fn lookup(&self, url: &Url) -> Result<Arc<dyn ObjectStore>>;
} Then we could put DynamicTableProvider in the core (with a default handler that knew how to look up file:// urls, for example) but leave the s3 specicific stuff in datafusion-cli Thank you for working on this PR / IDE |
I think it makes sense to me. As far as I know, DuckDB also only supports reading local files by default. You need to install an additional extension, |
I think we should close this PR to keep the discussions. Then, I'll create another PR for |
I think this is what https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.register_object_store is used for And access to the remote store is hanlded via and |
I think this is a good idea Given how this project has grown, I think it might also be valuable to file a ticket describing the high level idea too -- are you able to do so? I can do so too, but likely won't have a chance until tomorrow |
Sure, I can file a ticket for it. |
I have filed #10986. Let's move the follow up discussion to it. |
Which issue does this PR close?
No specific issue but mentioned here #4850 (comment)
Rationale for this change
Discussed with @alamb here. #10619 (comment)
What changes are included in this PR?
Move DynamicFileCatalog and object storage-related features from CLI to the DataFusion core.
Are these changes tested?
No, just move original tests to core.
Are there any user-facing changes?
No