FileStream requires fake ObjectStore when ParquetFileReaderFactory is used #4533
Comments
FWIW this is to support serialization of
Would it work to just inline the
Unfortunately not. There is yet another way, a bit more intrusive, but in the long term it might be worthwhile. DataFusion could have its own trait for
The reason why I prefer a factory of
Could you tell me what the best practices are for interacting with
What do you think about moving schema inference into scan and removing it from
But we could change that? All the
No objection in principle, but I'm sceptical that introducing more indirection is either necessary or desirable. We already have far more factories, providers, etc. than I think is strictly necessary, and it makes reasoning about the code incredibly hard.
I honestly don't know. I believe @alamb is currently working on making the state/config slightly less impenetrable.
I don't think this is possible, as planning needs to know the schema. In general, though, performing schema inference per query is very expensive, especially for non-parquet data. I strongly recommend investing in some sort of catalog to store this data.
It should be possible, if we would make FileOpener open accept
That route should hopefully unify
Is there an open issue/PR for that work? Maybe I could suggest adding a unique id to identify requests in TableProvider scan operations.
I thought that maybe the schema could be kept (cached) in TableProvider implementations, but it would be exposed only through the scan operation. For example, FileScanConfig holds a reference to the schema. Sure, it could be solved differently, but in my case schema inference is not as bad as it would seem to be.
#4601 PTAL
I am tracking some work in #4349
Bump, what do you think? In that case, schema inference would be a detail of the scan operation. It could be inferred ahead of time or on every query.
Are you suggesting not having schema information in the LogicalPlan, and only inferring the schema on lowering to the physical plan? This feels like a fairly fundamental change?
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am using `ParquetExec` in combination with `ParquetFileReaderFactory` to bypass `ObjectStore`. In order to create a `FileScanConfig`, I need to fill in a fake parameter for `object_store_url`; later, `FileStream` creation fails because it tries to fetch an `ObjectStore` that doesn't exist. Right now I work around that problem by creating a fake `ObjectStore` that is never used.
Describe the solution you'd like
One solution that comes to mind is making `FileOpener` self-contained by combining each opener with its `ObjectStore` and `FileMeta`. There are some challenges with that solution. For example, `ParquetFileReaderFactory` wants to own the `FileMeta` when creating a reader, which would require either cloning `FileMeta` or taking `self` in `FileOpener` so that `FileMeta` can be moved out of it. Alternatively, we could put `FileMeta` behind a shared pointer, but then we cannot move `ObjectMeta` out of it.
One positive outcome might be that, in the future, files with different openers could be processed within a single `FileStream`.
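To make the idea more concrete, here is a rough, standalone sketch of the "take self" variant. Everything in it is a simplified stand-in I made up for illustration: the `ObjectStore`, `FileMeta`, `ObjectMeta` and `FileOpener` definitions below are not the real DataFusion or `object_store` types, and `ReaderFactory`, `StoreOpener`, `FactoryOpener`, `file_meta` and `open_all` are hypothetical names. It only shows the ownership pattern, not a proposed API.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins; NOT the real DataFusion/object_store types.
#[derive(Debug, Clone, PartialEq)]
struct ObjectMeta {
    location: String,
    size: usize,
}

#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    object_meta: ObjectMeta,
}

trait ObjectStore {
    fn get(&self, location: &str) -> Vec<u8>;
}

// Toy store that just echoes the requested location.
struct InMemoryStore;

impl ObjectStore for InMemoryStore {
    fn get(&self, location: &str) -> Vec<u8> {
        format!("store:{}", location).into_bytes()
    }
}

// Hypothetical factory that wants to own the FileMeta, as described above.
trait ReaderFactory {
    fn create_reader(&self, meta: FileMeta) -> Vec<u8>;
}

struct CustomFactory;

impl ReaderFactory for CustomFactory {
    fn create_reader(&self, meta: FileMeta) -> Vec<u8> {
        let ObjectMeta { location, size } = meta.object_meta; // moved out, no clone
        format!("custom:{}:{}", location, size).into_bytes()
    }
}

// Self-contained opener: it carries everything it needs and is consumed by
// `open`, so owned fields such as FileMeta can be moved out.
trait FileOpener {
    fn open(self: Box<Self>) -> Vec<u8>;
}

// Opener that reads through an object store.
struct StoreOpener {
    store: Arc<dyn ObjectStore>,
    meta: FileMeta,
}

impl FileOpener for StoreOpener {
    fn open(self: Box<Self>) -> Vec<u8> {
        self.store.get(&self.meta.object_meta.location)
    }
}

// Opener that bypasses any object store and hands FileMeta to a factory.
struct FactoryOpener {
    factory: Arc<dyn ReaderFactory>,
    meta: FileMeta,
}

impl FileOpener for FactoryOpener {
    fn open(self: Box<Self>) -> Vec<u8> {
        let FactoryOpener { factory, meta } = *self; // move fields out of the box
        factory.create_reader(meta)
    }
}

fn file_meta(location: &str, size: usize) -> FileMeta {
    FileMeta {
        object_meta: ObjectMeta { location: location.to_string(), size },
    }
}

// A stream-like driver that does not care which kind of opener it is given.
fn open_all(openers: Vec<Box<dyn FileOpener>>) -> Vec<String> {
    openers
        .into_iter()
        .map(|opener| String::from_utf8(opener.open()).expect("valid UTF-8"))
        .collect()
}
```

With this shape, `open_all` can process a `StoreOpener` and a `FactoryOpener` in the same list, and only the former ever needs an object store, so no fake store is required. The cost is that an opener becomes a one-shot value per file rather than a reusable object.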
Please share your thoughts, I really wouldn't mind if there is a simpler way to improve that situation.