-
Notifications
You must be signed in to change notification settings - Fork 791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce Public Parquet API #1032
Comments
I don't know either -- the idea of marking everything lower level as private and then making it re-pub as use cases arise seems reasonable to me, though a bit of an annoying crowd sourcing exercise (annoying to the crowd that is)
I think @sunchao or @nevi-me perhaps are the most knowledgeable about the goals of the crate so perhaps they would want to comment |
Crowd here. Just letting you know that sqlite2parquet required adding the "experimental" flag to upgrade to 8.0.0, in order to get |
@asayers thank you for reporting, I'll take a look later today and see what we can do 👍 |
FWIW I make use of all levels of the api through FileReader, ColumnReader and even PageReader |
There will be some natural tension between exposing all internals of the parquet reader for advanced usecases and the ability to make non-breaking / minimally breaking changes to its implementation. As long as we are deliberate as we find the balance I think we'll be good |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current API for the parquet crate is rather large, and exposes quite a lot of implementation detail.
This has a couple of implications:
Some examples of this
util
module contains all sorts of random stuff - a hash implementation, maths functions, memory tracking, etc...compression
moduledata_type::AsBytes
,data_type::SliceAsBytes
,data_type::SliceAsBytesDataType
data_type::DataType
,ColumnReaderImpl
,RecordReader
schema::types::to_thrift
Describe the solution you'd like
I'm not familiar enough with the design or objectives of the crate to authoritatively weigh in on what should or shouldn't be public, however, it is my observation that a number of the APIs don't appear to be optimised for external consumption.
My personal preference would be to make everything lower than the file-level, i.e.
SerializedFileReader
,ParquetFileArrowReader
,RowIter
crate-local, as this would have the benefit of being pretty unambiguous and easy to communicate and maintain.I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs made private could be made public again in a point-release based on user feedback.
I think this sort of touches on the objectives for the crate, is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format. If the latter, this change would be at odds with it, but I'm not sure this is the case?
Any changes would obviously need to be made in a major arrow release, the next of which I believe is in January 2022 (@alamb could maybe confirm) and so there is no immediate urgency, but I wanted to maybe start the discussion
Relates to #171
The text was updated successfully, but these errors were encountered: