Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Public Parquet API #1032

Closed
tustvold opened this issue Dec 11, 2021 · 6 comments · Fixed by #1134 or #1244
Closed

Reduce Public Parquet API #1032

tustvold opened this issue Dec 11, 2021 · 6 comments · Fixed by #1134 or #1244
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@tustvold
Copy link
Contributor

tustvold commented Dec 11, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current API for the parquet crate is rather large, and exposes quite a lot of implementation detail.

This has a couple of implications:

  • It complicates iterating on the crate without making breaking changes to public APIs
  • It adds to user's cognitive load as users have to work out what APIs to use

Some examples of this

  • The util module contains all sorts of random stuff - a hash implementation, maths functions, memory tracking, etc...
  • The compression module
  • data_type::AsBytes, data_type::SliceAsBytes, data_type::SliceAsBytesDataType
  • data_type::DataType, ColumnReaderImpl, RecordReader
  • schema::types::to_thrift

Describe the solution you'd like

I'm not familiar enough with the design or objectives of the crate to authoritatively weigh in on what should or shouldn't be public, however, it is my observation that a number of the APIs don't appear to be optimised for external consumption.

My personal preference would be to make everything lower than the file-level, i.e. SerializedFileReader, ParquetFileArrowReader, RowIter crate-local, as this would have the benefit of being pretty unambiguous and easy to communicate and maintain.

I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs made private could be made public again in a point-release based on user feedback.

I think this sort of touches on the objectives for the crate, is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format. If the latter, this change would be at odds with it, but I'm not sure this is the case?

Any changes would obviously need to be made in a major arrow release, the next of which I believe is in January 2022 (@alamb could maybe confirm) and so there is no immediate urgency, but I wanted to maybe start the discussion

Relates to #171

@alamb
Copy link
Contributor

alamb commented Dec 13, 2021

I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs made private could be made public again in a point-release based on user feedback.

I don't know either -- the idea of marking everything lower level as private and then making it re-pub as use cases arise seems reasonable to me, though a bit of an annoying crowd sourcing exercise (annoying to the crowd that is)

I think this sort of touches on the objectives for the crate, is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format. If the latter, this change would be at odds with it, but I'm not sure this is the case?

I think @sunchao or @nevi-me perhaps are the most knowledgeable about the goals of the crate so perhaps they would want to comment

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 5, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 5, 2022
alamb pushed a commit that referenced this issue Jan 10, 2022
…) (#1134)

* Move more parquet functionality behind experimental feature flag (#1032)

* Fix logical conflicts
@alamb alamb added the parquet Changes to the parquet crate label Jan 20, 2022
@asayers
Copy link
Contributor

asayers commented Jan 28, 2022

Crowd here. Just letting you know that sqlite2parquet required adding the "experimental" flag to upgrade to 8.0.0, in order to get DataType, ByteArray, FixedLenByteArray, and Int96.

@tustvold
Copy link
Contributor Author

@asayers thank you for reporting, I'll take a look later today and see what we can do 👍

@alamb
Copy link
Contributor

alamb commented Jan 28, 2022

@zeevm
Copy link
Contributor

zeevm commented Feb 6, 2022

FWIW I make use of all levels of the api through FileReader, ColumnReader and even PageReader

@alamb
Copy link
Contributor

alamb commented Feb 7, 2022

There will be some natural tension between exposing all internals of the parquet reader for advanced usecases and the ability to make non-breaking / minimally breaking changes to its implementation. As long as we are deliberate as we find the balance I think we'll be good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
4 participants