Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParquetReader.RowGroups property #509

Merged
merged 7 commits into from
Jun 3, 2024

Conversation

danielearwicker
Copy link
Contributor

@danielearwicker danielearwicker commented May 4, 2024

Row group readers are created up-front and held in a list, so the OpenRowGroupReader method is just a wrapper around list[n].

There is a note in the reader's Dispose method that says that although dispose is not required, it may be in the future so clients so clients have to take on extra ceremony. But the current design is super fast and efficient even with very large numbers of row groups, so it shouldn't be necessary.

By clarifying that Dispose isn't required, this enables a simple ParquetReader.RowGroups property that exposes a read-only list of row group readers. This then enables Linq chained expressions to operate on row groups, use Reverse to read from the end, Where to filter on column stats, and ToAsyncEnumerable().SelectManyAwait to deserialise into a stream of objects. Can do this with multiple sorted parquet sources and then merge them with SortedMergeBy, before batching them back into row groups and directing them to a destination stream. With the right extension methods a whole operation like this could be written declaratively.

Rather than causing a breaking change to existing clients I've added an interface that omits Dispose but retained it on the class, so the RowGroups property (which uses the non-disposable interface) makes it clear that disposal is unnecessary.

@danielearwicker
Copy link
Contributor Author

Noticed that the macOS tests are failing right now, no Apple M1 binary in a certain .jar. I did a super-hacky fix for it:

jar uf ../parquet-tools/parquet-tools-1.9.0.jar org/xerial/snappy/native/Mac/aarch64/libsnappyjava.dylib

i.e. add the missing binary to the existing jar. It now appears to work, except on 3.1 where there's a separate issue possible because that only has X64 which isn't being installed?

@aloneguid
Copy link
Owner

the build should be OK now, just merging here from master...

@aloneguid aloneguid self-requested a review May 22, 2024 10:54
@aloneguid
Copy link
Owner

The build succeeds now. I have no issues accepting this, however you might want to add tests with use cases to guarantee this interface is not forgotten and removed during next refactoring session? Custom .jar can be removed as well.

@aloneguid
Copy link
Owner

I like the .jar hack though ;) But will build latest from official sources later.

@danielearwicker
Copy link
Contributor Author

Cool, I've added a simple test for the property and reverted the .jar change.

Copy link
Owner

@aloneguid aloneguid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@aloneguid aloneguid merged commit 26389c4 into aloneguid:master Jun 3, 2024
9 checks passed
@aloneguid
Copy link
Owner

pusing to pre-release -pre.6 to be included in 4.24

@aloneguid aloneguid added this to the 4.24.0 milestone Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants