
Conversation

@julienledem (Member)

This will improve reading big datasets with a large schema (thousands of columns).
Instead, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.
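
For context, a minimal sketch of how a job could opt into the behavior this change enables. It assumes the `parquet.task.side.metadata` property, the `ParquetMetadataConverter.SKIP_ROW_GROUPS` metadata filter, and the `ParquetFileReader.readFooter(Configuration, Path, MetadataFilter)` overload introduced around this patch; exact package and method names may differ between releases.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.format.converter.ParquetMetadataConverter;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;

public class TaskSideMetadataExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask the input format to defer rowgroup metadata reading to the tasks
    // instead of materializing every footer on the client side.
    conf.setBoolean("parquet.task.side.metadata", true);

    // On the client, only the file-level metadata (schema, key/value pairs)
    // is needed, so rowgroups can be skipped when reading the footer.
    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, new Path(args[0]), ParquetMetadataConverter.SKIP_ROW_GROUPS);
    System.out.println(footer.getFileMetaData().getSchema());
  }
}
```

The design trade-off is that the client no longer pays the memory and deserialization cost of thousands of column chunks' worth of metadata; each task pays only for the single file it processes.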

Contributor

fix comment: the range should read `[ startOffset, endOffset )`
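
The point of the comment fix is that the range is half-open: the start offset is included and the end offset is excluded. A minimal illustration of that convention (the `OffsetRange` class and its members are hypothetical, not parquet-mr code):

```java
// Hypothetical illustration of the half-open convention [startOffset, endOffset):
// startOffset is included, endOffset is excluded.
final class OffsetRange {
  final long startOffset; // first byte of the range (inclusive)
  final long endOffset;   // one past the last byte (exclusive)

  OffsetRange(long startOffset, long endOffset) {
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  boolean contains(long offset) {
    return startOffset <= offset && offset < endOffset;
  }

  long length() {
    return endOffset - startOffset;
  }
}
```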

Member Author

thx

@julienledem changed the title from "Avoid reading rowgroup metadata in memory on the client side." to "PARQUET-84: Avoid reading rowgroup metadata in memory on the client side." on Sep 3, 2014
@julienledem force-pushed the skip_reading_row_groups branch from 7957a92 to 5b6bd1b on September 3, 2014 00:32
Contributor

fix comment here as well: `endOffset )` is exclusive

@tsdeng (Contributor) commented Sep 4, 2014

LGTM!

@julienledem (Member Author)

@tsdeng and the build is green!

@asfgit closed this in 5dafd12 on Sep 5, 2014
tongjiechen pushed a commit to tongjiechen/incubator-parquet-mr that referenced this pull request Oct 8, 2014
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

This will improve reading big datasets with a large schema (thousands of columns).
Instead, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.

Author: julien <julien@twitter.com>

Closes apache#45 from julienledem/skip_reading_row_groups and squashes the following commits:

ccdd08c [julien] fix parquet-hive
24a2050 [julien] Merge branch 'master' into skip_reading_row_groups
3d7e35a [julien] adress review feedback
5b6bd1b [julien] more tests
323d254 [julien] sdd unit tests
f599259 [julien] review feedback
fb11f02 [julien] fix backward compatibility check
2c20b46 [julien] cleanup readFooters methods
3da37d8 [julien] fix read summary
ab95a45 [julien] cleanup
4d16df3 [julien] implement task side metadata
9bb8059 [julien] first stab at integrating skipping row groups
@julienledem deleted the skip_reading_row_groups branch on October 30, 2014 23:30
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

(Same commit message and squashed commit list as the backport referenced above.)

Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-hadoop/src/test/java/parquet/hadoop/example/TestInputOutputFormat.java
Resolution:
    Conflicts were from whitespace changes and strict type checking (not
    backported). Removed dependence on strict type checking.
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

(Same commit message, squashed commit list, and conflict resolution as the Feb 6, 2015 backport above.)
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Aug 1, 2022
apache#43 added the logic to return null when `compressedPages` becomes empty. However, this is not correct with async IO enabled, since the first page may not have been read yet when the method is called.

This fixes it by adding an `isFinished` variable to indicate whether all the pages have been consumed in the `ColumnChunkPageReadStore`. In addition, this also adds a few pre-condition checks to make sure the object does not end up in an invalid state.
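
A minimal, self-contained sketch of the pattern described above, not the actual parquet-mr code; the class and members below are hypothetical stand-ins for the page queue inside `ColumnChunkPageReadStore`:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the "isFinished" pattern: return null only once all
// pages have been consumed, rather than whenever the queue happens to be empty
// (which can occur before async IO has delivered the first page).
class PageQueueSketch<T> {
  private final Queue<T> compressedPages = new ArrayDeque<>();
  private int remainingPages;          // pages still expected by the reader
  private boolean isFinished = false;  // true once every page was handed out

  PageQueueSketch(int totalPages) {
    this.remainingPages = totalPages;
  }

  synchronized void offer(T page) {
    if (isFinished) {
      throw new IllegalStateException("offer() after all pages were consumed");
    }
    compressedPages.add(page);
    notifyAll(); // wake up a reader waiting for the next page
  }

  synchronized T readPage() throws InterruptedException {
    if (isFinished) {
      return null; // all pages consumed: the only case where null is valid
    }
    while (compressedPages.isEmpty()) {
      wait(); // with async IO the next page may simply not have arrived yet
    }
    T page = compressedPages.poll();
    if (--remainingPages == 0) {
      isFinished = true;
    }
    return page;
  }
}
```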
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Sep 16, 2022
Follow-up of apache#45. This fixes the pre-condition check of the `getPageValueCount` method.
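
For illustration only, a hypothetical shape of such a pre-condition check; the real `getPageValueCount` lives in parquet-mr and its exact condition may differ:

```java
// Hypothetical, self-contained illustration of a guarded accessor: fail fast
// instead of returning a count when the reader is in an invalid state.
class PageValueCountSketch {
  private final int currentPageValueCount;
  private final boolean isFinished;

  PageValueCountSketch(int currentPageValueCount, boolean isFinished) {
    this.currentPageValueCount = currentPageValueCount;
    this.isFinished = isFinished;
  }

  int getPageValueCount() {
    if (isFinished) {
      throw new IllegalStateException(
          "getPageValueCount() called after all pages were consumed");
    }
    return currentPageValueCount;
  }
}
```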