
Conversation

@julienledem (Member)

This will improve reading big datasets with a large schema (thousands of columns).
Instead, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.
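
For context, a minimal sketch of how a job could opt into the behavior this change enables. It assumes the `parquet.task.side.metadata` property, the `ParquetMetadataConverter.SKIP_ROW_GROUPS` metadata filter, and the `ParquetFileReader.readFooter(Configuration, Path, MetadataFilter)` overload introduced around this patch; exact package and method names may differ between releases.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.format.converter.ParquetMetadataConverter;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;

public class TaskSideMetadataExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask the input format to defer rowgroup metadata reading to the tasks
    // instead of materializing every footer on the client side.
    conf.setBoolean("parquet.task.side.metadata", true);

    // On the client, only the file-level metadata (schema, key/value pairs)
    // is needed, so rowgroups can be skipped when reading the footer.
    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, new Path(args[0]), ParquetMetadataConverter.SKIP_ROW_GROUPS);
    System.out.println(footer.getFileMetaData().getSchema());
  }
}
```

The design trade-off is that the client no longer pays the memory and deserialization cost of thousands of column chunks' worth of metadata; each task pays only for the single file it processes.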

Contributor

fix comment: the range should read `[ startOffset, endOffset )`
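
The point of the comment fix is that the range is half-open: the start offset is included and the end offset is excluded. A minimal illustration of that convention (the `OffsetRange` class and its members are hypothetical, not parquet-mr code):

```java
// Hypothetical illustration of the half-open convention [startOffset, endOffset):
// startOffset is included, endOffset is excluded.
final class OffsetRange {
  final long startOffset; // first byte of the range (inclusive)
  final long endOffset;   // one past the last byte (exclusive)

  OffsetRange(long startOffset, long endOffset) {
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  boolean contains(long offset) {
    return startOffset <= offset && offset < endOffset;
  }

  long length() {
    return endOffset - startOffset;
  }
}
```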

Member Author

thx

@julienledem changed the title from "Avoid reading rowgroup metadata in memory on the client side." to "PARQUET-84: Avoid reading rowgroup metadata in memory on the client side." on Sep 3, 2014
@julienledem force-pushed the skip_reading_row_groups branch from 7957a92 to 5b6bd1b on September 3, 2014 00:32
Contributor

fix comment here as well: `endOffset )` is exclusive

@tsdeng (Contributor) commented Sep 4, 2014

LGTM!

@julienledem (Member Author)

@tsdeng and the build is green!

@asfgit closed this in 5dafd12 on Sep 5, 2014
tongjiechen pushed a commit to tongjiechen/incubator-parquet-mr that referenced this pull request Oct 8, 2014
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

This will improve reading big datasets with a large schema (thousands of columns).
Instead, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.

Author: julien <julien@twitter.com>

Closes apache#45 from julienledem/skip_reading_row_groups and squashes the following commits:

ccdd08c [julien] fix parquet-hive
24a2050 [julien] Merge branch 'master' into skip_reading_row_groups
3d7e35a [julien] adress review feedback
5b6bd1b [julien] more tests
323d254 [julien] sdd unit tests
f599259 [julien] review feedback
fb11f02 [julien] fix backward compatibility check
2c20b46 [julien] cleanup readFooters methods
3da37d8 [julien] fix read summary
ab95a45 [julien] cleanup
4d16df3 [julien] implement task side metadata
9bb8059 [julien] first stab at integrating skipping row groups
@julienledem deleted the skip_reading_row_groups branch on October 30, 2014 23:30
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

(Same commit message and squashed commit list as the backport referenced above.)

Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-hadoop/src/test/java/parquet/hadoop/example/TestInputOutputFormat.java
Resolution:
    Conflicts were from whitespace changes and strict type checking (not
    backported). Removed dependence on strict type checking.
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

(Same commit message, squashed commit list, and conflict resolution as the Feb 6, 2015 backport above.)
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Aug 1, 2022
apache#43 added the logic to return null when `compressedPages` becomes empty. However, this is not correct with async IO enabled, since the first page may not have been read yet when the method is called.

This fixes it by adding an `isFinished` variable to indicate whether all the pages have been consumed in the `ColumnChunkPageReadStore`. In addition, this also adds a few pre-condition checks to make sure the object does not end up in an invalid state.
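
A minimal, self-contained sketch of the pattern described above, not the actual parquet-mr code; the class and members below are hypothetical stand-ins for the page queue inside `ColumnChunkPageReadStore`:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the "isFinished" pattern: return null only once all
// pages have been consumed, rather than whenever the queue happens to be empty
// (which can occur before async IO has delivered the first page).
class PageQueueSketch<T> {
  private final Queue<T> compressedPages = new ArrayDeque<>();
  private int remainingPages;          // pages still expected by the reader
  private boolean isFinished = false;  // true once every page was handed out

  PageQueueSketch(int totalPages) {
    this.remainingPages = totalPages;
  }

  synchronized void offer(T page) {
    if (isFinished) {
      throw new IllegalStateException("offer() after all pages were consumed");
    }
    compressedPages.add(page);
    notifyAll(); // wake up a reader waiting for the next page
  }

  synchronized T readPage() throws InterruptedException {
    if (isFinished) {
      return null; // all pages consumed: the only case where null is valid
    }
    while (compressedPages.isEmpty()) {
      wait(); // with async IO the next page may simply not have arrived yet
    }
    T page = compressedPages.poll();
    if (--remainingPages == 0) {
      isFinished = true;
    }
    return page;
  }
}
```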
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Sep 16, 2022
Follow-up of apache#45. This fixes the pre-condition check of the `getPageValueCount` method.
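
For illustration only, a hypothetical shape of such a pre-condition check; the real `getPageValueCount` lives in parquet-mr and its exact condition may differ:

```java
// Hypothetical, self-contained illustration of a guarded accessor: fail fast
// instead of returning a count when the reader is in an invalid state.
class PageValueCountSketch {
  private final int currentPageValueCount;
  private final boolean isFinished;

  PageValueCountSketch(int currentPageValueCount, boolean isFinished) {
    this.currentPageValueCount = currentPageValueCount;
    this.isFinished = isFinished;
  }

  int getPageValueCount() {
    if (isFinished) {
      throw new IllegalStateException(
          "getPageValueCount() called after all pages were consumed");
    }
    return currentPageValueCount;
  }
}
```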