Skip to content

Conversation

@rgruener
Copy link

Adds way to control the min/max amount of rows to pass when checking on the block size instead of hard coded values.

public static final boolean DEFAULT_IS_DICTIONARY_ENABLED = true;
public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming a public constants is a backward incompatible change. How about keeping the original constants as an alias for the new constants, mark and document that they are deprecated with a reference to the new names, and use the new constants in the file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally forgot about that, fixed.

flushRowGroupToStore();
initStore();
recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
recordCountForNextMemCheck = min(max(props.getMinRowCountForBlockSizeCheck(), recordCount / 2), props.getMaxRowCountForBlockSizeCheck());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, it seems that the local constants MINIMUM_RECORD_COUNT_FOR_CHECK and MAXIMUM_RECORD_COUNT_FOR_CHECK are not needed anymore. Could you please remove them? (recordCountForNextMemCheck should be initialized by using the corresponding value from ParquetProperties.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I also removed the unused rowGroupSize instance variable

@rdblue
Copy link
Contributor

rdblue commented Jun 18, 2018

@rgruener, @gszadovszky, there's already an open PR for this and a duplicate issue: PARQUET-869. The PR is #470, which I've updated for @pradeepg26.

Sorry for the confusion and duplication, but I'd like to continue with that one since the contribution from Pradeep was made earlier. I just shouldn't commit it myself because I ended up making the changes I requested when I backported this to our Parquet version.

@rgruener
Copy link
Author

Ah, I wish I saw that before. The duplicate JIRA issue should probably be closed. I am fine having that one continue, we just would like this change to get into the next release since we have been patching this in ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants