Skip to content

Conversation

@julienledem
Copy link
Member

This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should tweak the initialSize here.
levels should get a tiny initial size (100 bytes?) in case they are always null or always defined.

@julienledem
Copy link
Member Author

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should 5 be configurable too?

we could also make CapacityByteArrayOutputStream abstract or take as an argument a slab size calculator etc. so that we can plug in different behaviors here. what do you think?

@isnotinvain
Copy link
Contributor

Do you want to tweak the initial size here as well?
Do you think we should try this out internally and see if it's an improvement first?

@isnotinvain
Copy link
Contributor

@julienledem ping!

@isnotinvain
Copy link
Contributor

Sent a PR against this PR here: julienledem#2

@isnotinvain
Copy link
Contributor

@tsdeng ok, this PR is now ready to review, it's got both @julienledem's changes and mine as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm this, not used.

…nledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java
	parquet-hadoop/src/test/java/parquet/hadoop/TestColumnChunkPageWriteStore.java
@isnotinvain
Copy link
Contributor

+1, lets merge when the tests are green

@isnotinvain
Copy link
Contributor

I'm running these tests here:
https://github.com/isnotinvain/incubator-parquet-mr/pull/2

in case we have to wait a long time for the travis CI apache queue.

@isnotinvain
Copy link
Contributor

Tests passed! merging now...

@asfgit asfgit closed this in d084ad2 Mar 5, 2015
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

Author: Alex Levenson <alexlevenson@twitter.com>
Author: julien <julien@twitter.com>
Author: Julien Le Dem <julien@twitter.com>

Closes apache#98 from julienledem/avoid_wasting_64K_per_empty_buffer and squashes the following commits:

b0200dd [julien] add license
a1b278e [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
5304ee1 [julien] remove unused constant
81e399f [julien] Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
ccf677d [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
37148d6 [Julien Le Dem] Merge pull request #2 from isnotinvain/PR-98
b9abab0 [Alex Levenson] Address Julien's comment
965af7f [Alex Levenson] one more typo
9939d8d [Alex Levenson] fix typos in comments
61c0100 [Alex Levenson] Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well
a257ee4 [Alex Levenson] Improve IndexOutOfBoundsException message
64d6c7f [Alex Levenson] update comments
8b54667 [Alex Levenson] Don't use CapacityByteArrayOutputStream for writing page chunks
6a20e8b [Alex Levenson] Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead
3a0f8e4 [Alex Levenson] Use simpler settings for column chunk writer
b2736a1 [Alex Levenson] Some cleanup in CapacityByteArrayOutputStream
1df4a71 [julien] refactor CapacityByteArray to be aware of page size
95c8fb6 [julien] avoid wasting 64K per empty buffer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants