
Conversation

@nielsbasjes
Contributor

When writing Avro records into Parquet files I see a lot of logging about the columns that have been written. In the scenario I'm working on (streaming with Apache Flink, many files, a large schema) this becomes too much logging.

I added an additional option when creating a ParquetWriter instance (such as the AvroParquetWriter) that lets you control this logging by adding ".withColumnInfoLogging(false)" to the builder.

I would like to write a unit test for this, and I could use some directions on how to capture the logging output in a unit test.

        ParquetWriter<Record> parquetWriter = AvroParquetWriter
                .<Record>builder(outputFile)
                .withSchema(Record.getClassSchema())
                .withCompressionCodec(CompressionCodecName.GZIP)
                .withDictionaryEncoding(true)
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withColumnInfoLogging(false)
                .build();
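One direction I have been considering is to attach a collecting java.util.logging Handler in the test (Parquet's Log wrapper delegates to java.util.logging). This is an untested sketch; the logger name and the assertion are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Collects log records so a test can assert on what was (or was not) logged. */
class CapturingHandler extends Handler {
  final List<String> messages = new ArrayList<String>();

  @Override public void publish(LogRecord record) { messages.add(record.getMessage()); }
  @Override public void flush() { }
  @Override public void close() { }
}

// In the test (illustrative logger name):
//   Logger logger = Logger.getLogger("org.apache.parquet.hadoop");
//   CapturingHandler handler = new CapturingHandler();
//   logger.addHandler(handler);
//   logger.setLevel(java.util.logging.Level.ALL);
//   ... build the ParquetWriter with .withColumnInfoLogging(false) and write some records ...
//   assertTrue(handler.messages.isEmpty());   // expect no per-column lines
//   logger.removeHandler(handler);
```

If there is a better or more standard way to do this in the Parquet test suite, please point me to it.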

@nielsbasjes
Contributor Author

Note that I deliberately made the default behavior the same as the current version.

@rdblue
Contributor

rdblue commented Jan 12, 2016

@nielsbasjes, thanks for looking into this to fix the annoyance!

I'm wondering if an API-level solution is the right choice, though. Do you think that the logs you're suppressing here would be better as DEBUG logs instead of INFO? That would delegate to the logging framework rather than exposing this option, which I think people would generally prefer to have off by default.

@julienledem, any comment on whether this could be DEBUG instead of INFO?

@nielsbasjes
Contributor Author

From my standpoint I see a few viable scenarios.

  • Simply switch this message to DEBUG instead of INFO. This makes the logging available only to people who are using a custom-built version of Parquet.
  • Add a flag to enable/disable this specific logging (what I did here). This makes this logging optional and settable by the application developer.
  • Make the logging level 'settable' for the entire Parquet library (see the sketch below). This allows application developers to fully control what logging they get without needing to hack and rebuild the stack (which is hard due to things like SemVer and the JDK 8 issues I ran into). I do not yet understand why the logging level was hardcoded in the way it is now. Perhaps this allows the compiler to optimize the logging statements out completely? The only relevant mention I could find is this post: https://groups.google.com/forum/#!topic/parquet-dev/Eb0btS78M5g
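For that third option, this is roughly what I mean from the application side: a minimal sketch assuming the messages go through java.util.logging and the level is not fixed at compile time (the logger name is illustrative).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietParquetLogging {
  // Hold a strong reference so java.util.logging does not garbage-collect the
  // logger (and with it the configured level).
  private static final Logger PARQUET_LOGGER = Logger.getLogger("org.apache.parquet");

  public static void main(String[] args) {
    PARQUET_LOGGER.setLevel(Level.WARNING); // drop INFO and below for the whole library
    // ... build the AvroParquetWriter and write records as usual ...
  }
}
```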

@lw-lin
Contributor

lw-lin commented Jan 14, 2016

Hi @nielsbasjes, thanks for contributing this.

CI seems to be failing due to:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile (default-testCompile) on project parquet-hadoop: Compilation failure: Compilation failure:
[ERROR] /home/travis/build/apache/parquet-mr/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnChunkPageWriteStore.java:[105,43] constructor ColumnChunkPageWriteStore in class org.apache.parquet.hadoop.ColumnChunkPageWriteStore cannot be applied to given types;
[ERROR] required: org.apache.parquet.hadoop.CodecFactory.BytesCompressor,org.apache.parquet.schema.MessageType,org.apache.parquet.bytes.ByteBufferAllocator,boolean
[ERROR] found: org.apache.parquet.hadoop.CodecFactory.BytesCompressor,org.apache.parquet.schema.MessageType,org.apache.parquet.bytes.HeapByteBufferAllocator
[ERROR] reason: actual and formal argument lists differ in length
[ERROR] /home/travis/build/apache/parquet-mr/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnChunkPageWriteStore.java:[164,39] constructor ColumnChunkPageWriteStore in class org.apache.parquet.hadoop.ColumnChunkPageWriteStore cannot be applied to given types;
[ERROR] required: org.apache.parquet.hadoop.CodecFactory.BytesCompressor,org.apache.parquet.schema.MessageType,org.apache.parquet.bytes.ByteBufferAllocator,boolean
[ERROR] found: org.apache.parquet.hadoop.CodecFactory.BytesCompressor,org.apache.parquet.schema.MessageType,org.apache.parquet.bytes.HeapByteBufferAllocator
[ERROR] reason: actual and formal argument lists differ in length

Details are here: https://travis-ci.org/apache/parquet-mr/jobs/101882142#L2900

Could you fix that accordingly, please? Cheers.

@nielsbasjes
Contributor Author

@proflin Yes, I'll fix that. But I think it may be better to wait and see whether @rdblue and @julienledem think this is the right direction for a solution. If they agree it is, I'll fix everything.
Perhaps you can point me to a way this can be unit tested? That is, how can I check the messages that are sent to the logging framework in a unit test?

@rdblue
Contributor

rdblue commented Jan 14, 2016

@nielsbasjes, we've recently changed to using SLF4J as the logging back-end, which should be released in 1.9.0. That means that as long as we haven't optimized the logging statements out of the compiled version, you should be able to control whether you see that info. The statements get optimized out when the if statement is essentially if (false) logger.info(...). But we can just get rid of that if statement and rely on the logging framework.

@nielsbasjes
Contributor Author

For the 'complex' logging statements (like the one I ran into): if we replace
if (false) logger.info(...)
with
if (logger.isInfoEnabled()) logger.info(...)
I expect the changes to be much simpler than my first attempt and the performance impact to be limited.
For the 'simple' statements I think we can rely on the built-in optimizations in combination with SLF4J's message templating feature.
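Concretely, I mean the difference between these two patterns; the class and method names below are illustrative only, not actual Parquet code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingPatterns {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingPatterns.class);

  void logColumnInfo(String columnPath, long valueCount) {
    // 'Simple' statement: SLF4J's parameterized messages defer formatting until
    // the level is enabled, so nothing is computed eagerly and no guard is needed.
    LOG.info("written {} values for column {}", valueCount, columnPath);

    // 'Complex' statement: the argument itself is expensive to build, so keep an
    // explicit runtime guard instead of the old compile-time 'if (INFO)' constant.
    if (LOG.isInfoEnabled()) {
      LOG.info("column details: {}", buildExpensiveSummary(columnPath));
    }
  }

  private String buildExpensiveSummary(String columnPath) {
    return "summary of " + columnPath; // stands in for a costly computation
  }
}
```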
What do you think?

@nielsbasjes
Contributor Author

I found that parquet-common contains an org.apache.parquet.Log class whose documentation explicitly states:

 * Simple wrapper around java.util.logging
 * Adds compile time log level.
 * The compiler removes completely if statements that reference to a false constant
 *
 * <code>
 *   if (DEBUG) LOG.debug("removed by the compiler if DEBUG is a false constant")
 * </code>

To me this "Adds compile time log level" is an unwanted feature.
@julienledem Why was this 'compile time' idea implemented? Performance?
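
As far as I can tell, the hardcoded pattern boils down to the following (my own minimal reconstruction for illustration, not the actual Log class): because the flag is a compile-time constant, the compiler drops the guarded statement entirely when it is false, so neither the call nor its argument exists in the bytecode, which is also why the level cannot be changed at runtime.

```java
public class CompileTimeLevel {
  // A compile-time constant: flipping it requires rebuilding the library.
  public static final boolean DEBUG = false;

  public static void main(String[] args) {
    // Because DEBUG is a constant false, the compiler removes this statement
    // entirely, including the string concatenation and the method call.
    if (DEBUG) System.out.println("expensive: " + buildDebugString());
    System.out.println("done");
  }

  private static String buildDebugString() {
    return "only ever built when DEBUG is a true constant";
  }
}
```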

@nielsbasjes
Contributor Author

Note: I still have to fully test this.

@rdblue
Contributor

rdblue commented Apr 17, 2016

@nielsbasjes, just checking in. If you have time to test this, feel free to ping me for a review.

@julienledem
Member

@nielsbasjes do you want to follow up on this PR?

@nielsbasjes
Contributor Author

Yes.
I'm diving into this one as we speak.
I'll keep you posted on my progress.

@nielsbasjes
Contributor Author

I found that TestCircularReferences triggers a stack overflow, for which I created a separate AVRO ticket:
https://issues.apache.org/jira/browse/AVRO-1923

rdblue and others added 13 commits September 23, 2016 11:26
This implements the read-side compatibility rules for 2-level and 3-level lists in Thrift.

Thrift doesn't allow null elements inside lists, but 3-level lists may have optional elements. This PR adds a property, parquet.thrift.ignore-null-elements, that allows thrift to read lists with optional elements by ignoring nulls. This is off by default, but is provided as an opt-in for compatibility with data written by Hive.

Thrift's schema conversion does not change because a Thrift class (or Scrooge etc.) must be set in a file's metadata or provided when constructing a reader.

This replaces and closes #144.

Author: Ryan Blue <blue@apache.org>

Closes #300 from rdblue/PARQUET-212-fix-thrift-3-level-lists and squashes the following commits:

ac7c405 [Ryan Blue] PARQUET-212: Add tests for list of list cases from PARQUET-364.
356fdb7 [Ryan Blue] PARQUET-212: Rename isElementType => isListElementType.
5d3b094 [Ryan Blue] PARQUET-212: Fix list handling with projection.
b5f207f [Ryan Blue] PARQUET-212: Add Configuration to the ThriftRecordConverter ctor.
b87eb65 [Ryan Blue] PARQUET-212: Add property to ignore nulls in lists.
3d1e92f [Ryan Blue] PARQUET-212: Update thrift reads for LIST compatibility rules.
0bf2b45 [Ryan Blue] PARQUET-212: Read non-thrift files if a Thrift class is supplied.
4e148dc [Ryan Blue] PARQUET-212: Add DirectWriterTest base class.
…odule encoding, column, and hadoop

Code changes now and then, but some corresponding doc comments have been left out of date.

This PR fixes only the doc comments that should have been changed. It should be OK, since no code is touched.

@rdblue could you take a look please? Cheers.

Author: proflin <proflin.me@gmail.com>

Closes #307 from proflin/Minor--Fix-the-mismatch-of-the-parameters-and-their-doc-comments-in-module-encoding,-column,-and-hadoop and squashes the following commits:

34c7b01 [proflin] Minor: Fix the mismatch of the parameters and their doc comments in module encoding, column, and hadoop
… and overwrite the initial value of a method parameter

In org.apache.parquet.schema.MessageTypeParser, for addGroupType() and addPrimitiveType(), the initial value of this parameter t is ignored, and t is overwritten here.

This often indicates a mistaken belief that the write to the parameter will be conveyed back to the caller.

This is a bug found by FindBugs™.

Author: proflin <proflin.me@gmail.com>

Closes #308 from proflin/PARQUET-422 and squashes the following commits:

df1f908 [proflin] PARQUET-422: Fix a potential bug in MessageTypeParser where we ignore and overwrite the initial value of a method parameter
Author: Ryan Blue <blue@apache.org>

Closes #303 from rdblue/PARQUET-393-update-parquet-format-version and squashes the following commits:

0e4c798 [Ryan Blue] PARQUET-393: Add TIME_MICROS and TIMESTAMP_MICROS.
ca4a741 [Ryan Blue] PARQUET-393: Update to parquet-format 2.3.1.
The ticket proposes to consider the case *path.length < o.path.length* in method ColumnDescriptor.compareTo().

Author: proflin <proflin.me@gmail.com>

Closes #314 from proflin/PARQUET-432 and squashes the following commits:

80ba94b [proflin] Addresses PR comments
6ccd00f [proflin] Revert Updates
a4d2a4a [proflin] PARQUET-432: Complete a todo in method ColumnDescriptor.compareTo()
694b76b [proflin] Updates
The code in parquet-cascading is adapted to the API as of Cascading 2.5.3

Some incompatible changes were introduced in Cascading 3.0. This patch forks the parquet-cascading module to also provide a parquet-cascading3 module, which is about identical save for overloads which changed from requiring a Foo<JobConf> to requiring a Foo<? extends JobConf>

Author: Cyrille Chépélov (TP12) <cch@transparencyrights.com>

Closes #284 from cchepelov/try_cascading3 and squashes the following commits:

e7d1304 [Cyrille Chépélov (TP12)] Adding a @deprecated notice on parquet-cascading's remaining classes
05a417d [Cyrille Chépélov (TP12)] cascading2/3: share back TupleWriteSupport.java (accidentally unmerged)
7fff2d4 [Cyrille Chépélov (TP12)] cascading/cascading3: remove duplicates, push common files into parquet-cascading-common23
338a416 [Cyrille Chépélov (TP12)] Removing unwanted file (what?!) + .gitignoring this kind of files
d9f0455 [Cyrille Chépélov (TP12)] TupleEntry#get is now TupleEntry#getObject
a7f490a [Cyrille Chépélov (TP12)] Revert "Missing test conversion to Cascading 3.0"
cc8b870 [Cyrille Chépélov (TP12)] Missing test conversion to Cascading 3.0
2d73512 [Cyrille Chépélov (TP12)] conflicting values can come in one order or the other. Accept both.
33355d5 [Cyrille Chépélov (TP12)] Fix version mismatch (duh!)
7128639 [Cyrille Chépélov (TP12)] non-C locale can break tests implementation (decimal formats)
53aa2f9 [Cyrille Chépélov (TP12)] Adding a parquet-cascading3 module (forking the parquet-cascading module and accounting for API changes)
To produce
> required group User {
    required int64 id;
    **optional** binary email (UTF8);
 }

we should do:
>
Types.requiredGroup()
      .required(INT64).named("id")
      .~~**required** (BINARY).as(UTF8).named("email")~~
      .**optional** (BINARY).as(UTF8).named("email")
      .named("User")

@rdblue @liancheng would you mind taking a look at it when you have time? Thanks!

Author: Liwei Lin <proflin.me@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #317 from proflin/PARQUET-495--Fix-mismatches-in-Types-class-comments and squashes the following commits:

f26d57d [Liwei Lin] PARQUET-495: Fix mismatches in Types class comments
839b458 [proflin] Merge remote-tracking branch 'refs/remotes/apache/master'
This removes the option that redirects stderr to stdout because it
causes git push to hang.

Author: Ryan Blue <blue@apache.org>

Closes #302 from rdblue/PARQUET-410-fix-subprocess-hang and squashes the following commits:

340c316 [Ryan Blue] PARQUET-410: Fix hanging subprocess call in merge script.
This also adds a test to validate that serialization works for all
Binary objects that are already test cases.

Author: Ryan Blue <blue@apache.org>

Closes #305 from rdblue/PARQUET-415-fix-bytebuffer-binary-serialization and squashes the following commits:

4e75d54 [Ryan Blue] PARQUET-415: Fix ByteBuffer Binary serialization.
This PR fixes the args passed to the `String.format()` call.

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #320 from nezihyigitbasi/debug_args and squashes the following commits:

43a6088 [Nezih Yigitbasi] Fix args passed to string format calls
This PR fixes strict mode schema merging. To merge two `PrimitiveType` `t1` and `t2`, they must satisfy the following conditions:

1. `t1` and `t2` have the same primitive type name
1. `t1` and `t2` either
   - don't have original type, or
   - have the same original type
1. If `t1` and `t2` are both `FIXED_LEN_BYTE_ARRAY`, they should have the same length

Also, merged schema now preserves original name if there's any.

Author: Cheng Lian <lian@databricks.com>

Closes #315 from liancheng/fix-strict-schema-merge and squashes the following commits:

a29138c [Cheng Lian] Addresses PR comment
1ac804e [Cheng Lian] Fixes strict schema merging
…pperCase()/toLowerCase

A String is being converted to upper or lowercase, using the platform's default encoding. This may result in improper conversions when used with international characters.

For instance, "TITLE".toLowerCase() in a Turkish locale returns "tıtle", where 'ı' -- without a dot -- is the LATIN SMALL LETTER DOTLESS I character. To obtain correct results for locale insensitive strings, we'd better use toLowerCase(Locale.ENGLISH).

For more information on this, please see:
- http://stackoverflow.com/questions/11063102/using-locales-with-javas-tolowercase-and-touppercase
- http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm
- http://java.sys-con.com/node/46241

This PR changes our use of String.toUpperCase()/toLowerCase() to String.toUpperCase(Locale.*ENGLISH*)/toLowerCase(*Locale.ENGLISH*)

Author: proflin <proflin.me@gmail.com>

Closes #312 from proflin/PARQUET-430 and squashes the following commits:

ed55822 [proflin] PARQUET-430
Currently ParquetOutputFormat.getRecordWriter() contains an unsynchronized lazy initialization of the non-volatile static field *memoryManager*.

Because the compiler or processor may reorder instructions, threads are not guaranteed to see a completely initialized object, when ParquetOutputFormat.getRecordWriter() is called by multiple threads.

This PR makes *memoryManager* volatile to correct the problem.

Author: Liwei Lin <proflin.me@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #313 from proflin/PARQUET-431 and squashes the following commits:

1aa4a44 [Liwei Lin] empty commit to trigger CI
5e94fa3 [Liwei Lin] Remove the volatile modifier for memoryManager
d54bb99 [Liwei Lin] Undo the Deprecated anotation
fd1df4e [Liwei Lin] Adds synchronization around the creation of memoryManager as well as getMemoryManager()
615aa5a [proflin] PARQUET-431
rdblue and others added 27 commits September 23, 2016 11:26
Range filtering should use the row group midpoint and offset filtering
should use the start offset.

Author: Ryan Blue <blue@apache.org>

Closes #337 from rdblue/PARQUET-569-fix-metadata-filter and squashes the following commits:

6171af4 [Ryan Blue] PARQUET-569: Add tests for new offset metadata filter.
3fe2d5e [Ryan Blue] PARQUET-569: Separate metadata filtering for ranges and offsets.
Reads of the `finishCalled` variable are properly synchronized, but writes are not -- so there's some sort of inconsistent synch. going on here. This PR fixes that.

/cc @rdblue can you please take a look?

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #334 from nezihyigitbasi/sc-synch-fix and squashes the following commits:

a85cf0c [Nezih Yigitbasi] Synchronize writes to the finishCalled variable
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes #275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
Added JsonRecordFormatter, which formats SimpleRecords into a structure that can be used with ObjectMapper to create a valid JSON structure. Unit test included.

Author: Reuben Kuhnert <reuben.kuhnert@cloudera.com>

Closes #281 from sircodesalotOfTheRound/fix-parquet-cat and squashes the following commits:

67207ef [Reuben Kuhnert] PARQUET-367: "parquet-cat -j" doesn't show all records.
The Closeable interface states:
> Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.

As InternalParquetRecordWriter implements this interface we should adhere to this contract.

Author: Mark Reddy <mark.l.reddy@gmail.com>

Closes #345 from markreddy/PARQUET-544-adhere-to-closeable-contract and squashes the following commits:

135db9b [Mark Reddy] PARQUET-544: add closed flag to allow for adherence to closeable contract
This fixes how null is handled by `DictionaryFilter` for equals predicates. Null is never in the dictionary and is encoded by the definition level, so the `DictionaryFilter` would never find the value in the dictionary and would incorrectly filter row groups whenever the filter was `col == null`.

Author: Ryan Blue <blue@apache.org>

Closes #348 from rdblue/PARQUET-645-fix-null-dictionary-filter and squashes the following commits:

ae8dd41 [Ryan Blue] PARQUET-645: Fix null handling in DictionaryFilter.
While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers: 6b605a4 and 6b24a1d (mostly avro but a couple of ByteBuffer changes) caused our jobs to slow down a bit.

Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (MB_Millis).

This seems to be due to the encoding / decoding of Strings in the [Binary class](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
[toStringUsingUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L388) - for reads
[encodeUTF8()](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L236) - for writes

With these changes we see around 5% improvement in MB_Millis while running the job on our Hadoop cluster.

Added some microbenchmark details to the jira.

Note that I've left the behavior the same for the avro write path - it still uses CharSequence and the Charset based encoders.

Author: Piyush Narang <pnarang@twitter.com>

Closes #347 from piyushnarang/bytebuffer-encoding-fix-pr and squashes the following commits:

43c5bdd [Piyush Narang] Keep avro on char sequence
2d50c8c [Piyush Narang] Update Binary approach
9e58237 [Piyush Narang] Proof of concept fixes
Author: Ryan Blue <blue@apache.org>

Closes #343 from rdblue/PARQUET-612-test-compression and squashes the following commits:

a5b7dbb [Ryan Blue] PARQUET-612: Add compression codec to FileEncodingsIT.
This can be used by frameworks that use codegen for filtering to avoid
running filters within Parquet.

Author: Ryan Blue <blue@apache.org>

Closes #353 from rdblue/PARQUET-654-add-record-level-filter-option and squashes the following commits:

b497e7f [Ryan Blue] PARQUET-654: Add option to disable record-level filtering.
Fix 404 links in documentation

Author: Nihed MBAREK <nihedmm@gmail.com>

Closes #350 from nihed/patch-1 and squashes the following commits:

4ebe950 [Nihed MBAREK] Update README.md
Predicate push-down will complain when predicates reference columns that aren't in a file's schema. This makes it difficult to implement predicate push-down in engines where schemas evolve because each task needs to process the predicates and prune references to columns not in that task's file. This PR implements predicate evaluation for missing columns, where the values are all null. This allows engines to pass predicates as they are written.

A future commit should rewrite the predicates to avoid the extra work currently done in record-level filtering, but that isn't included here because it is an optimization.

Author: Ryan Blue <blue@apache.org>

Closes #354 from rdblue/PARQUET-389-predicate-push-down-on-missing-columns and squashes the following commits:

b4d809a [Ryan Blue] PARQUET-389: Support record-level filtering with missing columns.
91b841c [Ryan Blue] PARQUET-389: Add missing column support to StatisticsFilter.
275f950 [Ryan Blue] PARQUET-389: Add missing column support to DictionaryFilter.
This fixes:
* parquet-cascading3 should have a libthrift dependency that uses thrift.version
* parquet-cascading3 should have the standard SLF4J dependencies
* Twitter's maven repo is no longer necessary because Parquet uses the Apache maven-thrift-plugin

Author: Ryan Blue <blue@apache.org>

Closes #328 from rdblue/PARQUET-540-fix-cascading3-build and squashes the following commits:

b8e509b [Ryan Blue] PARQUET-540: Add a constant for maven-thrift-plugin version.
c976e04 [Ryan Blue] PARQUET-540: Fix Cascading 3 build thrift and SLF4J.
The Avro implementation needs to check whether the read schema that is
passed by the user (or automatically converted from the file schema)
expects an extra 1-field layer to be returned, which matches the
previous behavior of Avro when reading a 3-level list. Before this
commit, the check was done by testing the structure of the expected list
element type against the repeated group's schema. If they matched, then
Avro assumed that the user expected an extra layer. However, for records
that happened to match (1-field records with a field named "element")
the check could be wrong and would cause exceptions later.

This commit updates the check to convert the file's element schema to
Avro and compare the compatibility of that schema with what was passed
by the user. This checks the entire tree from the element down and gets
the answer right based on the element and its children, not just the
field names on the element.

Author: Ryan Blue <blue@apache.org>

Closes #352 from rdblue/PARQUET-651-improve-is-element-type-check and squashes the following commits:

ad9c1ee [Ryan Blue] PARQUET-651: Undo accidental default setting change.
1efa248 [Ryan Blue] PARQUET-651: Improve Avro's isElementType check.
This relocates the DevNullValuesWriter and ZeroIntegerValuesReader,
which are used but are not related to the boundedint code.

Author: Ryan Blue <blue@apache.org>

Closes #329 from rdblue/PARQUET-543-remove-boundedint and squashes the following commits:

0158c51 [Ryan Blue] PARQUET-543: Update new import in ParquetProperties.
550a1a3 [Ryan Blue] PARQUET-543: Remove unused boundedint package.
We have two out of date lists of committers in .md files in the repo.
This PR changes them to point to the apache website which should stay up to date automatically.

Author: Alex Levenson <alexlevenson@twitter.com>

Closes #355 from isnotinvain/alexlevenson/committers-update and squashes the following commits:

85945b9 [Alex Levenson] Update committers lists to point to apache website
This commit fixes an issue when the number of entries in a column page is larger than the size of an integer. No exception is thrown directly, but the def level is set incorrectly, leading to a null value being returned during read.

Author: Michal Gorecki <goreckim@amazon.com>

Closes #321 from goreckm/int-overflow and squashes the following commits:

d224815 [Michal Gorecki] enhancing exception message
7334be2 [Michal Gorecki] PARQUET-511: Integer overflow when counting values in column.
https://issues.apache.org/jira/browse/PARQUET-668

1. Added option `--disable-crop`
2. Updated `README.md` to reflect changes

Author: djhworld <djharperuk@gmail.com>

Closes #358 from djhworld/master and squashes the following commits:

493c3d0 [djhworld] PARQUET-668: Removed usage instructions from README, replaced with --help flag
696a5e6 [djhworld] PARQUET-668 -> Updated README.md to fix issue in usage string
6cbf59b [djhworld] PARQUET-668 - Provide option to disable auto crop feature in DumpCommand output
…eams

The use case is that I want to reuse an existing listing of files and avoid doing it again when opening streams. This is for cases where filesystem.open is expensive but you have other means of obtaining an input stream for a file.

Author: Robert Kruszewski <robertk@palantir.com>

Closes #357 from robert3005/robertk/allow-reading-footers-from-streams and squashes the following commits:

4d8a54c [Robert Kruszewski] allow reading footers from provided file listing and streams
Author: Alex Levenson <alexlevenson@twitter.com>

Closes #360 from isnotinvain/alexlevenson/committers-update-2 and squashes the following commits:

20782fa [Alex Levenson] Add back + update committers table
### Context:
Parquet is currently structured to choose the appropriate value writer based on the type of the column as well as the Parquet version. As of now, the writer(s) (and hence encoding) for each data type is hard coded in the Parquet source code.

This PR adds support for being able to override the encodings per type via config. That allows users to experiment with various encoding strategies manually as well as enables them to override the hardcoded defaults if they don't suit their use case.

We can override encodings per data type (int32 / int64 / ...).
Something on the lines of:
```
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
```

As an example:
```
"parquet.writer.encoding-override.int32" = "plain"
(Chooses Plain encoding and hence the PlainValuesWriter).
```

When a primary + fallback need to be specified, we can do the following:
```
"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
(Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY encoding as the fallback, and hence creates a FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter)).
```

In such cases we can mandate that the first encoding listed must allow for Fallbacks by implementing [RequiresFallback](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31).

### PR notes:

- Restructured the ValuesWriter creation code. Pulled it out of ParquetProperties into a new class and refactored the flow based on type as it was getting hard to follow and I felt adding the overrides would make it harder. Added a bunch of unit tests to verify the ValuesWriter we create for combinations of type, parquet version and dictionary on / off.
- Added unit tests to verify parsing of the encoding overrides + creation of ValuesWriters based on these overrides.
- Manually tested some encoding overrides scenarios out on Hadoop (both parquet v1, v2).

Author: Piyush Narang <pnarang@twitter.com>

Closes #342 from piyushnarang/master and squashes the following commits:

3ebab28 [Piyush Narang] Remove Configurable
149bb98 [Piyush Narang] Switch to getValuesWriterFactory call to non-static
0b78e04 [Piyush Narang] Address Ryan's feedback
1da6ca3 [Piyush Narang] Merge branch 'master' into piyush/dynamic-encoding-overrides
f021ed2 [Piyush Narang] Tweak comment in ValuesWriterFactory
cb02ea0 [Piyush Narang] Fix review comments
bf4bc6d [Piyush Narang] Add support for Config setting in ValuesWriter factory
8a852a3 [Piyush Narang] Log values writer factory chosen
e4b61a4 [Piyush Narang] Tweak factory instantiation a bit
b46cccd [Piyush Narang] Add class based factory override
6a5428f [Piyush Narang] Clean up some stuff in ValuesWriterFactory
0f8cd09 [Piyush Narang] Refactor mockito version
9ead61d [Piyush Narang] Add guava test dep
5c636c7 [Piyush Narang] Add encoding-overrides config to ParquetOutputFormat config
b9d6c13 [Piyush Narang] Refactor code in ValuesWriterFactory a bit
ff4c90d [Piyush Narang] Pull out value writer creation to ValuesWriterFactory and add unit tests
This has been pending for a while. With the other fix referenced in PARQUET-146, I think Parquet can now move to Java 7.

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #231 from nezihyigitbasi/parquet-java7 and squashes the following commits:

0b7130a [Nezih Yigitbasi] Compile with Java 7
This fixes PARQUET-400 by replacing `CompatibilityUtil` with `SeekableInputStream` that's implemented for hadoop-1 and hadoop-2. The benefit of this approach is that `SeekableInputStream` can be used for non-Hadoop file systems in the future.

This also changes the default Hadoop version to Hadoop-2. The library is still compatible with Hadoop 1.x, but this makes building Hadoop-2 classes, like `H2SeekableInputStream`, much easier and removes the need for multiple hadoop versions during compilation.

Author: Ryan Blue <blue@apache.org>

Closes #349 from rdblue/PARQUET-400-byte-buffers and squashes the following commits:

1bcb8a8 [Ryan Blue] PARQUET-400: Fix review nits.
823ca00 [Ryan Blue] PARQUET-400: Add tests for Hadoop 2 readFully.
02d3709 [Ryan Blue] PARQUET-400: Remove unused property.
b543013 [Ryan Blue] PARQUET-400: Fix logger for HadoopStreams.
2cb6934 [Ryan Blue] PARQUET-400: Remove H2SeekableInputStream tests.
abaa695 [Ryan Blue] PARQUET-400: Fix review items.
5dc50a5 [Ryan Blue] PARQUET-400: Add tests for H1SeekableInputStream methods.
730a9e2 [Ryan Blue] PARQUET-400: Move SeekableInputStream to io package.
506a556 [Ryan Blue] PARQUET-400: Remove Hadoop dependencies from SeekableInputStream.
c80580c [Ryan Blue] PARQUET-400: Handle UnsupportedOperationException from read(ByteBuffer).
ba08b3f [Ryan Blue] PARQUET-400: Replace CompatibilityUtil with SeekableInputStream.
A merge command for parquet-tools based on https://issues.apache.org/jira/browse/PARQUET-382.

Author: flykobe <flykobecy@gmail.com>

Closes #327 from flykobe/merge_tool and squashes the following commits:

b031c18 [flykobe] check input files
da28832 [flykobe] merge multi parquet files to one file
Author: Julien Le Dem <julien@dremio.com>

Closes #364 from julienledem/fix_travis and squashes the following commits:

3e80893 [Julien Le Dem] update README.md
958e0ef [Julien Le Dem] fix travis build. Broken because google code shut down
Previously, this passed the skip to the underlying readers, but would
not update previous and would corrupt values or cause exceptions.

Author: Ryan Blue <blue@apache.org>

Closes #366 from rdblue/PARQUET-623-fix-delta-byte-array-skip and squashes the following commits:

f85800c [Ryan Blue] PARQUET-623: Fix DeltaByteArrayReader#skip.
Currently, converting protobuf messages with extension can result in an uninformative error or a data corruption. A more detailed explanation in the corresponding [jira](https://issues.apache.org/jira/browse/PARQUET-660).

This patch simply ignores extension fields in protobuf messages.

In the longer run, I'd like to add a proper support for Protobuf extensions. This might take a little longer though, so I've decided to improve the current situation with this patch.

Author: Jakub Kukul <jakub.kukul@gmail.com>

Closes #351 from jkukul/master and squashes the following commits:

27580ab [Jakub Kukul] PARQUET-660: Throw Unsupported exception for messages with extensions.
db6e08b [Jakub Kukul] PARQUET-660: Refactor: Don't use additional variable for indexing fieldWriters.
e910a8a [Jakub Kukul] PARQUET-660: Refactor: Add missing indentation.
@nielsbasjes
Contributor Author

Ok, that git action went totally wrong.
I'm creating a new pull request.

@nielsbasjes
Contributor Author

New PR is #369
