PARQUET-400: Fix for ByteBuffer incomplete read issue #346
Conversation
…ng Hadoop 2.x

The problem was not handling the case where a read request returns fewer than the requested number of bytes. FSDataInputStream lacks an equivalent of readFully for ByteBuffers; readFully is what solved this problem when reading into byte arrays. This has been fixed by adding a loop that keeps requesting the remaining bytes until everything has been read.
Conflicts: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/CompatibilityUtil.java
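In sketch form, the fix described in the commit message amounts to a loop like the following, assuming the Hadoop 2.x read(ByteBuffer) API (names here are illustrative, not the exact patch):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

class ByteBufferReads {
  // A single read(ByteBuffer) may return fewer bytes than readBuf.remaining(),
  // so keep issuing reads until the buffer is full or the stream ends.
  static void readFully(FSDataInputStream f, ByteBuffer readBuf) throws IOException {
    while (readBuf.hasRemaining()) {
      int bytesRead = f.read(readBuf);
      if (bytesRead < 0) {
        throw new EOFException("Stream ended with " + readBuf.remaining() + " bytes left to read");
      }
    }
  }
}
```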
@rdblue / @danielcweeks / @jaltekruse - please take a look

Ping @rdblue, can you take a look please?
```java
public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";

// configure if we want to use Hadoop's V2 read(bytebuffer). If true, we try to read using the
// new Hadoop read(ByteBuffer) api. Else, we skip.
```
can you put some more comments here about what this v2 read is / how it works / why you would want this on or off?
added some comments. Let me know if I can clarify more.
Invoking the read method via reflection could itself be slow, right? I don't have an intuition for how slow that can be, but for such a low-level thing as this it seems surprising that we'd do it. Instead of having this Compatibility helper class that has v1/v2 switches all over the place, can we just make an interface (or abstract class) for these operations with two implementations? Then we just pick an implementation at startup and don't need any if-v1/if-v2 logic. Also, IIRC you get some decent performance optimizations if you only ever class-load a single implementation of a particular interface / abstract class, so it should be better than calling…
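As a rough sketch of that suggestion (all names hypothetical): hide the two read paths behind one interface, probe for the Hadoop 2.x API a single time at class-load, and dispatch through the chosen implementation with no per-call reflection.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

interface StreamReader {
  int read(FSDataInputStream f, ByteBuffer buf) throws IOException;
}

// Hadoop 1.x fallback: read into a byte[] and copy into the ByteBuffer.
class V1Reader implements StreamReader {
  public int read(FSDataInputStream f, ByteBuffer buf) throws IOException {
    byte[] tmp = new byte[buf.remaining()];
    int n = f.read(tmp, 0, tmp.length);
    if (n > 0) {
      buf.put(tmp, 0, n);
    }
    return n;
  }
}

// Hadoop 2.x path: call the ByteBuffer-based read directly.
class V2Reader implements StreamReader {
  public int read(FSDataInputStream f, ByteBuffer buf) throws IOException {
    return f.read(buf);
  }
}

class Readers {
  // Reflection runs exactly once, at class-load; every later call is direct.
  static final StreamReader READER = chooseReader();

  private static StreamReader chooseReader() {
    try {
      FSDataInputStream.class.getMethod("read", ByteBuffer.class);
      return new V2Reader();
    } catch (NoSuchMethodException e) {
      return new V1Reader();
    }
  }
}
```

One caveat with this shape: V2Reader only compiles against a Hadoop 2.x dependency, which is what the parquet-hadoop2 module idea mentioned later in this thread would address.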
```java
private int readWithByteBuffer(FSDataInputStream f, ByteBuffer readBuf) throws IOException {
  int remaining = readBuf.remaining();
  try {
    while (readBuf.hasRemaining()) {
```
Is there no readFully() method we can call that handles this for us? That seems surprising.
Yeah, not that I'm aware of: https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/fs/FSDataInputStream.html#read(java.nio.ByteBuffer)

> Reads up to buf.remaining() bytes into buf. Callers should use buf.limit(..) to control the size of the desired read.
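In other words, the only lever the API gives the caller is the buffer's limit. A quick illustration of capping a single read that way (names are mine, not from the patch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

class CappedRead {
  // Cap a single read at 4 KB by shrinking the buffer's limit, then restore it.
  static int readAtMost4K(FSDataInputStream f, ByteBuffer buf) throws IOException {
    int oldLimit = buf.limit();
    buf.limit(Math.min(oldLimit, buf.position() + 4096));
    int n = f.read(buf); // reads up to buf.remaining(), i.e. at most 4096 bytes
    buf.limit(oldLimit); // restore so later reads can fill the rest
    return n;
  }
}
```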
@isnotinvain - I like your idea of skipping reflection. Will look into creating a parquet-hadoop2 module in the project that depends on Hadoop 2.x so that we have the…
@isnotinvain - updated the implementation to skip reflection on the individual read calls. Just using reflection at the start to figure out which of the two interfaces to use. Tested this out with v1 reads & v2 reads with a Hadoop job - seems to work ok.
```java
 * @throws EOFException if readBuf.remaining() is greater than the number of bytes available to
 *         read on the FSDataInputStream f.
 */
int readBuf(FSDataInputStream f, ByteBuffer readBuf) throws IOException;
```
should this be called readFully?
Yeah, can rename it to readFully.
Do we need the configuration property for whether to use v1 or v2? It seems like we can auto-detect this, so any reason for the added config?
The rationale for the config property was to avoid performing the reflection-based check for every file Parquet attempts to read in some scenarios. If you know you're running Parquet in a v1-based setup, you can specify that you don't want the ByteBuffer-based read and go directly to the v1 APIs. We could do the check once and store the result in a static member variable, but that constrains you to that value for the entire JVM runtime. I believe @rdblue brought that up as a concern on the previous version of this PR: #306.
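A sketch of that shape of logic, with a hypothetical property name since the real key isn't shown in this excerpt:

```java
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;

class ReadModeChooser {
  // Hypothetical property name, for illustration only.
  static final String KEY = "parquet.hadoop.use-bytebuffer-read";

  // When the property is set explicitly, trust it and skip the reflection probe;
  // otherwise auto-detect whether the Hadoop 2.x read(ByteBuffer) API exists.
  static boolean useV2Read(Configuration conf) {
    String explicit = conf.get(KEY);
    if (explicit != null) {
      return Boolean.parseBoolean(explicit);
    }
    try {
      FSDataInputStream.class.getMethod("read", ByteBuffer.class);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }
}
```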
```java
List<Chunk> result = new ArrayList<Chunk>(chunks.size());
f.seek(offset);

//Allocate the bytebuffer based on whether the FS can support it.
```
Nit: There should be a space before "Allocate".
fixed
Closing in favor of: #349
Spinning up a new PR as I don't have write permissions to update @jaltekruse's existing PR. Fixed a few of the comments that were outstanding on that PR:

- Use readFully for the fallback and check if byteBuffer.hasArray() is true.
- Use getBuf to read all the remaining bytes into the byte buffer. While testing this, I noticed that on our Hadoop (2.x) cluster we end up returning fewer bytes than byteBuffer.remaining(), so I've added a loop to ensure we get all the remaining bytes. This also seems to line up with the javadoc for FSDataInputStream.read(byteBuffer).