@rdblue
Contributor

@rdblue rdblue commented Jun 29, 2016

This fixes PARQUET-400 by replacing CompatibilityUtil with SeekableInputStream that's implemented for hadoop-1 and hadoop-2. The benefit of this approach is that SeekableInputStream can be used for non-Hadoop file systems in the future.

This also changes the default Hadoop version to Hadoop-2. The library is still compatible with Hadoop 1.x, but this makes building Hadoop-2 classes, like H2SeekableInputStream, much easier and removes the need for multiple hadoop versions during compilation.

@rdblue
Contributor Author

rdblue commented Jun 29, 2016

@piyushnarang, this is the update to your PR that I think may be a bit cleaner.

}

// Visible for testing
static int readDirectBuffer(FSDataInputStream f, ByteBuffer buf, byte[] temp) throws IOException {
Contributor Author

To cut down on allocation, this uses an 8k buffer that's tied to the instance. We can change this and maybe even make it configurable to find out where a good trade-off is.
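For illustration, here is a minimal sketch of the temp-buffer idea: bytes are copied from a plain `InputStream` into a (possibly direct) `ByteBuffer` through a reusable `byte[]`. It uses `InputStream` instead of `FSDataInputStream` so it runs without Hadoop. The class name `TempBufferReadSketch` and the fill-until-full loop are assumptions and may differ from what the PR does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch, not the PR's code: InputStream stands in for FSDataInputStream.
final class TempBufferReadSketch {
  // Assumed default size, taken from the "8k" mentioned in the comment above.
  static final int TEMP_SIZE = 8192;

  /**
   * Copies bytes from the stream into buf through the reusable temp array.
   * Returns the number of bytes copied, or -1 if the stream was already at EOF.
   * The temp array must not be empty.
   */
  static int readDirectBuffer(InputStream in, ByteBuffer buf, byte[] temp) throws IOException {
    int total = 0;
    while (buf.hasRemaining()) {
      // Never ask for more than the temp array or the target buffer can hold.
      int toRead = Math.min(buf.remaining(), temp.length);
      int n = in.read(temp, 0, toRead);
      if (n < 0) {
        return total == 0 ? -1 : total;
      }
      buf.put(temp, 0, n);
      total += n;
    }
    return total;
  }
}
```

Because the temp array is passed in, a caller can allocate it once per stream instance and reuse it for every read, which is the allocation saving described above.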

@piyushnarang

Think for the most part this is similar to what we had in the other PR. Guess I'm not entirely clear on how this enables non-Hadoop file systems in the future - you're still depending on FSDataInputStream and such in the SeekableInputStream. In the other implementation you could just derive from CompatibilityReader, right?

Taking a step back, does it make sense to break out the ParquetFileReader code to create a ParquetFileReader interface or abstract class that has a Hadoop-based implementation and other implementations? Feels like we're trying to retrofit this support into ParquetFileReader because it currently has the logic inline. If we pulled it out it might make things easier.

Think we should try to eliminate the reflection overhead (or benchmark it to confirm it's low) if possible. With this implementation we call the reflection-based ctor of the V2 reader every time we create a ParquetFileReader.

@rdblue
Contributor Author

rdblue commented Jun 29, 2016

@piyushnarang, it's definitely similar, I just made a few changes to your PR.

The SeekableInputStream is a start for separating Parquet's internals from Hadoop classes, like FSDataInputStream. That's what other projects, like Avro, use to pass Hadoop or non-Hadoop input streams to the internals and this has been something we've been meaning to do for a while. Now seems like a good time. You're right that as long as it imports Hadoop classes, it doesn't help much so I've separated the helper method from the actual class. Maybe we should move SeekableInputStream to parquet-common as well?

I think we do eventually want to make a reader API that is independent of Hadoop, but I don't think we need to do it in this PR. I'm just trying to make this get us further along toward the goal of not needing the Hadoop API to use Parquet.
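As a rough sketch of what that separation could look like, the example below defines an abstract seekable stream with no Hadoop imports and a non-Hadoop implementation backed by a byte array. The names `SeekableStreamSketch` and `ByteArraySeekableStream`, and the `getPos`/`seek` method set, are assumptions for illustration, not the PR's actual API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the PR's SeekableInputStream; the method set is an assumption.
abstract class SeekableStreamSketch extends InputStream {
  /** Returns the current position in the stream. */
  abstract long getPos() throws IOException;

  /** Moves the stream so that the next read starts at newPos. */
  abstract void seek(long newPos) throws IOException;
}

// A non-Hadoop implementation: nothing here depends on FSDataInputStream.
final class ByteArraySeekableStream extends SeekableStreamSketch {
  private final byte[] data;
  private int pos = 0;

  ByteArraySeekableStream(byte[] data) {
    this.data = data;
  }

  @Override
  long getPos() {
    return pos;
  }

  @Override
  void seek(long newPos) throws IOException {
    if (newPos < 0 || newPos > data.length) {
      throw new EOFException("Cannot seek to position " + newPos);
    }
    pos = (int) newPos;
  }

  @Override
  public int read() {
    // Mask to return an unsigned byte value, as InputStream.read() requires.
    return pos < data.length ? (data[pos++] & 0xFF) : -1;
  }
}
```

Reader code written against the abstract class would not need to know whether the bytes come from HDFS, local disk, or memory.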

On reflection overhead: this has the same overhead that the previous implementations had, instantiating a class to handle Hadoop 2 streams. newV2Reader was looking up the class and calling newInstance. And, I don't think this is going to make much of a difference because it is once per stream.
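To make the cost concrete, here is a hedged sketch of the general pattern: the class and constructor are looked up once in static fields, so each wrap call pays only for `newInstance`. `java.lang.StringBuilder` is a stand-in for an optional class such as the Hadoop 2 stream wrapper, and every name in the sketch is hypothetical.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch: StringBuilder stands in for a class that may be missing at runtime.
final class ReflectiveWrapSketch {
  // Both lookups run once, when this class is initialized. Null means "not available".
  private static final Class<?> OPTIONAL_CLASS = loadOrNull("java.lang.StringBuilder");
  private static final Constructor<?> OPTIONAL_CTOR = ctorOrNull(OPTIONAL_CLASS, String.class);

  static Class<?> loadOrNull(String name) {
    try {
      return Class.forName(name);
    } catch (ClassNotFoundException | NoClassDefFoundError e) {
      return null;
    }
  }

  static Constructor<?> ctorOrNull(Class<?> cls, Class<?>... argTypes) {
    if (cls == null) {
      return null;
    }
    try {
      return cls.getConstructor(argTypes);
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  /** Wraps the value with the optional class when present, else returns it unchanged. */
  static Object wrap(String value) {
    if (OPTIONAL_CTOR != null) {
      try {
        return OPTIONAL_CTOR.newInstance(value);
      } catch (ReflectiveOperationException e) {
        // Fall back to the unwrapped value if instantiation fails.
      }
    }
    return value;
  }
}
```

With this shape, the per-stream reflective work is a single constructor invocation, which matches the "once per stream" argument above.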

@piyushnarang

Yeah I personally prefer keeping such changes small and iterative. Would be nice if we just tackled fixing the immediate concern (reads are broken) first and then followed up with a change to make Parquet more friendly to non-Hadoop setups. That said, if you're keen on adding that as part of this fix let's go ahead. I can close my prior PR in favor of this.

Makes sense to move SeekableInputStream to parquet-common. Wouldn't make sense to take a dependency on parquet-hadoop if it isn't needed.

Let me know when you're happy with the implementation, I can take a more detailed look.

@rdblue
Contributor Author

rdblue commented Jun 29, 2016

@piyushnarang, I'd prefer to go with this approach since it solves the same problem and sets us up for separating the APIs later. I've added a big file of tests based on your MockInputStream so I think this is ready to go.

@rdblue rdblue changed the title from "PARQUET-400: Replace CompatibilityUtil with SeekableInputStream." to "[WIP] PARQUET-400: Replace CompatibilityUtil with SeekableInputStream." Jun 29, 2016

import java.io.InputStream;
import java.nio.ByteBuffer;

public abstract class SeekableInputStream extends InputStream {

Let's add some javadoc on this abstract class, its purpose, etc.

Wondering if SeekableInputStream is the right name? Maybe ParquetInputStream?

Member

This API is not really Parquet-specific. It allows seeking and reading blocks of the file.
Being in a parquet package is enough "parquet" in the name, I think.

@julienledem
Member

I made some comments. Thanks @piyushnarang and @rdblue for taking care of this

@rdblue rdblue force-pushed the PARQUET-400-byte-buffers branch from 80d5889 to c6ff434 Compare July 6, 2016 22:11
@rdblue
Contributor Author

rdblue commented Jul 6, 2016

@piyushnarang, @julienledem, I've addressed the review items.

I tested the hadoop 2 readFully method for ByteBuffer using the same tests I wrote for the hadoop 1 implementation, but the test won't compile in hadoop-1 so we would have to create another module for hadoop-2 tests and exclude it from the hadoop 1 test run. I don't think testing the readFully method is worth the trouble because it is so simple and unlikely to change from its current working and tested state. The tests are available in 4f273e4 if you'd like to have a look, but they were removed in the next commit to get tests passing in Jenkins.
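For readers who want to see how small such a method can be, here is a sketch of a `readFully` loop over a `read(ByteBuffer)` call. The nested `Reader` interface is a hypothetical stand-in (the PR only shows a `readFully(Reader stream, ByteBuffer buf)` signature), and the loop is an assumption about the general technique, not a copy of the PR's code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of a readFully loop for ByteBuffer targets.
final class ReadFullySketch {
  /** Minimal reader: fills some of buf and returns the byte count, or -1 at EOF. */
  interface Reader {
    int read(ByteBuffer buf) throws IOException;
  }

  /** Keeps reading until buf is full, or fails with EOFException if the data runs out. */
  static void readFully(Reader reader, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
      int n = reader.read(buf);
      if (n < 0) {
        throw new EOFException(
            "Reached the end of stream with " + buf.remaining() + " bytes left to read");
      }
    }
  }
}
```

A test for this kind of method mostly needs a reader that returns fewer bytes than requested, so that the loop has to run more than once.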

*/
public static SeekableInputStream wrap(FSDataInputStream stream) {
if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
byteBufferReadableClass.isInstance(stream.getWrappedStream())) {

I'm wondering if this check is a bit fragile? It will work if the stream's immediate inner stream is an instance of ByteBufferReadable that has a concrete implementation of read(ByteBuffer buf). If that stream in turn ends up delegating to its inner stream, then we might have a problem, right?

Contributor Author

Yes, if the wrapped class implemented ByteBufferReadable only to delegate to an inner stream that may not support it, then it would be a problem, and that would be caught by the previous version but not by this method. But I think it's fine to assume that when a stream implements ByteBufferReadable, it actually supports it. That's what ByteBufferReadable is intended to signal, and it's weird that we have to implement a special case for FSDataInputStream at all.

Also, the previous implementation calls a method that isn't present in hadoop-1 and to do that relies on when methods are linked. I think the trade-off is worth this cleaner way of deciding whether to use the ByteBuffer interface, but I'm happy to change it if others agree with you that we should handle more levels of wrapper classes.

@piyushnarang

@rdblue - given that we expect the bulk of people to use the hadoop 2 version of the code (correct me if I'm wrong), we should have the tests running. Even if we have to create a new module for the tests, I think it is probably worth it, as this functionality is pretty important to ensuring Parquet reads work. It will give future developers something to use in case they need to change the hadoop 2 read code. I might have misunderstood, so let me know if you had something else in mind.

@rdblue rdblue reopened this Jul 26, 2016
}
}

public static void readFully(Reader stream, ByteBuffer buf) throws IOException {

rename from stream to reader

@piyushnarang

Thanks Ryan. Couple of minor comments but think it looks good to me.

@rdblue
Contributor Author

rdblue commented Jul 27, 2016

I fixed the nits that @piyushnarang pointed out. Anything else? @julienledem or @isnotinvain?

@isnotinvain
Contributor

👍 Thanks @piyushnarang and @rdblue for tackling this!

+1

* {@code SeekableInputStream} is an interface with the methods needed by
* Parquet to read data from a file or Hadoop data stream.
*/
public abstract class SeekableInputStream extends InputStream {
Member

Let's add getLength() as well.
That's the only extra information we need to read the footer. It's missing in FSDataInputStream and that would simplify some code where we have to pass the FileStatus object along.
It is always available since we get the stream with FileStatus.open().

Contributor Author

A seekable stream is a fairly normal construct, while a stream that knows its own length is irregular. I think we are probably better off having a slightly higher-level concept of a stream provider that knows the length of streams it opens. That would basically encapsulate FileStatus and FileSystem so you can pass a single object that can open parallel streams for a single Parquet file.
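One possible shape for such a provider, sketched under assumptions: an interface that reports the length and opens independent streams, plus an in-memory implementation and a helper that uses the length to read the tail of the data (as a footer read would). All names here (`StreamProviderSketch`, `InMemoryStreamProvider`, `FooterTailSketch`) are hypothetical and do not come from the PR or from PARQUET-674.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical provider: knows the length and can open any number of independent streams.
interface StreamProviderSketch {
  long getLength() throws IOException;

  InputStream newStream() throws IOException;
}

// In-memory stand-in for something that would wrap FileSystem and FileStatus.
final class InMemoryStreamProvider implements StreamProviderSketch {
  private final byte[] data;

  InMemoryStreamProvider(byte[] data) {
    this.data = data.clone();
  }

  @Override
  public long getLength() {
    return data.length;
  }

  @Override
  public InputStream newStream() {
    return new ByteArrayInputStream(data);
  }
}

final class FooterTailSketch {
  /** Reads the last n bytes, using the provider's length to find the start offset. */
  static byte[] readTail(StreamProviderSketch provider, int n) throws IOException {
    long start = provider.getLength() - n;
    if (start < 0) {
      throw new IOException("Requested " + n + " bytes but only " + provider.getLength() + " exist");
    }
    try (InputStream in = provider.newStream()) {
      long skipped = 0;
      while (skipped < start) {
        long s = in.skip(start - skipped); // skip may advance less than requested
        if (s <= 0) {
          throw new EOFException("Could not skip to offset " + start);
        }
        skipped += s;
      }
      byte[] tail = new byte[n];
      int off = 0;
      while (off < n) {
        int r = in.read(tail, off, n - off);
        if (r < 0) {
          throw new EOFException("Stream ended before the tail was read");
        }
        off += r;
      }
      return tail;
    }
  }
}
```

The stream itself stays a plain seekable or sequential stream, and only the provider carries the length, which keeps the two concerns separate as suggested above.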

How about doing this as a follow up? This issue is a blocker for 1.9.0 so I'd like to get it in. We can discuss the right way to pass around the length but also work toward getting 1.9.0 out. I'll open an issue for this.

Contributor Author

Opened PARQUET-674.

Member

fair enough

@julienledem
Member

+1

@isnotinvain
Contributor

Looks like 76a2ac8 conflicts with this PR

@rdblue rdblue force-pushed the PARQUET-400-byte-buffers branch from 7009b6b to 1bcb8a8 Compare August 15, 2016 15:54
@rdblue
Contributor Author

rdblue commented Aug 15, 2016

@isnotinvain, I rebased and fixed the conflicts, which were small. I'll commit this later today.

@julienledem
Member

sounds good to me

@isnotinvain
Contributor

Great, thanks!

@asfgit asfgit closed this in 898f3d0 Aug 16, 2016
@piyushnarang

Thanks, glad this is now out :-)

rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This fixes PARQUET-400 by replacing `CompatibilityUtil` with `SeekableInputStream` that's implemented for hadoop-1 and hadoop-2. The benefit of this approach is that `SeekableInputStream` can be used for non-Hadoop file systems in the future.

This also changes the default Hadoop version to Hadoop-2. The library is still compatible with Hadoop 1.x, but this makes building Hadoop-2 classes, like `H2SeekableInputStream`, much easier and removes the need for multiple hadoop versions during compilation.

Author: Ryan Blue <blue@apache.org>

Closes apache#349 from rdblue/PARQUET-400-byte-buffers and squashes the following commits:

1bcb8a8 [Ryan Blue] PARQUET-400: Fix review nits.
823ca00 [Ryan Blue] PARQUET-400: Add tests for Hadoop 2 readFully.
02d3709 [Ryan Blue] PARQUET-400: Remove unused property.
b543013 [Ryan Blue] PARQUET-400: Fix logger for HadoopStreams.
2cb6934 [Ryan Blue] PARQUET-400: Remove H2SeekableInputStream tests.
abaa695 [Ryan Blue] PARQUET-400: Fix review items.
5dc50a5 [Ryan Blue] PARQUET-400: Add tests for H1SeekableInputStream methods.
730a9e2 [Ryan Blue] PARQUET-400: Move SeekableInputStream to io package.
506a556 [Ryan Blue] PARQUET-400: Remove Hadoop dependencies from SeekableInputStream.
c80580c [Ryan Blue] PARQUET-400: Handle UnsupportedOperationException from read(ByteBuffer).
ba08b3f [Ryan Blue] PARQUET-400: Replace CompatibilityUtil with SeekableInputStream.

Conflicts:
    parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
    pom.xml
Resolution:
    Fixed minor conflicts from using byte[] instead of ByteBuffer.
    Updated pom changes for current Pig version (PPD not backported).
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017