-
Notifications
You must be signed in to change notification settings - Fork 465
PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Author: Julien Le Dem <julien@twitter.com> Closes apache#8 from julienledem/master and squashes the following commits: aa0d751 [Julien Le Dem] correct mailling-list 285b39c [Julien Le Dem] Update CONTRIBUTING.md bca9bc4 [Julien Le Dem] Create CONTRIBUTING.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment: The delegatingFieldConsumer delegates addField call to corresponding TypedConsumer?
Conflicts: src/main/java/parquet/format/Util.java
|
@tsdeng I refactored the APIs a bit, added javadoc and resolved conflicts. |
|
I don't want to block this PR, but I do want to comment on the approach... While this will likely address the immediate problem (memory pressure when creating splits), having this giant thrift object in a footer is still an issue. For one thing, we have to write it -- the committer that writes the joined footer is probably also under memory pressure since it has to have the merged thrift object in memory. Also, this makes reading just one file from a directory that has 400 unnecessarily expensive (as expensive as initializing a read for all 400 files). The joined footer doesn't need to be a single object and we should probably move to a model where it's a sequence of objects, one per file, and we can choose which file's metadata we are reading. |
|
Agreed. The committer does only one partition at a time so it is much more manageable than when reading many partitions at once. (we had issues with a schema with thousands of columns reading dozens of partitions at a time). One thrift containing a list of Thrift object versus a sequence of objects, doesn't make any difference in my opinion as we can easily handle the top level object without having to materialize the list as in this PR. This could be applied in the committer for write as well so that it would not load everything in memory. I'm also proposing a new _common_metadata file, so that we don't have to skip the row groups when we don't need them: |
|
@dvryaboy Also if you think this is a good first step toward fixing this issue, feel free to merge. |
|
I'm lost in layers of abstraction... :) |
|
maybe I should throw in a monad or two. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the Consumers here is just used as a namespace. So should it be a package name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed, it's providing some static util methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:
abstract class TypedConsumer<T> {
abstract public void consume(T value)
}
advantage of this is removing the duplication of addXXX declaration for each XXConsumer
also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:
- get the value from protocol
- call the consume method
So the code change would be:
abstract public static class TypedConsumer<T> {
final public void read(TProtocol protocol, EventBasedThriftReader reader) {
this.consume(getValue(protocol))
}
abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I prefer the name to be consume, rather than add. It's just consumer <-> consume reads better than consumer<->add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed a bunch of stuff according to your comment.
for the Generic TypedConsumer they actually don't all have the same signature (see Struct, Set, List, Map) so I don't think we should do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you missed this comment:
We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:
abstract class TypedConsumer<T> {
abstract public void consume(T value)
}
advantage of this is removing the duplication of addXXX declaration for each XXConsumer
also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:
- get the value from protocol
- call the consume method
So the code change would be:
abstract public static class TypedConsumer<T> {
final public void read(TProtocol protocol, EventBasedThriftReader reader) {
this.consume(getValue(protocol))
}
abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did reply to this one:
not all TypedConsumers do consume(T value). See {Set,List,Struct,Map}Consumer so that would not apply in that case.
|
LGTM +1 |
|
If you like it, please merge it. |
|
I'll merge. |
|
@julienledem using this to read footers is a separate issue, correct? This doesn't seem to. |
This pull request provides an API to read thrift in a streaming fashion.
This enables ignoring fields that are not needed without loading them into memory.
It also aloow treating the data as it comes instead of when it's fully loaded in memory.