Add support for column projection to parquet sources #1056

Merged: 5 commits, Sep 22, 2014

isnotinvain
Contributor

This is similar to #1050, but adds a method, .withColumns(...), to parquet sources for specifying column projection push-down.

protected def copyWithColumnGlobs(columnGlobs: Set[ColumnProjectionGlob]): This
}

case class ColumnProjectionGlob(glob: String) {
Collaborator

extends AnyVal should work here.
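
To make the suggestion concrete, here is a minimal, self-contained sketch of how the glob wrapper could become a value class and how withColumns could delegate to the copyWithColumnGlobs signature shown in the diff. The trait name HasColumnProjection, the columnGlobs member, the ExampleParquetSource stand-in, and the glob strings in the usage note are assumptions made for illustration; this is not the code from the PR.

```scala
// Hypothetical sketch; names other than ColumnProjectionGlob, withColumns and
// copyWithColumnGlobs are assumptions, not the actual Scalding source.

// `extends AnyVal` makes this a value class, so in most cases the wrapper is
// not allocated at runtime and the underlying String is used directly.
case class ColumnProjectionGlob(glob: String) extends AnyVal

// F-bounded trait so that withColumns returns the concrete source type.
trait HasColumnProjection[This <: HasColumnProjection[This]] {
  // Globs attached to this source; empty is taken to mean "read every column".
  def columnGlobs: Set[ColumnProjectionGlob] = Set.empty

  // Returns a copy of the source with the given globs added to the existing ones.
  final def withColumns(globs: String*): This =
    copyWithColumnGlobs(columnGlobs ++ globs.map(ColumnProjectionGlob(_)))

  // Each concrete source decides how to copy itself with a new set of globs.
  protected def copyWithColumnGlobs(columnGlobs: Set[ColumnProjectionGlob]): This
}

// Stand-in for a parquet source (assumption: real sources extend Scalding types).
case class ExampleParquetSource(
  path: String,
  override val columnGlobs: Set[ColumnProjectionGlob] = Set.empty
) extends HasColumnProjection[ExampleParquetSource] {
  override protected def copyWithColumnGlobs(
    columnGlobs: Set[ColumnProjectionGlob]
  ): ExampleParquetSource = copy(columnGlobs = columnGlobs)
}
```

With this sketch, ExampleParquetSource("/foo/bar").withColumns("a/b", "c") yields a new source carrying both globs and leaves the original untouched.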

@johnynek
Collaborator

This seems fine to me, but I think we should include an example, even in the comments, of making a source where inside the job the user can pass a filter (so, companion object or constructor takes an optional filter argument).

@isnotinvain
Contributor Author

I made this a trait rather than a constructor arg so that you can use the feature even when the Source you want to add a filter predicate to has no constructor arg for it.

For example:

// note, MyCustomSource does not take a filter as an argument
class MyCustomSource(dateRange: DateRange) 
  extends DailySuffixParquetThrift[SomeTBase]("/foo/bar", dateRange)

val customSourceWithFilter = new MyCustomSource(dr) { 
  override val filterPredicate = Some(fp) 
}

I can add an example of this. Do you think I should add constructor params as well? That can be done like this:

class MyCustomSource(
  dateRange: DateRange, 
  override val filterPredicate: Option[FilterPredicate] = None
) extends DailySuffixParquetThrift[SomeTBase]("/foo/bar", dateRange)
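
For readers who want to try both styles without the Scalding dependencies, here is a self-contained sketch that places them side by side. FilterPredicate, HasFilterPredicate, ExampleSource and ExampleSourceWithArg below are simplified stand-ins invented for this illustration, not the real Parquet or Scalding types.

```scala
// Stand-in types (assumptions for illustration only).
final case class FilterPredicate(description: String)

trait HasFilterPredicate {
  // No filter unless a subclass or instance overrides this.
  def filterPredicate: Option[FilterPredicate] = None
}

// Hypothetical source whose constructor knows nothing about filters.
class ExampleSource(val path: String) extends HasFilterPredicate

// Hypothetical variant that exposes the filter as an optional constructor argument.
class ExampleSourceWithArg(
  path: String,
  override val filterPredicate: Option[FilterPredicate] = None
) extends ExampleSource(path)

object FilterExamples {
  val fp: FilterPredicate = FilterPredicate("example predicate")

  // Style 1: override the member at the call site with an anonymous subclass.
  val viaOverride: ExampleSource = new ExampleSource("/foo/bar") {
    override val filterPredicate = Some(fp)
  }

  // Style 2: pass the filter through the constructor.
  val viaConstructor: ExampleSource = new ExampleSourceWithArg("/foo/bar", Some(fp))
}
```

Both values report the same filterPredicate. The first style works even for sources whose authors never anticipated filtering, which is the motivation given above for using a trait.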

* you intend to use can also make your job significantly more efficient (parquet column projection
* push-down will skip reading unused columns from disk).
* The columns are specified in the format described here:
* https://github.com/apache/incubator-parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
Collaborator

This doc describes setting a key in the config. Is that how this works under the covers? What about multi-input jobs (merges, joins)?

Contributor Author

This is set up in sourceConfInit, which handles this case.
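
As a rough illustration of why per-source initialization helps with merges and joins, the toy model below assumes that each source is handed its own copy of the job configuration when its init hook runs. Plain Maps stand in for Hadoop configurations, and ToySource, initAll, the key name and the ";" separator are all invented for this sketch; it does not reproduce the Cascading or Parquet APIs.

```scala
// Toy model only: immutable Maps stand in for Hadoop job configurations.
object PerSourceConfSketch {
  type Conf = Map[String, String]

  // Hypothetical key name, used only in this illustration.
  val ProjectionKey = "example.column.projection"

  final case class ToySource(name: String, columnGlobs: Set[String]) {
    // Mirrors the idea of sourceConfInit: the source adds its own settings to
    // the configuration it is handed and returns the result.
    def sourceConfInit(conf: Conf): Conf =
      if (columnGlobs.isEmpty) conf
      else conf + (ProjectionKey -> columnGlobs.toSeq.sorted.mkString(";"))
  }

  // Every source starts from the same base configuration and produces its own
  // copy, so one source's projection never leaks into another's settings.
  def initAll(base: Conf, sources: Seq[ToySource]): Map[String, Conf] =
    sources.map(s => s.name -> s.sourceConfInit(base)).toMap
}
```

In this model, a join of a projected source with an unprojected one ends up with two configurations: only the projected source's copy contains the projection key.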

johnynek added a commit that referenced this pull request Sep 22, 2014
Add support for column projection to parquet sources
@johnynek johnynek merged commit 3fe261e into develop Sep 22, 2014
@johnynek johnynek deleted the alexlevenson/parquet-projection branch September 22, 2014 22:54