Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Sep 1, 2016

What changes were proposed in this pull request?

This PR fixes ColumnVectorUtils.populate so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.

This is being only called within VectorizedParquetRecordReader.java#L185.

When partition column types are explicitly given to DateType or TimestampType (rather than inferring the type of partition column), this fails with the exception below:

16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...

How was this patch tested?

Unit tests in SQLQuerySuite.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64780 has finished for PR 14919 at commit bef83fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Copy link
Member

@HyukjinKwon can we add a more targeted unit test in one of the vectorized reader test suites that can explicitly test for partitioned columns?

@HyukjinKwon
Copy link
Member Author

@sameeragarwal Thanks for your comment. Yeap, I will.

@HyukjinKwon
Copy link
Member Author

@sameeragarwal Could you take another look please?

@SparkQA
Copy link

SparkQA commented Sep 3, 2016

Test build #64891 has finished for PR 14919 at commit 88f7d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 4, 2016

Test build #64909 has finished for PR 14919 at commit acf2a3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Sep 4, 2016

@davies Do you mind if I ask to review please?

@sameeragarwal
Copy link
Member

LGTM, thanks @HyukjinKwon! cc @davies

@HyukjinKwon
Copy link
Member Author

Thanks @sameeragarwal !

@davies
Copy link
Contributor

davies commented Sep 9, 2016

LGTM, merging into master and 2.0 branch, thanks!

@asfgit asfgit closed this in f7d2143 Sep 9, 2016
asfgit pushed a commit that referenced this pull request Sep 9, 2016
… Parquet vectorized reader

This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.

This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).

When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below:

```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
```

Unit tests in `SQLQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14919 from HyukjinKwon/SPARK-17354.

(cherry picked from commit f7d2143)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
… Parquet vectorized reader

## What changes were proposed in this pull request?

This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.

This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).

When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below:

```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
```

## How was this patch tested?

Unit tests in `SQLQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14919 from HyukjinKwon/SPARK-17354.
@HyukjinKwon HyukjinKwon deleted the SPARK-17354 branch January 2, 2018 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants