
Conversation

@viirya
Member

@viirya viirya commented Feb 28, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13537

What changes were proposed in this pull request?

In readBytes of VectorizedPlainValuesReader, we use buffer[offset] to access bytes in buffer. This is incorrect because offset has Platform.BYTE_ARRAY_OFFSET added to it at initialization, so it is an absolute address rather than an array index. We should fix it.
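For context, here is a minimal, self-contained sketch of the bug and the fix. The class, field, and method names are paraphrased for illustration and are not copied from the patch; only Platform.getByte and Platform.BYTE_ARRAY_OFFSET are real Spark APIs.

```java
import org.apache.spark.unsafe.Platform;

// Illustrative sketch only: 'offset' already includes Platform.BYTE_ARRAY_OFFSET,
// so it is an absolute unsafe address and must not be used as an array index.
class ReadBytesSketch {
  private byte[] buffer;
  private int offset;

  void initFromPage(byte[] page, int start) {
    this.buffer = page;
    // After this, 'offset' is an absolute address, not an index into 'buffer'.
    this.offset = Platform.BYTE_ARRAY_OFFSET + start;
  }

  byte[] readBytes(int total) {
    byte[] out = new byte[total];
    for (int i = 0; i < total; i++) {
      // Buggy: buffer[offset] indexes past buffer.length because offset carries
      // the BYTE_ARRAY_OFFSET base -> ArrayIndexOutOfBoundsException.
      // out[i] = buffer[offset];

      // Fixed: dereference the absolute address instead.
      out[i] = Platform.getByte(buffer, offset);
      offset += 4;  // plain-encoded byte values are stored as 4-byte ints, hence the stride
    }
    return out;
  }
}
```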

How was this patch tested?

ParquetHadoopFsRelationSuite sometimes fails (depending on the randomly generated data) because of this bug; see the Jenkins failure at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52136/consoleFull. After applying this patch, the test passes.

I added a test to ParquetHadoopFsRelationSuite with data that fails without this patch.

The error stack trace:

```
[info] ParquetHadoopFsRelationSuite:
[info] - test all data types - StringType (440 milliseconds)
[info] - test all data types - BinaryType (434 milliseconds)
[info] - test all data types - BooleanType (406 milliseconds)
20:59:38.618 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2597.0 (TID 67966)
java.lang.ArrayIndexOutOfBoundsException: 46
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBytes(VectorizedPlainValuesReader.java:88)
```
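The large index in the trace is consistent with this: on a typical HotSpot JVM, Platform.BYTE_ARRAY_OFFSET is 16, so an absolute offset used as an array index overshoots a small buffer. A hypothetical illustration (the buffer size and offset are made-up example values, not taken from the failing test):

```java
import org.apache.spark.unsafe.Platform;

public class OffsetDemo {
  public static void main(String[] args) {
    byte[] buffer = new byte[32];                          // a small decoded page
    int offset = Platform.BYTE_ARRAY_OFFSET + 30;          // e.g. 16 + 30 = 46 on HotSpot
    System.out.println(Platform.getByte(buffer, offset));  // OK: reads logical element 30
    System.out.println(buffer[offset]);                    // ArrayIndexOutOfBoundsException: 46
  }
}
```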

@viirya
Member Author

viirya commented Feb 28, 2016

cc @nongli @rxin

@SparkQA

SparkQA commented Feb 28, 2016

Test build #52142 has finished for PR 11418 at commit 44f5c41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2016

Test build #52143 has finished for PR 11418 at commit 1b09304.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Feb 29, 2016

LGTM

Thanks for fixing this. Just out of curiosity, how did you find this initially?

@viirya
Member Author

viirya commented Feb 29, 2016

I saw the failure in the #11415 Jenkins test report. Then I reran the test locally to find the problematic data and debugged with it.

@rxin
Contributor

rxin commented Feb 29, 2016

Thanks - I've merged this into master.

@asfgit asfgit closed this in 6dfc4a7 Feb 29, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#11418 from viirya/fix-readbytes.
@viirya viirya deleted the fix-readbytes branch December 27, 2023 18:33