Skip to content

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Oct 14, 2019

What changes were proposed in this pull request?

There will be 2 times unsafeProjection convert operation When we read a Parquet data file use non-vectorized mode:

  1. ParquetGroupConverter call unsafeProjection function to covert SpecificInternalRow to UnsafeRow every times when read Parquet data file use ParquetRecordReader.

  2. ParquetFileFormat will call unsafeProjection function to covert this UnsafeRow to another UnsafeRow again when partitionSchema is not empty in DataSourceV1 branch, and PartitionReaderWithPartitionValues will always do this convert operation in DataSourceV2 branch.

In this pr, remove unsafeProjection convert operation in ParquetGroupConverter and change ParquetRecordReader to produce SpecificInternalRow instead of UnsafeRow.

Why are the changes needed?

The first time convert in ParquetGroupConverter is redundant and ParquetRecordReader return a InternalRow(SpecificInternalRow) is enough.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit Test

@LuciferYang LuciferYang changed the title [SPARK-29454][SQL]Reduce one unsafeProjection call when read parquet file [SPARK-29454][SQL]Reduce unsafeProjection times when read parquet file Oct 14, 2019
@LuciferYang LuciferYang changed the title [SPARK-29454][SQL]Reduce unsafeProjection times when read parquet file [SPARK-29454][SQL]Reduce unsafeProjection times when read Parquet file Oct 14, 2019
@LuciferYang
Copy link
Contributor Author

LuciferYang commented Oct 14, 2019

@cloud-fan @xuanyuanking Could you take a look at the PR, thx~

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29454][SQL]Reduce unsafeProjection times when read Parquet file [SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file Oct 14, 2019
@dongjoon-hyun
Copy link
Member

ok to test

// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
iter.asInstanceOf[Iterator[InternalRow]]
} else {
logDebug(s"Falling back to parquet-mr")
Copy link
Member

@dongjoon-hyun dongjoon-hyun Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, is this PR aiming only for non-vectorized code, @LuciferYang ? If then, please update the PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok~

@SparkQA
Copy link

SparkQA commented Oct 15, 2019

Test build #112069 has finished for PR 26106 at commit ad7d94e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang LuciferYang changed the title [SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file [SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized read mode Oct 15, 2019
@LuciferYang LuciferYang changed the title [SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized read mode [SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized mode Oct 15, 2019
@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@cloud-fan cloud-fan closed this in a988aaf Oct 15, 2019
@LuciferYang
Copy link
Contributor Author

thx @dongjoon-hyun @cloud-fan ~

@LuciferYang LuciferYang deleted the spark-parquet-unsafe-projection branch June 6, 2022 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants