[SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized mode #26106
@cloud-fan @xuanyuanking Could you take a look at the PR, thx~
ok to test
      // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
      iter.asInstanceOf[Iterator[InternalRow]]
    } else {
      logDebug(s"Falling back to parquet-mr")
So, is this PR aimed only at the non-vectorized code path, @LuciferYang? If so, please update the PR description.
ok~
Test build #112069 has finished for PR 26106 at commit
LGTM, merging to master!
thx @dongjoon-hyun @cloud-fan ~
What changes were proposed in this pull request?
When a Parquet data file is read in non-vectorized mode, each row goes through an unsafeProjection conversion twice:
1. ParquetGroupConverter calls unsafeProjection to convert a SpecificInternalRow to an UnsafeRow for every row read through ParquetRecordReader.
2. ParquetFileFormat calls unsafeProjection again to convert this UnsafeRow to another UnsafeRow when partitionSchema is not empty in the DataSourceV1 branch, and PartitionReaderWithPartitionValues always performs this conversion in the DataSourceV2 branch.
This PR removes the unsafeProjection conversion in ParquetGroupConverter and changes ParquetRecordReader to produce SpecificInternalRow instead of UnsafeRow.
Why are the changes needed?
The first conversion, in ParquetGroupConverter, is redundant: it is enough for ParquetRecordReader to return an InternalRow (SpecificInternalRow).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit Test
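
The per-row saving described under "What changes were proposed in this pull request?" can be illustrated with a toy model. This is only a sketch in plain Scala with no Spark dependency: GenericRow, BinaryRow, CountingProjection, readBefore and readAfter are hypothetical stand-ins invented for illustration and are not the classes or methods touched by this PR. It assumes one conversion per row at each projection step and simply counts them.

```scala
// Toy model only: every type and method here is a hypothetical stand-in,
// not a Spark class. It counts conversions, it does not read Parquet.
object ProjectionCountSketch {
  // Stand-in for a mutable generic row (the role SpecificInternalRow plays).
  final case class GenericRow(values: Vector[Any])

  // Stand-in for a binary row (the role UnsafeRow plays).
  final case class BinaryRow(values: Vector[Any])

  // Stand-in for an unsafe projection; records how often it is applied.
  final class CountingProjection {
    var calls: Int = 0
    def apply(values: Vector[Any]): BinaryRow = {
      calls += 1
      BinaryRow(values)
    }
  }

  // Before: the converter projects each record, then the partition columns
  // are appended with a second projection.
  def readBefore(records: Seq[Vector[Any]],
                 partitionValues: Vector[Any]): (Seq[BinaryRow], Int) = {
    val converterProj = new CountingProjection
    val appendProj = new CountingProjection
    val out = records.map { r =>
      val converted = converterProj(r)                 // first conversion
      appendProj(converted.values ++ partitionValues)  // second conversion
    }
    (out, converterProj.calls + appendProj.calls)
  }

  // After: the reader hands over the generic row as-is, so only the
  // projection that appends partition columns runs.
  def readAfter(records: Seq[Vector[Any]],
                partitionValues: Vector[Any]): (Seq[BinaryRow], Int) = {
    val appendProj = new CountingProjection
    val out = records.map { r =>
      val row = GenericRow(r)                          // no conversion here
      appendProj(row.values ++ partitionValues)        // only conversion
    }
    (out, appendProj.calls)
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(Vector[Any](1, "a"), Vector[Any](2, "b"))
    val partition = Vector[Any]("2019-10-14")
    val (_, before) = readBefore(records, partition)
    val (_, after) = readAfter(records, partition)
    println(s"conversions before: $before, after: $after")
  }
}
```

In this model both paths yield the same output rows, while the second performs one conversion per row instead of two.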