Describe This Problem
Queries currently pull the projected columns plus the primary key (pk) columns in both append and overwrite mode.
However, pulling the pk columns is unnecessary in append mode.
Proposal
Pull only the projected columns when querying in append mode.
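The rule above can be sketched as a small function that chooses which columns to fetch. This is a minimal illustration only: `UpdateMode` and `columns_to_fetch` are hypothetical names introduced here, and columns are represented by their indexes in the table schema rather than by the project's real schema types.

```rust
/// Table update mode, following the two modes named in this issue.
/// (Hypothetical enum for illustration; not the project's actual type.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum UpdateMode {
    Overwrite,
    Append,
}

/// Returns the schema indexes of the columns to pull from storage,
/// sorted in schema order and without duplicates.
pub fn columns_to_fetch(
    mode: UpdateMode,
    projected: &[usize],
    primary_key: &[usize],
) -> Vec<usize> {
    let mut cols: Vec<usize> = projected.to_vec();
    // Only overwrite mode also needs the primary key columns.
    if mode == UpdateMode::Overwrite {
        cols.extend_from_slice(primary_key);
    }
    cols.sort_unstable();
    cols.dedup();
    cols
}
```

With a primary key at indexes 0 and 1 and a projection of columns 3 and 4, append mode would fetch two columns instead of four.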
Reading a record batch will be divided into the following three steps:
1. Pull the `ArrowRecordBatch`.
2. Convert it to a `FetchingRecordBatch`. Work such as filling a column that does not exist in the source with a null/default value is done here. `FetchingRecordBatch` is used in `ChainIterator` / `MergeIterator`, and it may include not only the projected columns but also the primary key columns, for deduplication or other purposes.
3. Prune it to a `RecordBatch`. As noted above, `FetchingRecordBatch` can include more than the projected columns, so the non-projected columns are pruned here.
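The three steps above can be sketched as follows. This is a simplified model under stated assumptions, not the real implementation: the structs only borrow the names from this issue, columns are plain vectors of nullable `i64` values instead of Arrow arrays, and `to_fetching` and `prune` are hypothetical helper names.

```rust
/// Simplified stand-in for a column: a name plus nullable values.
#[derive(Clone, Debug, PartialEq)]
pub struct Column {
    pub name: String,
    pub values: Vec<Option<i64>>,
}

/// Step 1 output: the raw batch as pulled from storage.
#[derive(Clone, Debug, PartialEq)]
pub struct ArrowRecordBatch {
    pub num_rows: usize,
    pub columns: Vec<Column>,
}

/// Step 2 output: every fetched column, including any that were
/// missing in the source and had to be filled.
#[derive(Clone, Debug, PartialEq)]
pub struct FetchingRecordBatch {
    pub num_rows: usize,
    pub columns: Vec<Column>,
}

/// Step 3 output: only the projected columns.
#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
    pub columns: Vec<Column>,
}

/// Step 2: build the fetching batch, filling columns that do not
/// exist in the source batch with nulls.
pub fn to_fetching(batch: &ArrowRecordBatch, fetched: &[&str]) -> FetchingRecordBatch {
    let columns = fetched
        .iter()
        .map(|name| {
            batch
                .columns
                .iter()
                .find(|c| c.name == *name)
                .cloned()
                .unwrap_or_else(|| Column {
                    name: (*name).to_string(),
                    values: vec![None; batch.num_rows],
                })
        })
        .collect();
    FetchingRecordBatch { num_rows: batch.num_rows, columns }
}

/// Step 3: drop every column that is not part of the projection.
pub fn prune(batch: FetchingRecordBatch, projected: &[&str]) -> RecordBatch {
    let columns = batch
        .columns
        .into_iter()
        .filter(|c| projected.iter().any(|p| *p == c.name))
        .collect();
    RecordBatch { columns }
}
```

In append mode the fetched and projected column lists would be identical, so the pruning step leaves the batch unchanged.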
Main changes:
- Don't pass the `ProjectedSchema`, which carries too much information, to the place where the inner `RecordBatchStream` is built in `ChainIterator` / `MergeIterator`. Instead, refactor the `RowProjector` to include only the needed information and pass that along (mainly via `ScanRequest` and `SstReadOptions`).
- Refactor `RecordBatchWithKey` into `FetchingRecordBatch`. The main difference is that `FetchingRecordBatch` may or may not include `primary_keys_indexes`. Ideally `FetchingRecordBatch` should not include `primary_keys_indexes` at all, but removing it completely is hard, so that can be deferred to later PRs.
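To make the two changes above more concrete, here is one possible shape for a slimmed-down projector. Everything in it is an assumption for illustration: the field names, the constructor signature, and the use of column names as `&str` are hypothetical and are not taken from the existing `RowProjector`. The optional `primary_key_indexes` field mirrors the point that the primary key indexes may or may not be present.

```rust
/// Hypothetical slimmed-down projector carrying only index mappings.
#[derive(Debug, PartialEq)]
pub struct RowProjector {
    /// For each fetched column, its index in the source schema,
    /// or `None` if the source does not contain it.
    pub source_indexes: Vec<Option<usize>>,
    /// For each projected column, its index among the fetched columns.
    pub output_indexes: Vec<usize>,
    /// Indexes of the primary key columns among the fetched columns;
    /// `None` when they were not fetched (for example in append mode).
    pub primary_key_indexes: Option<Vec<usize>>,
}

/// Position of `target` in `names`, if present.
fn position(names: &[&str], target: &str) -> Option<usize> {
    names.iter().position(|n| *n == target)
}

impl RowProjector {
    /// Returns `None` if a projected or primary key column is not
    /// among the fetched columns.
    pub fn new(
        source_schema: &[&str],
        fetched: &[&str],
        projected: &[&str],
        primary_key: Option<&[&str]>,
    ) -> Option<Self> {
        let source_indexes = fetched
            .iter()
            .map(|name| position(source_schema, name))
            .collect();
        let output_indexes = projected
            .iter()
            .map(|name| position(fetched, name))
            .collect::<Option<Vec<_>>>()?;
        let primary_key_indexes = match primary_key {
            Some(keys) => Some(
                keys.iter()
                    .map(|name| position(fetched, name))
                    .collect::<Option<Vec<_>>>()?,
            ),
            None => None,
        };
        Some(Self { source_indexes, output_indexes, primary_key_indexes })
    }
}
```

A value like this is small enough to be carried inside request and read-option structs, which is the motivation given above for not passing the whole `ProjectedSchema`.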
Additional Context
No response
Rachelint changed the title from "Useless to pull all primary key columns in append mode" to "Useless to pull all primary key columns in query of append mode" on Nov 8, 2023.
Rachelint changed the title from "Useless to pull all primary key columns in query of append mode" to "Wasteful to pull all primary key columns in query of append mode" on Nov 16, 2023.