Wasteful to pull all primary key columns in query of append mode #1302

Closed
Rachelint opened this issue Nov 8, 2023 · 0 comments · Fixed by #1307
Labels
feature New feature or request

Rachelint commented Nov 8, 2023

Describe This Problem

We pull the projected columns plus the primary key columns when querying, in both append and overwrite mode.
However, pulling the primary key columns is unnecessary in append mode.

Proposal

Pull only the projected columns when querying in append mode.
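
A minimal sketch of the column selection this proposal implies. `UpdateMode` and `columns_to_fetch` are hypothetical names used only for illustration, not the existing API, and columns are represented by plain indexes:

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the table's update mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateMode {
    Overwrite,
    Append,
}

/// Returns the sorted, de-duplicated column indexes to pull from storage.
/// Primary key columns are added only in overwrite mode.
pub fn columns_to_fetch(
    mode: UpdateMode,
    projected: &[usize],
    primary_key: &[usize],
) -> Vec<usize> {
    let mut columns: BTreeSet<usize> = projected.iter().copied().collect();
    if mode == UpdateMode::Overwrite {
        columns.extend(primary_key.iter().copied());
    }
    columns.into_iter().collect()
}
```

With projected columns `[3, 4]` and primary key columns `[0, 1]`, append mode would pull only the two projected columns, while overwrite mode would still pull all four.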

The record batch reading steps will be divided into the following three:

  • Pull the ArrowRecordBatch.
  • Convert it to a FetchingRecordBatch. Work such as filling a column that does not exist with a null/default value is done here.
    FetchingRecordBatch is used in ChainIterator/MergeIterator, and it may include not only the projected columns but also the primary key columns, for deduplication or other purposes.
  • Prune it to a RecordBatch. As said above, FetchingRecordBatch can include more than the projected columns, so the non-projected ones are pruned here.
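
The three steps above could be sketched roughly as follows. This is a simplified model, not the real implementation: columns are plain `Vec<Option<i64>>` instead of Arrow arrays, missing columns are always filled with nulls (no default values), and the function names `to_fetching` and `prune` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// One column of nullable integers; a stand-in for an Arrow array.
pub type Column = Vec<Option<i64>>;

/// Step 1 output: the raw batch as read from storage, keyed by column name.
pub struct ArrowRecordBatch {
    pub num_rows: usize,
    pub columns: HashMap<String, Column>,
}

/// Step 2 output: every fetched column is present, in fetch order.
pub struct FetchingRecordBatch {
    pub names: Vec<String>,
    pub columns: Vec<Column>,
}

/// Step 3 output: only the projected columns remain.
pub struct RecordBatch {
    pub names: Vec<String>,
    pub columns: Vec<Column>,
}

/// Step 2: build the fetching batch, filling columns that are missing
/// from the stored data with nulls.
pub fn to_fetching(batch: &ArrowRecordBatch, fetched: &[&str]) -> FetchingRecordBatch {
    let columns = fetched
        .iter()
        .map(|name| {
            batch
                .columns
                .get(*name)
                .cloned()
                .unwrap_or_else(|| vec![None; batch.num_rows])
        })
        .collect();
    FetchingRecordBatch {
        names: fetched.iter().map(|n| n.to_string()).collect(),
        columns,
    }
}

/// Step 3: drop every column that is not part of the projection,
/// keeping the remaining columns in fetch order.
pub fn prune(batch: FetchingRecordBatch, projected: &[&str]) -> RecordBatch {
    let mut names = Vec::new();
    let mut columns = Vec::new();
    for (name, column) in batch.names.into_iter().zip(batch.columns) {
        if projected.contains(&name.as_str()) {
            names.push(name);
            columns.push(column);
        }
    }
    RecordBatch { names, columns }
}
```

In overwrite mode the fetched list would contain the primary key columns in addition to the projected ones, and `prune` removes them after merging. In append mode the two lists would be the same, so pruning becomes a no-op.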

Main changes:

  • Don't pass the ProjectedSchema, which carries too much information, to where the inner RecordBatchStream is built in ChainIterator/MergeIterator. Instead, refactor the RowProjector to include just the needed information and pass it to those places (mainly ScanRequest and SstReadOptions).
  • Refactor RecordBatchWithKey into FetchingRecordBatch. The main difference is that FetchingRecordBatch may or may not include primary_keys_indexes. Ideally, FetchingRecordBatch should not include primary_keys_indexes at all, but removing it completely is hard, so we can defer that to later PRs.
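
One possible shape for a slimmed-down projector with optional primary key indexes is sketched below. The field layout and the constructor signature are assumptions for illustration only; the real `RowProjector` may look quite different, and here the optional `primary_keys_indexes` lives on the projector purely to keep the example short.

```rust
/// Hypothetical slim projector: only what the readers need.
#[derive(Debug, Clone, PartialEq)]
pub struct RowProjector {
    /// For each fetched column, its index in the stored schema,
    /// or `None` if the stored data does not contain it.
    pub source_indexes: Vec<Option<usize>>,
    /// Positions of the projected columns inside the fetched columns.
    pub target_indexes: Vec<usize>,
    /// Positions of the primary key columns inside the fetched columns;
    /// `None` when no primary key is needed (append mode).
    pub primary_keys_indexes: Option<Vec<usize>>,
}

impl RowProjector {
    /// `primary_key` is `None` in append mode, so no key columns are fetched.
    pub fn new(stored: &[&str], projected: &[&str], primary_key: Option<&[&str]>) -> Self {
        // Fetched columns: the projection first, then any missing key columns.
        let mut fetched: Vec<&str> = projected.to_vec();
        if let Some(keys) = primary_key {
            for key in keys {
                if !fetched.contains(key) {
                    fetched.push(*key);
                }
            }
        }
        let source_indexes = fetched
            .iter()
            .map(|name| stored.iter().position(|s| s == name))
            .collect();
        let primary_keys_indexes = primary_key.map(|keys| {
            keys.iter()
                .filter_map(|key| fetched.iter().position(|f| f == key))
                .collect()
        });
        RowProjector {
            source_indexes,
            target_indexes: (0..projected.len()).collect(),
            primary_keys_indexes,
        }
    }
}
```

Making the indexes an `Option` lets append-mode readers skip the key columns today while leaving room to remove the field entirely in a later PR.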

Additional Context

No response

Rachelint added the feature (New feature or request) label on Nov 8, 2023
Rachelint changed the title from "Useless to pull all primary key columns in append mode" to "Useless to pull all primary key columns in query of append mode" on Nov 8, 2023
Rachelint changed the title from "Useless to pull all primary key columns in query of append mode" to "Wasteful to pull all primary key columns in query of append mode" on Nov 16, 2023