further compact blocks when insert/load data with many threads #8311

Closed
Tracked by #7823
youngsofun opened this issue Oct 19, 2022 · 5 comments · Fixed by #8644

Comments

youngsofun (Member)

Summary

Blocks are already compacted within each thread; when inserting/loading data with many threads, they should be further compacted across threads.

youngsofun (Member, Author) commented Oct 20, 2022

compactor----------->SinkBigBlock-------------------+
                                                    |
                                                    |
                                                    v
                                                 compactor--------->SinkFinal
                                                    ^
                                                    |
                                                    |
compactor----------->SinkBigBlock-------------------+

The Sink encodes blocks to Parquet and writes them to S3.

This is already used when unloading to files; we can extract it as shared code (with the Sink as a type parameter), roughly as sketched below.
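
A minimal sketch of that extraction, assuming placeholder types (DataBlock, BlockSink, and ParquetBlockWriter here are made up for illustration, not Databend's actual API):

```rust
// Hypothetical sketch: the Parquet-encoding logic shared between "unload to
// file" and "copy into table", with the destination sink as a type parameter.

/// Placeholder for a batch of rows; not Databend's real DataBlock.
pub struct DataBlock {
    pub rows: usize,
}

/// The part that differs between the two use cases: where the encoded bytes go.
pub trait BlockSink {
    fn write_encoded(&mut self, parquet_bytes: Vec<u8>) -> std::io::Result<()>;
}

/// Shared encoder, generic over the sink.
pub struct ParquetBlockWriter<S: BlockSink> {
    sink: S,
}

impl<S: BlockSink> ParquetBlockWriter<S> {
    pub fn new(sink: S) -> Self {
        Self { sink }
    }

    pub fn write_block(&mut self, block: &DataBlock) -> std::io::Result<()> {
        // Stand-in for real Parquet encoding of the block's columns.
        let encoded = vec![0u8; block.rows];
        self.sink.write_encoded(encoded)
    }
}
```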


  1. For COPY INTO from files & streaming load, the per-thread compactor can be avoided.
  2. SinkBigBlock keeps a queue of small_blocks (including those split off from big ones); see the sketch after this list:
    1. put small blocks into the queue and write big blocks at once;
    2. if the queued small_blocks are big enough in total, write them too;
    3. when the input is finished, send the remaining small_blocks to SinkFinal.
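
A minimal sketch of that buffering rule, with assumed names and a row-count threshold standing in for whatever size metric is actually used (not Databend's real processor API):

```rust
// Hypothetical sketch of SinkBigBlock's buffering: big blocks are written at
// once, small blocks are queued, the queue is flushed when it gets big enough,
// and the leftovers go to SinkFinal when the input ends.

pub struct DataBlock {
    pub rows: usize,
}

pub struct SinkBigBlock {
    small_blocks: Vec<DataBlock>,
    buffered_rows: usize,
    /// Threshold above which a block (or the buffered total) counts as "big".
    big_block_rows: usize,
}

impl SinkBigBlock {
    pub fn on_block(&mut self, block: DataBlock) {
        if block.rows >= self.big_block_rows {
            // Big enough already: encode and write it immediately.
            self.write_parquet(&[block]);
        } else {
            // Queue small blocks (including pieces split off from big ones).
            self.buffered_rows += block.rows;
            self.small_blocks.push(block);
            if self.buffered_rows >= self.big_block_rows {
                let batch = std::mem::take(&mut self.small_blocks);
                self.buffered_rows = 0;
                self.write_parquet(&batch);
            }
        }
    }

    pub fn on_finish(&mut self) {
        // Input is done: hand the remaining small blocks to SinkFinal, where
        // they can be compacted together with the other threads' leftovers.
        let rest = std::mem::take(&mut self.small_blocks);
        self.buffered_rows = 0;
        self.send_to_final_sink(rest);
    }

    fn write_parquet(&self, _blocks: &[DataBlock]) { /* encode + write to S3 */ }
    fn send_to_final_sink(&self, _blocks: Vec<DataBlock>) { /* push downstream */ }
}
```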

@sundy-li @zhyass @dantengsky

youngsofun (Member, Author) commented Oct 25, 2022

Because a processor has only one output, we have to choose one of these two (a rough sketch of option 2 follows the list):

  1. merge the compactor into the Sink (call the compactor inside it)
  2. add a flag to the DataBlock indicating whether SinkBigBlock should process it or pass it downstream
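
A rough sketch of what option 2 could look like, with a made-up flag name (pass_through) on an equally made-up DataBlock stand-in; option 1 would instead drop the flag and have the sink own a compactor instance directly:

```rust
// Hypothetical sketch of option 2: the upstream marks each block, and
// SinkBigBlock only routes on the flag, since it has a single output.

pub struct DataBlock {
    pub rows: usize,
    /// Assumed flag: true means "do not write here, pass it on to the
    /// downstream compactor + final sink".
    pub pass_through: bool,
}

pub enum Routed {
    /// SinkBigBlock encoded and wrote the block itself.
    WrittenHere,
    /// The block is forwarded through the single output port.
    SentDownstream(DataBlock),
}

pub fn route(block: DataBlock) -> Routed {
    if block.pass_through {
        Routed::SentDownstream(block)
    } else {
        // Encode to Parquet and write to S3 here.
        Routed::WrittenHere
    }
}
```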

@sundy-li

sundy-li (Member) commented Oct 25, 2022

because a processor has only one output

Why not resize the processors into one? The output could be an Option<Outport>.
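
Roughly, with placeholder types rather than the real pipeline API, the idea might look like this: after resizing the parallel streams down to one, the processor whose output is None is the last one and sinks the blocks itself.

```rust
// Hypothetical sketch: a processor's output port is optional; the instance
// with no output has no downstream and writes the blocks itself instead of
// forwarding them.

pub struct DataBlock {
    pub rows: usize,
}

/// Stand-in for an output port handle.
pub struct Outport;

pub struct Compactor {
    /// None => there is no downstream; this compactor must sink blocks itself.
    output: Option<Outport>,
}

impl Compactor {
    pub fn push(&mut self, block: DataBlock) {
        match &self.output {
            Some(_port) => {
                // Forward the (possibly compacted) block downstream.
                let _ = block;
            }
            None => {
                // No downstream: encode the block and write it here.
                let _ = block;
            }
        }
    }
}
```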

youngsofun (Member, Author)

@sundy-li I mean SinkBigBlock does not have the information needed for that branching: whether to sink a block at once or pass it downstream (to the compactor and final sink, resized to 1).
