feat: Distributed copy into V1 #11840
This pull request's title does not fulfill the requirements. @JackTan25 please update it 🙏. Valid format:
Valid types:
Let me add more tests and test the performance in a real cluster environment.
This is great. But splitting one stage file is very complex, and especially hard for parquet files.
If a COPY only has a single file, don't distribute it, by design.
@BohuTANG Splitting parquet files is not in my plan for now (I plan to research parquet, so it may become possible in the future, but I don't know yet). I do think splitting a CSV or TSV file is a very good idea, since it should improve performance considerably. When there is only one file and it is very large, using a single node to insert it is bad, in my view. In my implementation, if there is only one file and it is not too large, it produces only one split, so the copy is not actually distributed. If there are multiple files, for example 3 parquet files, they naturally produce 3 splits, so the copy is distributed across files.
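The file-to-split behaviour described in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Split` and `assign_splits` are hypothetical names, each file is treated as exactly one split, and the round-robin assignment to nodes is an assumption made only for the example.

```rust
/// Hypothetical unit of work for a COPY: here, one whole staged file.
#[derive(Debug, Clone, PartialEq)]
pub struct Split {
    pub path: String,
}

/// Hypothetical planner: one split per file, dealt round-robin across nodes.
/// With a single file, only one node receives work, so the copy is
/// effectively not distributed.
pub fn assign_splits(files: &[&str], num_nodes: usize) -> Vec<Vec<Split>> {
    // Guard against zero nodes so the modulo below is always valid.
    let mut per_node = vec![Vec::new(); num_nodes.max(1)];
    let node_count = per_node.len();
    for (i, path) in files.iter().enumerate() {
        per_node[i % node_count].push(Split {
            path: path.to_string(),
        });
    }
    per_node
}
```

With three files and three nodes, each node gets one split; with one file, two of the three nodes stay idle, which matches the "one split, so not actually distributed" case above.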
Need some logic tests here to make sure this PR is working as expected.
How should I test the full COPY INTO grammar on real cluster nodes? Is there a config file I can use to run Databend in a cluster environment and get performance figures? There are also some bugs for now; I will fix them later.
For reference, Distributed delete test https://github.com/datafuselabs/databend/pull/11777/files#diff-898c866ebb6ce89a237d7d8f7973779ff8c4f1c0c1d28f4d03939e14f937f479
Force-pushed from c1e08c3 to 0a05f2c
distributed copy and splittable are different
src/query/pipeline/sources/src/input_formats/input_format_text.rs
I will revert my split implementation.
src/query/pipeline/sources/src/input_formats/input_format_text.rs
Splitting text files is not a high priority:
Force-pushed from 06b0efc to 727b3b3
Got it, I have reverted all my split logic.
tests/suites/1_stateful/00_copy/00_0011_dirtributed_copy_from_stage_for_text_file.sh
Force-pushed from de15bf7 to b85bd5d
… into distributed_copy_into
… into distributed_copy_into
tests/suites/1_stateful/00_copy/00_0011_distributed_copy_into_table_exection_test.sh
chore: tidy up test scripts
fix: 00_0008_copy_with_projection test failure
fix: incorrect stage name use in 00_0011_distributed_copy
LGTM. Distributed copy is disabled by default, so let's merge.
remove `set enable_distributed_copy_into..` from py test script
* distributed copy into table from stage
* delete useless codes
* fix zero num_threads
* add more files
* add order
* add more tests
* correct test
* rename to copy_into_table
* reduce test data
* fix test file
* refactor code style, make it more clear
* add explain pipeline for copy stmt
* add explain test
* remove eprintln
* fix explain test result
* fix explain test result
* add purege
* correct explain result
* correct explain result
* use box to avoid stack overflow
* make sure distributed copy test
* refactor code style
* correct result
* remove explain
* drop products
* modify tests
* feat: new setting "enable_distributed_copy_into"
* add more tests
* add more
* correct test
* distributed copy into profile: forbidded
* fix check
* fix clippy
* modify test
* correct local id
* remove error codes
* add config to test
* modify transform
* modify test
* fix test
* update test
* update test
* tidy up test scripts
* fix 00_0008_copy_with_projection test failure
* fix 00_0011_distributed_copy_into_table_exection_test.sh
* fix: incorrect stage name use in 00_0011_distributed_copy
* remove `set enable_distributed_copy_into..` from py test script

---------

Co-authored-by: dantengsky <dantengsky@gmail.com>
Co-authored-by: everpcpc <everpcpc@users.noreply.github.com>
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
This PR is the first version of distributed copy into. It supports only 'copy into table from stage', and only uncompressed 'csv' and 'tsv' files. Other cases will be covered in later PRs.
The design details are below:
Closes #issue
#11752 #8594 #6395
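As a rough illustration of the scope stated in the summary, the conditions under which a COPY might take the distributed path could be sketched like this. Everything here is an assumption for illustration: `FileFormat`, `CopyRequest`, and `should_distribute` are hypothetical names, not types from the Databend codebase. Only the setting name `enable_distributed_copy_into` comes from the PR, and the "more than one file" condition reflects the reviewer's suggestion in the conversation rather than confirmed behaviour.

```rust
/// Hypothetical file formats, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FileFormat {
    Csv,
    Tsv,
    Parquet,
}

/// Hypothetical description of a `copy into table from stage` request.
#[derive(Debug, Clone, Copy)]
pub struct CopyRequest {
    pub format: FileFormat,
    pub compressed: bool,
    pub file_count: usize,
    /// Mirrors the `enable_distributed_copy_into` setting (off by default).
    pub enable_distributed_copy_into: bool,
}

/// Sketch of the eligibility rule: the setting must be on, files must be
/// uncompressed CSV or TSV, and there must be more than one file.
pub fn should_distribute(req: &CopyRequest) -> bool {
    req.enable_distributed_copy_into
        && !req.compressed
        && matches!(req.format, FileFormat::Csv | FileFormat::Tsv)
        && req.file_count > 1
}
```

Keeping the check as a single predicate makes it easy to relax one condition at a time as later PRs add support for more formats or compression.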