feat: Distributed copy into V1 #11840

Merged
merged 59 commits into from
Jun 28, 2023


JackTan25
Contributor

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR is the first version of distributed copy into. It supports only 'copy into table from stage', and only uncompressed CSV and TSV files. Other cases will be covered in later PRs.
The design details are below:
image
image

Closes #issue
#11752 #8594 #6395

@mergify
Contributor

mergify bot commented Jun 22, 2023

This pull request's title does not fulfill the requirements. @JackTan25 please update it 🙏.

Valid format:

fix(query): fix group by string bug
  ^         ^---------------------^
  |         |
  |         +-> Summary in present tense.
  |
  +-------> Type: rfc, feat, fix, refactor, ci, docs, chore

Valid types:

  • rfc: this PR proposes a new RFC
  • feat: this PR introduces a new feature to the codebase
  • fix: this PR patches a bug in codebase
  • refactor: this PR changes the code base without new features or bugfix
  • ci: this PR changes build/testing/ci steps
  • docs: this PR changes the documents or websites
  • chore: this PR only has small changes that no need to record

@JackTan25
Contributor Author

Let me add more tests and measure the performance in a real cluster environment.

@JackTan25 JackTan25 changed the title Distributed copy into feat: Distributed copy into Jun 22, 2023
@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Jun 22, 2023
@JackTan25 JackTan25 changed the title feat: Distributed copy into feat: Distributed copy into V1 Jun 22, 2023
@BohuTANG
Member

This is great.

But splitting a single stage file is very complex, and especially hard for Parquet files.
How about distributing by file instead? The flow would be:

  1. COPY from stage has [file1, file2, file3]
  2. Distribute file1, file2, and file3 across the nodes of the cluster

If a COPY has only a single file, it is not distributed, by design.
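
The flow above can be sketched as follows. This is only an illustration, not code from this PR: the function name `assign_files_to_nodes` and the round-robin policy are assumptions, since the comment does not say how files are assigned to nodes.

```python
from typing import Dict, List


def assign_files_to_nodes(files: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign whole files to nodes (hypothetical helper, round-robin is an assumption)."""
    if not nodes:
        raise ValueError("at least one node is required")
    if len(files) <= 1:
        # A COPY with a single file is not distributed: it stays on one node.
        return {nodes[0]: list(files)}
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    for index, path in enumerate(files):
        # Files are never split here; each one goes to exactly one node.
        plan[nodes[index % len(nodes)]].append(path)
    return plan
```

With three files and two nodes, the first node would receive file1 and file3 and the second node file2.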

cc @dantengsky @youngsofun

@JackTan25
Contributor Author

JackTan25 commented Jun 22, 2023

@BohuTANG Splitting Parquet files is not in my plan for now (I do plan to research Parquet, so it may become possible in the future, I don't know). I think splitting a CSV or TSV file is a very good idea, since it should improve performance considerably. When there is only one file and it is very large, using a single node to insert it is bad (for me). In my implementation, a single file that is not too large produces only one split, so it is effectively not distributed. With multiple files, for example 3 Parquet files, there are naturally 3 splits, so multi-file copies are distributed, don't worry.

@BohuTANG
Member

Need some logic tests here to make sure this PR is working as expected.

@JackTan25
Contributor Author

How should I test the full copy into grammar on real cluster nodes? Is there a config file I can use to run against Databend's cluster environment and get performance figures? There are also some bugs for now. I will fix them later.

@BohuTANG
Member

@JackTan25 JackTan25 force-pushed the distributed_copy_into branch 3 times, most recently from c1e08c3 to 0a05f2c Compare June 24, 2023 12:31
@youngsofun
Member

youngsofun commented Jun 25, 2023

Distributed copy and splittable are different things:

  1. CSV is not splittable; NDJson is not splittable (done); TSV has no standard: Hive Text is splittable, MySQL TSV is not (the current implementation needs improvement here)
  2. Currently a split is a whole file or part of a file (if splittable); distribution is based on splits
  3. A PR for distributed copy only needs to build on the current split implementation
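
To make the distinction concrete, here is a minimal sketch of the split model described in point 2. It is not Databend's implementation: the `Split` type, `make_splits`, and the fixed `chunk_size` are hypothetical, and a real text-file split would also have to align byte ranges to record boundaries, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Split:
    """A whole file (length is None) or a byte range of a file. Hypothetical type."""
    path: str
    offset: int = 0
    length: Optional[int] = None


def make_splits(path: str, size: int, splittable: bool, chunk_size: int) -> List[Split]:
    """Return one whole-file split, or byte-range splits if the format is splittable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not splittable or size <= chunk_size:
        # Non-splittable formats and small files always yield a single split.
        return [Split(path)]
    return [
        Split(path, offset, min(chunk_size, size - offset))
        for offset in range(0, size, chunk_size)
    ]
```

Distribution then operates on the resulting list of splits, regardless of whether each split is a whole file or a part of one.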

@JackTan25
Contributor Author

I will revert my split implementation

@youngsofun
Member

Splitting text files is not a high priority:

  1. It has many requirements: no compression (with a few exceptions), and a text file with no enclosing and with line terminators escaped
  2. Compression is common for text files, to save space
  3. Big single files are more common in log processing with Spark/Hadoop; big structured data is more commonly stored as Parquet
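
The requirements in point 1 can be read as a simple predicate. The sketch below is an assumption-laden illustration: `TextFileTraits` and `is_splittable` are hypothetical names, and the "few exceptions" for compressed files mentioned above are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFileTraits:
    """Properties of a text file that matter for splitting. Hypothetical type."""
    compressed: bool
    has_enclosing: bool
    line_terminator_escaped: bool


def is_splittable(traits: TextFileTraits) -> bool:
    """A text file is treated as splittable only if every requirement holds."""
    return (
        not traits.compressed
        and not traits.has_enclosing
        and traits.line_terminator_escaped
    )
```

Under this reading, a compressed file or a file with enclosed fields is always handled as a single whole-file split.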

@JackTan25
Contributor Author

  1. text file with no enclosing

Got it, I have reverted all my split logic.

@JackTan25 JackTan25 force-pushed the distributed_copy_into branch 2 times, most recently from de15bf7 to b85bd5d Compare June 25, 2023 12:25
@everpcpc everpcpc added the ci-cloud Build docker image for cloud test label Jun 27, 2023
@everpcpc everpcpc added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jun 27, 2023
@github-actions
Contributor

Docker Image for PR

  • tag: pr-11840-f726d5f5f928afd0858d55190f5375156bbfd42f

Member

@dantengsky dantengsky left a comment

LGTM. Distributed copy is disabled by default, so let's merge.

@dantengsky dantengsky merged commit 02adab2 into databendlabs:main Jun 28, 2023
andylokandy pushed a commit to andylokandy/databend that referenced this pull request Nov 27, 2023
* distributed copy into table from stage

* delete useless codes

* fix zero num_threads

* add more files

* add order

* add more tests

* correct test

* rename to copy_into_table

* reduce test data

* fix test file

* refactor code style, make it more clear

* add explain pipeline for copy stmt

* add explain test

* remove eprintln

* fix explain test result

* fix explain test result

* add purege

* correct explain result

* correct explain result

* use box to avoid stack overflow

* make sure distributed copy test

* refactor code style

* correct result

* remove explain

* drop products

* modify tests

* feat: new setting "enable_distributed_copy_into"

* add more tests

* add more

* correct test

* distributed copy into profile: forbidded

* fix check

* fix clippy

* modify test

* correct local id

* remove error codes

* add config to test

* modify transform

* modify test

* fix test

* update test

* update test

* tidy up test scripts

* fix 00_0008_copy_with_projection test failure

* fix 00_0011_distributed_copy_into_table_exection_test.sh

* fix: incorrect stage name use in 00_0011_distributed_copy

* remove `set enable_distributed_copy_into..` from py test script

---------

Co-authored-by: dantengsky <dantengsky@gmail.com>
Co-authored-by: everpcpc <everpcpc@users.noreply.github.com>
Labels
ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase