[Refactor](load) Extract load public code #22304

Merged: 3 commits, Jul 29, 2023

zzzzzzzs (Contributor) commented:

Proposed changes

Extract the common (public) load code.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

// specific language governing permissions and limitations
// under the License.

#include "util/load_util.h"

warning: 'util/load_util.h' file not found [clang-diagnostic-error]

#include "util/load_util.h"
         ^

be/test/util/load_util_test.cpp (resolved)
be/test/util/load_util_test.cpp (outdated, resolved)
be/test/util/load_util_test.cpp (outdated, resolved)
dataroaring (Contributor) previously approved these changes Jul 27, 2023 and left a comment:

LGTM

github-actions bot added the `approved` label (indicates a PR has been approved by one committer), Jul 27, 2023
github-actions bot (Contributor) commented:

PR approved by at least one committer and no changes requested.

github-actions bot (Contributor) commented:

PR approved by anyone and no changes requested.

yiguolei (Contributor) left a comment:

LGTM

yiguolei (Contributor) left a comment:

Please modify code as clang-tidy suggests.

github-actions bot removed the `approved` label (indicates a PR has been approved by one committer), Jul 28, 2023
github-actions bot (Contributor) left a comment:

clang-tidy made some suggestions

be/test/util/load_util_test.cpp (outdated, resolved)
yiguolei (Contributor) left a comment:

lgtm

github-actions bot added the `approved` label (indicates a PR has been approved by one committer), Jul 28, 2023
github-actions bot (Contributor) commented:

PR approved by at least one committer and no changes requested.

yiguolei (Contributor) commented:

run buildall

hello-stephen (Contributor) commented:

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 44.96 seconds
stream load tsv: 507 seconds loaded 74807831229 Bytes, about 140 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17167311072 Bytes
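
For anyone checking the figures above: the reported throughput values appear to be bytes divided by seconds, expressed in MiB (1024 * 1024 bytes) and truncated. The following is a minimal sketch of that arithmetic, not the pipeline's actual code; `mib_per_second` and `rows_per_second` are hypothetical helper names, and the MiB-with-truncation reading is an assumption inferred from the numbers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: throughput in whole MiB/s (1 MiB = 1024 * 1024 bytes).
// Assumption: the report's "MB/s" values are truncated MiB/s figures.
inline std::int64_t mib_per_second(std::int64_t bytes, std::int64_t seconds) {
    assert(seconds > 0);  // a zero or negative duration has no meaningful rate
    return bytes / seconds / (1024 * 1024);
}

// Hypothetical helper: rows inserted per second for the "insert into select" line.
inline double rows_per_second(std::int64_t rows, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(rows) / seconds;
}
```

With the inputs from the report, `mib_per_second(74807831229, 507)` gives 140 and `mib_per_second(1101869774, 65)` gives 16, and `rows_per_second(10000000, 29.4)` is roughly 340 thousand, which agrees with the posted values.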

xiaokang (Contributor) left a comment:

LGTM

@xiaokang xiaokang merged commit 765f1b6 into apache:master Jul 29, 2023
morningman added a commit that referenced this pull request Aug 26, 2023
…ock codec (#23245)

1. Do not split compressed data files
Some data files in Hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.

2. Support lz4 block codec
For the Hive scan node, use the lz4 block codec instead of the lz4 frame codec.

3. Support snappy block codec
For Hadoop snappy.

4. Optimize the `count(*)` query on CSV files
For queries like `select count(*) from tbl`, only the lines need to be split; the columns do not.

Need to pick to branch-2.0 after this PR: #22304
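
To illustrate the `count(*)` item in the commit message above: counting rows only requires locating line delimiters, so the per-column split can be skipped entirely. The following is a minimal sketch of that idea, assuming a newline-delimited buffer with no quoted or escaped newlines; `count_csv_rows` is a hypothetical helper and not the function added in #23245.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: count rows by scanning for the line delimiter only.
// Column delimiters are never inspected, so no per-field work is done.
// Assumptions: no quoted newlines; an empty line counts as a row;
// a trailing delimiter does not start an extra row.
inline std::size_t count_csv_rows(std::string_view data, char line_delimiter = '\n') {
    std::size_t rows = 0;
    std::size_t pos = 0;
    while (pos < data.size()) {
        const std::size_t end = data.find(line_delimiter, pos);
        ++rows;  // every non-empty remainder holds at least one more row
        if (end == std::string_view::npos) {
            break;  // last row has no trailing delimiter
        }
        pos = end + 1;
    }
    return rows;
}
```

A real reader would also have to handle multi-character delimiters and rows that span buffer boundaries, which this sketch leaves out.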
morningman pushed a commit to morningman/doris that referenced this pull request Aug 26, 2023
morningman added a commit to morningman/doris that referenced this pull request Aug 26, 2023
…ock codec (apache#23245)

xiaokang pushed a commit that referenced this pull request Aug 26, 2023
…ock codec (#23245) (#23526)

* [Refactor](load) Extract load public code (#22304)

* [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)


---------

Co-authored-by: zzzzzzzs <1443539042@qq.com>
tudouzhao pushed a commit to tudouzhao/doris that referenced this pull request Aug 26, 2023
…ock codec (apache#23245)

@zzzzzzzs zzzzzzzs deleted the refactor-load-public-code branch October 3, 2023 14:00
Labels: approved, dev/2.0.1-merged, reviewed
6 participants