@hubgeter hubgeter commented Dec 7, 2024

What problem does this PR solve?

Problem Summary:
Optimize reading of MaxCompute partitioned tables:

  1. Introduce batch mode for generating splits of MaxCompute partitioned tables, which optimizes scenarios with a large number of partitions. It is controlled by the variable num_partitions_in_batch_mode.
  2. Introduce the catalog parameter mc.split_cross_partition. Setting it to true is friendlier to reading partitioned tables; setting it to false is friendlier to debugging.
  3. Add -Darrow.enable_null_check_for_get=false to the BE JVM options to improve the efficiency of MaxCompute Arrow data conversion.
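
As a rough illustration of item 1, the sketch below shows one way a threshold variable such as num_partitions_in_batch_mode could gate batch mode. It assumes the variable acts as a threshold on the number of partitions and that a non-positive value disables batch mode; neither assumption is confirmed by this PR description. The class and method names are hypothetical and are not Doris code.

```java
// Hypothetical sketch; not the Doris implementation.
final class BatchModeDecision {
    private BatchModeDecision() {}

    /**
     * Assumption: a non-positive threshold disables batch mode, and batch mode
     * is used once the partition count reaches the threshold.
     */
    static boolean useBatchMode(int partitionCount, int numPartitionsInBatchMode) {
        return numPartitionsInBatchMode > 0 && partitionCount >= numPartitionsInBatchMode;
    }

    public static void main(String[] args) {
        // The threshold 1024 is an illustrative value only.
        System.out.println(useBatchMode(1500, 1024));
        System.out.println(useBatchMode(10, 1024));
    }
}
```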

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Contributor Author

hubgeter commented Dec 7, 2024

run buildall

  }

- public void setException(UserException e) {
+ public synchronized void setException(UserException e) {

Contributor

Why is synchronized used here? Please add a comment in the code.

Contributor Author

It seems that there is no need to add synchronized.
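
For context on the exchange above: if several threads could report an error to the same object, one lock-free alternative to a synchronized setter is to keep the first exception in an AtomicReference. The sketch below is only an illustration of that technique; FirstErrorHolder is a hypothetical class, and the thread does not establish whether the Doris class is actually written to from multiple threads.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the class touched by this PR.
final class FirstErrorHolder {
    private final AtomicReference<Exception> error = new AtomicReference<>();

    /** Stores e only if no exception was recorded before; returns true if e was stored. */
    boolean setException(Exception e) {
        return error.compareAndSet(null, e);
    }

    /** Returns the first recorded exception, or null if none was recorded. */
    Exception getException() {
        return error.get();
    }
}
```

With this shape, later calls to setException leave the first recorded exception in place, so readers always see the earliest failure.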

CompletableFuture.runAsync(() -> {
    try {
        TableBatchReadSession tableBatchReadSession =
                createTableBatchReadSession(requiredBatchPartitionSpecs);

Contributor

It looks like this new implementation creates a read session for each partition?
Is creating a read session a heavy operation?

Contributor Author

A read session is created for a batch of partitions, not for each partition. createTableBatchReadSession involves network I/O, and that I/O becomes very slow as the number of partitions grows. I tested with 1500 partitions and it took about 13 seconds.
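
The idea of one session per batch of partitions, opened asynchronously, can be sketched as follows. This is a minimal illustration under stated assumptions: openSession is a stand-in for the real createTableBatchReadSession call (its signature is not shown in this thread), the batch size is an arbitrary parameter, and BatchedSessionSketch is a hypothetical class rather than Doris code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; not the Doris implementation.
final class BatchedSessionSketch {
    private BatchedSessionSketch() {}

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partitionIntoBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Stand-in for the network call that opens a read session for one batch. */
    static String openSession(List<String> partitionSpecs) {
        return "session(" + partitionSpecs.size() + " partitions)";
    }

    /** Opens one session per batch on the common pool and waits for all of them. */
    static List<String> openSessionsAsync(List<String> partitionSpecs, int batchSize) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (List<String> batch : partitionIntoBatches(partitionSpecs, batchSize)) {
            futures.add(CompletableFuture.supplyAsync(() -> openSession(batch)));
        }
        List<String> sessions = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            sessions.add(future.join()); // results are collected in batch order
        }
        return sessions;
    }
}
```

Batching bounds the number of partitions sent in each session request, and running the requests asynchronously lets split generation begin before every batch has finished.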

@hubgeter
Contributor Author

hubgeter commented Dec 8, 2024

run buildall

@github-actions
Contributor

github-actions bot commented Dec 8, 2024

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Dec 8, 2024
@github-actions
Contributor

github-actions bot commented Dec 8, 2024

PR approved by anyone and no changes requested.

@morningman morningman merged commit 4cf908c into apache:master Dec 9, 2024
26 of 27 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 9, 2024
morningman pushed a commit that referenced this pull request Dec 10, 2024
…ables. #45148 (#45168)

Cherry-picked from #45148

Co-authored-by: daidai <changyuwei@selectdb.com>
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 10, 2024
…he#45148)

yiguolei pushed a commit that referenced this pull request Dec 11, 2024
…) (#45246)

bp #45148

@yiguolei yiguolei mentioned this pull request Jan 19, 2025
@gavinchou gavinchou mentioned this pull request Feb 18, 2025

Labels

approved (Indicates a PR has been approved by one committer.), dev/2.1.8-merged, dev/3.0.4-merged, reviewed

5 participants