[fix](Hudi-mtmv) Support asynchronous materialized view partition refresh feature for Hudi external tables. #49956
Merged
Contributor
Author
|
run buildall |
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
TPC-H: Total hot run time: 33994 ms |
TPC-DS: Total hot run time: 194215 ms |
ClickBench: Total hot run time: 30.93 s |
Contributor
Author
|
run buildall |
TPC-H: Total hot run time: 33977 ms |
TPC-DS: Total hot run time: 192263 ms |
ClickBench: Total hot run time: 29.69 s |
…nvolved in transparent rewriting (apache#49513)
When Hudi tables take part in transparent rewriting, the base-table timestamp returned by `loadSnapshot` does not match the timestamp stored after a partition refresh. The `tableSnapshot` comparison therefore fails and the materialized view is not hit. How `loadSnapshot` obtains the base-table timestamp does not yet behave as expected, and changing it needs more in-depth investigation. As a workaround, `getTableSnapshot` now always returns `0L`, which marks the `tableSnapshot` as always synchronized. This matches the original expectation for manual refresh. The underlying issue will be addressed later.
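The effect of the workaround can be illustrated with a minimal sketch. Everything here is hypothetical (the class `SnapshotSyncSketch`, its methods, and the use of a bare `long` as the snapshot) and is not the Doris implementation. It only shows why a constant snapshot value makes the equality check succeed, while timestamps obtained from two different code paths may not.

```java
// Hypothetical stand-in for the snapshot comparison described above; not Doris code.
final class SnapshotSyncSketch {

    /** Workaround: report the same snapshot version no matter when it is called. */
    static long currentSnapshotVersion() {
        return 0L;
    }

    /** The view counts as synchronized only when both versions are equal. */
    static boolean isSynced(long versionStoredAtRefresh, long versionSeenAtQuery) {
        return versionStoredAtRefresh == versionSeenAtQuery;
    }

    public static void main(String[] args) {
        long storedAtRefresh = currentSnapshotVersion();
        long seenAtQuery = currentSnapshotVersion();
        // Constant versions always compare equal.
        System.out.println(isSynced(storedAtRefresh, seenAtQuery));
        // Two timestamps taken from different sources can differ and fail the check.
        System.out.println(isSynced(1000L, 1001L));
    }
}
```

The trade-off in this sketch is that a constant version can never signal that the base table changed, which fits the commit message's description of this as a temporary bypass.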
f3f2ca7 to
a789d11
Compare
Contributor
Author
|
run buildall |
TPC-H: Total hot run time: 34134 ms |
TPC-DS: Total hot run time: 192911 ms |
ClickBench: Total hot run time: 30.08 s |
hubgeter
approved these changes
Apr 22, 2025
Contributor
|
PR approved by anyone and no changes requested. |
morningman
approved these changes
Apr 23, 2025
Contributor
|
PR approved by at least one committer and no changes requested. |
morningman
pushed a commit
that referenced
this pull request
Apr 29, 2025
### What problem does this PR solve?
Followup #49956
Problem Summary: When a query specifies a snapshot, the schema corresponding to that snapshot should be used for parsing; otherwise the latest schema should be used.
1. When using the HMS type, the executor pool must also be initialized.
2. The thread pool size is set to the number of cores of the current machine.
3. When no snapshot is specified, the latest schema is used.
4. When a snapshot is specified, the schema corresponding to that snapshot is used.
5. When generating a scan node, the schema information is saved on it and no longer read from the cache, so a later cache refresh cannot change it.
6. When refreshing the schema, all schemas of the related tables must be refreshed.
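Points 2 to 4 above can be sketched roughly as follows. The class `SchemaBySnapshotSketch`, its methods, and the representation of a schema as a list of column names keyed by commit time are assumptions made for illustration, not the Doris code. The only real APIs used are from the Java standard library (`TreeMap`, `Executors.newFixedThreadPool`, `Runtime.availableProcessors`).

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: schemas are modeled as column-name lists keyed by commit time.
final class SchemaBySnapshotSketch {
    private final TreeMap<Long, List<String>> schemaByCommitTime = new TreeMap<>();

    void addSchema(long commitTime, List<String> columns) {
        schemaByCommitTime.put(commitTime, List.copyOf(columns));
    }

    /** With a snapshot: the schema in effect at that time. Without one: the latest schema. */
    List<String> resolveSchema(Optional<Long> snapshotTime) {
        Map.Entry<Long, List<String>> entry = snapshotTime.isPresent()
                ? schemaByCommitTime.floorEntry(snapshotTime.get())
                : schemaByCommitTime.lastEntry();
        if (entry == null) {
            throw new IllegalArgumentException("no schema available for " + snapshotTime);
        }
        return entry.getValue();
    }

    /** A fixed pool with one thread per available core, as in point 2. */
    static ExecutorService newCoreSizedPool() {
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        SchemaBySnapshotSketch schemas = new SchemaBySnapshotSketch();
        schemas.addSchema(100L, List.of("id"));
        schemas.addSchema(200L, List.of("id", "name"));
        System.out.println(schemas.resolveSchema(Optional.of(150L)));
        System.out.println(schemas.resolveSchema(Optional.empty()));
        ExecutorService pool = newCoreSizedPool();
        pool.shutdown();
    }
}
```

In the spirit of point 5, a caller would keep the list returned by `resolveSchema` for the lifetime of the scan rather than looking it up again, so that a refresh of the underlying map cannot change the schema mid-query.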
morningman
pushed a commit
that referenced
this pull request
May 22, 2025
koarz
pushed a commit
to koarz/doris
that referenced
this pull request
Jun 4, 2025
…resh feature for Hudi external tables. (apache#49956) Problem Summary: Support asynchronous materialized view partition refresh feature for Hudi external tables.
koarz
pushed a commit
to koarz/doris
that referenced
this pull request
Jun 4, 2025
### What problem does this PR solve?
Followup apache#49956
Problem Summary: When a query specifies a snapshot, the schema corresponding to that snapshot should be used for parsing; otherwise the latest schema should be used.
1. When using the HMS type, the executor pool must also be initialized.
2. The thread pool size is set to the number of cores of the current machine.
3. When no snapshot is specified, the latest schema is used.
4. When a snapshot is specified, the schema corresponding to that snapshot is used.
5. When generating a scan node, the schema information is saved on it and no longer read from the cache, so a later cache refresh cannot change it.
6. When refreshing the schema, all schemas of the related tables must be refreshed.
koarz
pushed a commit
to koarz/doris
that referenced
this pull request
Jun 4, 2025
…pache#50979)
Problem Summary: Related PR: apache#48172. PR apache#48172 changed the logic of the `beforeMTMVRefresh` method, but PR apache#49956 added the old code back. This commit deletes that code again.
zddr
pushed a commit
to zddr/incubator-doris
that referenced
this pull request
Jun 19, 2025
…resh feature for Hudi external tables. (apache#49956) Problem Summary: Support asynchronous materialized view partition refresh feature for Hudi external tables.
zddr
pushed a commit
to zddr/incubator-doris
that referenced
this pull request
Jun 19, 2025
…pache#50979)
Problem Summary: Related PR: apache#48172. PR apache#48172 changed the logic of the `beforeMTMVRefresh` method, but PR apache#49956 added the old code back. This commit deletes that code again.
morningman
pushed a commit
to morningman/doris
that referenced
this pull request
Jun 24, 2025
Followup apache#49956
Problem Summary: When a query specifies a snapshot, the schema corresponding to that snapshot should be used for parsing; otherwise the latest schema should be used.
1. When using the HMS type, the executor pool must also be initialized.
2. The thread pool size is set to the number of cores of the current machine.
3. When no snapshot is specified, the latest schema is used.
4. When a snapshot is specified, the schema corresponding to that snapshot is used.
5. When generating a scan node, the schema information is saved on it and no longer read from the cache, so a later cache refresh cannot change it.
6. When refreshing the schema, all schemas of the related tables must be refreshed.
morningman
pushed a commit
to morningman/doris
that referenced
this pull request
Jun 25, 2025
Followup apache#49956
Problem Summary: When a query specifies a snapshot, the schema corresponding to that snapshot should be used for parsing; otherwise the latest schema should be used.
1. When using the HMS type, the executor pool must also be initialized.
2. The thread pool size is set to the number of cores of the current machine.
3. When no snapshot is specified, the latest schema is used.
4. When a snapshot is specified, the schema corresponding to that snapshot is used.
5. When generating a scan node, the schema information is saved on it and no longer read from the cache, so a later cache refresh cannot change it.
6. When refreshing the schema, all schemas of the related tables must be refreshed.
morningman
pushed a commit
that referenced
this pull request
Jun 26, 2025
…1152)
### What problem does this PR solve?
Related PR: #49956
Problem Summary: PR #49956 introduced `HudiMvccSnapshot` to implement asynchronous materialized view partition refresh for Hudi. That PR uses the `LastUpdateTimestamp` of the `TablePartitionValues` held in `HudiMvccSnapshot` to obtain the Hudi schema. For a non-partitioned table, `LastUpdateTimestamp` is always 0, so the actual Hudi schema is not obtained. Following `IcebergMvccSnapshot`, this PR adds a `timestamp` to `HudiMvccSnapshot` so that the correct Hudi schema, which contains information such as the column unique id, can be obtained.
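The difference between the two lookups can be sketched as below. The classes `HudiSnapshotSketch`, `PartitionValues`, and `Snapshot` and all of their methods are simplified, hypothetical stand-ins, not the real `HudiMvccSnapshot` or `TablePartitionValues`. The sketch assumes that the partition-level timestamp only advances when a partition is updated, which is why it stays at 0 for a non-partitioned table.

```java
// Hypothetical, simplified stand-ins; these are not the Doris classes.
final class HudiSnapshotSketch {

    /** Partition metadata whose timestamp only moves when a partition is updated. */
    static final class PartitionValues {
        private long lastUpdateTimestamp = 0L;

        void recordPartitionUpdate(long timestamp) {
            lastUpdateTimestamp = Math.max(lastUpdateTimestamp, timestamp);
        }

        long getLastUpdateTimestamp() {
            return lastUpdateTimestamp;
        }
    }

    /** Snapshot that carries its own timestamp in addition to the partition metadata. */
    static final class Snapshot {
        private final PartitionValues partitionValues;
        private final long timestamp;

        Snapshot(PartitionValues partitionValues, long timestamp) {
            this.partitionValues = partitionValues;
            this.timestamp = timestamp;
        }

        /** Earlier approach: 0 for a table that has no partitions to update. */
        long schemaTimeFromPartitions() {
            return partitionValues.getLastUpdateTimestamp();
        }

        /** Approach after the fix: independent of partitioning. */
        long schemaTimeFromSnapshot() {
            return timestamp;
        }
    }

    public static void main(String[] args) {
        // Non-partitioned table: no partition update is ever recorded.
        Snapshot snapshot = new Snapshot(new PartitionValues(), 20250626L);
        System.out.println(snapshot.schemaTimeFromPartitions());
        System.out.println(snapshot.schemaTimeFromSnapshot());
    }
}
```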
morningman
pushed a commit
to morningman/doris
that referenced
this pull request
Jun 30, 2025
Followup apache#49956
Problem Summary: When a query specifies a snapshot, the schema corresponding to that snapshot should be used for parsing; otherwise the latest schema should be used.
1. When using the HMS type, the executor pool must also be initialized.
2. The thread pool size is set to the number of cores of the current machine.
3. When no snapshot is specified, the latest schema is used.
4. When a snapshot is specified, the schema corresponding to that snapshot is used.
5. When generating a scan node, the schema information is saved on it and no longer read from the cache, so a later cache refresh cannot change it.
6. When refreshing the schema, all schemas of the related tables must be refreshed.
hubgeter
added a commit
to hubgeter/doris
that referenced
this pull request
Jul 2, 2025
…ache#51152)
### What problem does this PR solve?
Related PR: apache#49956
Problem Summary: PR apache#49956 introduced `HudiMvccSnapshot` to implement asynchronous materialized view partition refresh for Hudi. That PR uses the `LastUpdateTimestamp` of the `TablePartitionValues` held in `HudiMvccSnapshot` to obtain the Hudi schema. For a non-partitioned table, `LastUpdateTimestamp` is always 0, so the actual Hudi schema is not obtained. Following `IcebergMvccSnapshot`, this PR adds a `timestamp` to `HudiMvccSnapshot` so that the correct Hudi schema, which contains information such as the column unique id, can be obtained.
Problem Summary:
Support asynchronous materialized view partition refresh feature for Hudi external tables.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)