Optimize the read performance of tables with multiple versions #4958
Changed the merge method of the unique table: the cumulative version data is merged first, and the result is then merged with the base version. Data that has only a single base version is read directly, without merging (see the sketch below).
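To make the strategy concrete, here is a minimal, runnable sketch in C++. It uses sorted `std::vector<int>` as a stand-in for rowsets; `Rowset`, `merge_two`, and `read` are hypothetical names for illustration, not the actual Doris API.

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

using Rowset = std::vector<int>;  // stand-in: rows already sorted by key

// Two-way merge of two sorted runs into one sorted run.
Rowset merge_two(const Rowset& a, const Rowset& b) {
    Rowset out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

Rowset read(const Rowset& base, const std::vector<Rowset>& cumulatives) {
    // Only a single base version: read it directly, no merge at all.
    if (cumulatives.empty()) return base;
    // Phase 1: merge the (small) cumulative versions with each other first.
    Rowset merged_cumulative;
    for (const Rowset& rs : cumulatives) {
        merged_cumulative = merge_two(merged_cumulative, rs);
    }
    // Phase 2: one final merge of the result with the (large) base version.
    return merge_two(base, merged_cumulative);
}

int main() {
    Rowset base = {1, 3, 5, 7, 9};
    std::vector<Rowset> cumulatives = {{2, 6}, {4, 8}};
    for (int v : read(base, cumulatives)) std::cout << v << ' ';
    std::cout << '\n';  // prints: 1 2 3 4 5 6 7 8 9
}
```

The point of the two phases is that the base rowset is usually much larger than all the cumulative rowsets combined, so its many rows pass through only one two-way merge instead of a k-way merge.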
Proposed changes
Optimize the read performance of AGG and UNIQUE tables with too many versions.
The benchmark results are as follows.
By the way, in most cases "merge" in our code means merge sort, so to avoid ambiguity I renamed some functions and variables:
MergeHeap -> MergeSortHeap
MergeIterator -> MergeSortIterator
new_merge_iterator -> new_sort_iterator
Test Data
This test data set is the catalog_sales data from the TPC-DS 10G data set. The data is divided into two parts: the complete data (corresponding to the big table), about 3G with 28,802,522 rows in total (14,401,261 rows after unique deduplication), and a 200M sample (corresponding to the test table) with 1,000,000 rows in total. Only one partition is used, and the complete data is divided into 10 buckets.
All tables are in segment V2 format
This test mainly measures read performance, especially when a large number of small versions must be merged, so the test query is

select count(*) from (select k1,k2,k3,k4,k5,k6,k7,k8,k9,k10 from table_name) a;

UNIQUE_KEY and DUPLICATE_KEY comparison
First of all, this test compares the read performance of the UNIQUE_KEY table with that of the DUPLICATE_KEY table. The first version of both tables is an empty version.
It can be seen that the read speed of the duplicate table is about twice that of the unique table.
Results after optimization
After optimization, when the number of versions is relatively small, the query performance is not much different.
UNIQUE_KEY multi-version reading optimization comparison
Since the data imported in the multiple versions is random, the data of the non-full versions differs slightly; the test query is

select count(*) from table_name

When a table has many versions, a lot of time is spent sorting and merging rowsets.
In our scenario, this sorting is actually a merge sort of multiple ordered queues. Usually, because of compaction, there is one relatively large base rowset and several small rowsets, so we can merge-sort the small rowsets first and then merge the result with the large rowset to optimize read performance.
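The following self-contained sketch shows the k-way merge-sort-heap idea behind MergeSortHeap/MergeSortIterator; `Cursor`, `merge_sort`, and the integer runs are illustrative simplifications, not the real rowset reader types.

```cpp
#include <iostream>
#include <queue>
#include <vector>

// A cursor over one sorted run (stand-in for a rowset reader).
struct Cursor {
    const std::vector<int>* run;
    size_t pos = 0;
    int value() const { return (*run)[pos]; }
    bool exhausted() const { return pos >= run->size(); }
};

// Orders cursors so the smallest current key is on top of the heap.
struct CursorGreater {
    bool operator()(const Cursor& a, const Cursor& b) const {
        return a.value() > b.value();
    }
};

std::vector<int> merge_sort(const std::vector<std::vector<int>>& runs) {
    std::priority_queue<Cursor, std::vector<Cursor>, CursorGreater> heap;
    for (const auto& r : runs) {
        if (!r.empty()) heap.push(Cursor{&r, 0});
    }
    std::vector<int> out;
    while (!heap.empty()) {
        Cursor c = heap.top();
        heap.pop();
        out.push_back(c.value());        // emit the smallest current row
        ++c.pos;                         // advance that run's cursor
        if (!c.exhausted()) heap.push(c);
    }
    return out;
}

int main() {
    // One large "base" run plus small "cumulative" runs. Every row popped
    // costs O(log k) heap work, so pre-merging the small runs shrinks k
    // for the pass that touches the many rows of the base run.
    std::vector<std::vector<int>> runs = {{1, 4, 7, 10}, {2, 5}, {3, 6}};
    for (int v : merge_sort(runs)) std::cout << v << ' ';
    std::cout << '\n';  // prints: 1 2 3 4 5 6 7 10
}
```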
Types of changes
What types of changes does your code introduce to Doris?
Put an x in the boxes that apply.

Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...