[opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377) (#38245) #38810

Merged
merged 2 commits into from
Aug 5, 2024


hubgeter
Contributor

@hubgeter hubgeter commented Aug 2, 2024

Proposed changes

Pick PR #38575 and the fix for its bug, #38245.

… without repeated deserialization. (apache#37377)

## Proposed changes

Since the value of the partition column is fixed when querying the
partition table, we can deserialize the value only once and then
repeatedly insert the value into the block.
```sql
-- In Hive:
CREATE TABLE parquet_partition_tb (
    col1 STRING,
    col2 INT,
    col3 DOUBLE
) PARTITIONED BY (
    partition_col1 STRING,
    partition_col2 INT
)
STORED AS PARQUET;

insert into  parquet_partition_tb partition (partition_col1="hello",partition_col2=1) values("word",2,2.3);

insert into parquet_partition_tb partition(partition_col1="hello",partition_col2=1 )  
select col1,col2,col3 from  parquet_partition_tb where partition_col1="hello" and partition_col2=1;
-- Repeat the `insert into xxx select xxx` operation several times.


-- In Doris, before:
mysql>  select count(partition_col1) from parquet_partition_tb;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.24 sec)

mysql>  select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (3.34 sec)


-- In Doris, after:
mysql>  select count(partition_col1) from parquet_partition_tb ;
+-----------------------+
| count(partition_col1) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.79 sec)

mysql> select count(partition_col2) from parquet_partition_tb;
+-----------------------+
| count(partition_col2) |
+-----------------------+
|              33554432 |
+-----------------------+
1 row in set (0.51 sec)

```
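
The idea described above (deserialize the fixed partition value once, then insert it repeatedly) can be sketched in a few lines. This is only an illustration using `std::vector` as a stand-in for a block column; `parse_int`, `fill_by_reparsing`, and `fill_by_parsing_once` are hypothetical names and not functions from the Doris code base.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical stand-in for deserializing one textual partition value.
inline int32_t parse_int(std::string_view text) {
    int32_t value = 0;
    std::from_chars(text.data(), text.data() + text.size(), value);
    return value;
}

// Before: the same text is deserialized again for every row.
inline void fill_by_reparsing(std::vector<int32_t>& column,
                              std::string_view text, std::size_t rows) {
    for (std::size_t i = 0; i < rows; ++i) {
        column.push_back(parse_int(text));
    }
}

// After: deserialize once, then append `rows` copies of the value.
inline void fill_by_parsing_once(std::vector<int32_t>& column,
                                 std::string_view text, std::size_t rows) {
    const int32_t value = parse_int(text);
    column.insert(column.end(), rows, value);
}
```

Both functions leave the column with the same contents; the second simply avoids doing the parsing work once per row.
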
## Summary:
Test SQL: `select count(partition_col) from tbl;`
Number of rows: 33554432

| Partition column type | Before (sec) | After (sec) |
|---|---|---|
| boolean | 3.96 | 0.47 |
| tinyint | 3.39 | 0.47 |
| smallint | 3.14 | 0.50 |
| int | 3.34 | 0.51 |
| bigint | 3.61 | 0.51 |
| float | 4.59 | 0.51 |
| double | 4.60 | 0.55 |
| decimal(5,2) | 3.96 | 0.61 |
| date | 5.80 | 0.52 |
| timestamp | 7.68 | 0.52 |
| string | 3.24 | 0.79 |

…rom_fixed_json (apache#38245)

## Proposed changes
Fix a bug in `DataTypeNullableSerDe.deserialize_column_from_fixed_json`.

The expected behavior of the `deserialize_column_from_fixed_json`
function is to `insert` n values into the column.

However, the `DataTypeNullableSerDe` implementation of this function
calls `resize` on the null_map column with n, which sets its size to n
rather than inserting n values into it. Since this function is only used
by `_fill_partition_columns` of the `parquet/orc reader` and is not called
repeatedly within a single `get_next_block`, the bug was hidden.
Previous PR: apache#37377
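
The difference between `resize` and `insert` described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Doris implementation: `NullableColumn`, `fill_fixed_buggy`, and `fill_fixed` are hypothetical names, and `std::vector` stands in for the real column types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified nullable column: a null map plus nested values.
struct NullableColumn {
    std::vector<uint8_t> null_map;
    std::vector<int32_t> nested;
};

// Buggy variant: resize sets the null map's total size to `rows`
// instead of growing it by `rows`.
inline void fill_fixed_buggy(NullableColumn& col, int32_t value, std::size_t rows) {
    col.null_map.resize(rows);
    col.nested.insert(col.nested.end(), rows, value);
}

// Fixed variant: both the null map and the nested column grow by `rows`.
inline void fill_fixed(NullableColumn& col, int32_t value, std::size_t rows) {
    col.null_map.insert(col.null_map.end(), rows, uint8_t{0});
    col.nested.insert(col.nested.end(), rows, value);
}
```

On an empty column the two variants give the same result, which is why a single call per block hides the problem. On a second call, the buggy variant leaves the null map at `rows` entries while the nested column holds `2 * rows`, so the two parts no longer line up.
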
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@hubgeter
Contributor Author

hubgeter commented Aug 2, 2024

run buildall

Contributor

github-actions bot commented Aug 2, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.39% (9256/25438)
Line Coverage: 27.92% (75671/271012)
Region Coverage: 26.76% (38898/145385)
Branch Coverage: 23.47% (19730/84050)
Coverage Report: http://coverage.selectdb-in.cc/coverage/2b7b903ff2aa7d61e2b6b43f01e0e852ea98bdc4_2b7b903ff2aa7d61e2b6b43f01e0e852ea98bdc4/report/index.html

@yiguolei yiguolei merged commit 607c0b8 into apache:branch-2.1 Aug 5, 2024
20 of 22 checks passed
@yiguolei yiguolei mentioned this pull request Sep 5, 2024
3 tasks