@Jibing-Li
Contributor

A column may store very long text values. Collecting NDV (number of distinct values) for such a column can use too much memory and may cause a BE OOM. This PR limits the maximum string length to 1024 while collecting column stats, to control BE memory usage.
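The effect of the limit can be illustrated with a small sketch. This is not the code changed by this PR: a plain Python set stands in for whatever NDV structure the BE uses, the helper `truncated_ndv` is hypothetical, and truncation by characters (rather than bytes) is an assumption. It shows that each retained value is bounded by the limit, at the cost of undercounting values that differ only after the first 1024 characters.

```python
MAX_STAT_STRING_LENGTH = 1024  # the limit named in the PR description


def truncated_ndv(values, max_len=MAX_STAT_STRING_LENGTH):
    """Count distinct values after truncating each string to max_len characters.

    Hypothetical helper: a set stands in for the real NDV structure.
    """
    seen = set()
    for value in values:
        seen.add(value[:max_len])  # at most max_len characters are retained
    return len(seen)


# Two long values that differ only after character 1024, plus a short duplicate.
long_a = "x" * 2000 + "a"
long_b = "x" * 2000 + "b"
short = "hello"
values = [long_a, long_b, short, short]

exact = len(set(values))         # distinct count on the full strings
approx = truncated_ndv(values)   # distinct count on the truncated strings
longest_kept = max(len(v[:MAX_STAT_STRING_LENGTH]) for v in values)
```

Here `exact` and `approx` differ because the two long strings collapse to the same 1024-character prefix, which is the accuracy trade-off accepted in exchange for bounded memory.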

Proposed changes

Issue Number: close #xxx

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@Jibing-Li
Contributor Author

run buildall

@Jibing-Li Jibing-Li marked this pull request as ready for review March 19, 2024 09:11
@Jibing-Li
Contributor Author

run buildall

@Jibing-Li
Contributor Author

run external

@Jibing-Li
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 38444 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 267cc5d157a167dd4f78fbbcc090906c7a46e385, data reload: false

------ Round 1 ----------------------------------
q1	17649	4405	4161	4161
q2	2118	158	158	158
q3	10600	1210	1227	1210
q4	10236	775	796	775
q5	7529	3103	3040	3040
q6	204	123	123	123
q7	1038	593	564	564
q8	9352	2055	2011	2011
q9	7125	6672	6624	6624
q10	8408	3497	3535	3497
q11	438	224	221	221
q12	374	209	193	193
q13	17788	2844	2861	2844
q14	249	211	207	207
q15	517	463	482	463
q16	492	377	378	377
q17	985	544	643	544
q18	7266	6545	6585	6545
q19	3683	1442	1414	1414
q20	554	261	252	252
q21	3680	2917	2929	2917
q22	356	314	304	304
Total cold run time: 110641 ms
Total hot run time: 38444 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4133	4148	4062	4062
q2	331	239	238	238
q3	3002	2830	2816	2816
q4	1901	1592	1552	1552
q5	5365	5375	5362	5362
q6	201	115	117	115
q7	2258	1887	1901	1887
q8	3201	3340	3325	3325
q9	8740	8708	8721	8708
q10	3849	3784	3821	3784
q11	550	442	446	442
q12	724	543	553	543
q13	16932	2850	2911	2850
q14	278	243	257	243
q15	491	455	460	455
q16	470	438	413	413
q17	1740	1483	1479	1479
q18	7436	7136	7002	7002
q19	1628	1516	1559	1516
q20	1900	1756	1721	1721
q21	4713	4680	4703	4680
q22	542	452	472	452
Total cold run time: 70385 ms
Total hot run time: 53645 ms
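The bot output does not label its columns. The sketch below checks one reading of them against the Round 1 rows: column 1 is the query, column 2 the cold run, columns 3 and 4 two hot runs, and column 5 the faster hot run. That reading is an assumption inferred from the numbers, not something the bot states; the helper `parse` is hypothetical.

```python
# Round 1 rows copied from the TPC-H comment above (times in ms).
ROUND_1 = """\
q1 17649 4405 4161 4161
q2 2118 158 158 158
q3 10600 1210 1227 1210
q4 10236 775 796 775
q5 7529 3103 3040 3040
q6 204 123 123 123
q7 1038 593 564 564
q8 9352 2055 2011 2011
q9 7125 6672 6624 6624
q10 8408 3497 3535 3497
q11 438 224 221 221
q12 374 209 193 193
q13 17788 2844 2861 2844
q14 249 211 207 207
q15 517 463 482 463
q16 492 377 378 377
q17 985 544 643 544
q18 7266 6545 6585 6545
q19 3683 1442 1414 1414
q20 554 261 252 252
q21 3680 2917 2929 2917
q22 356 314 304 304
"""


def parse(text):
    """Return a list of (query_name, [cold, hot1, hot2, best_hot]) tuples."""
    rows = []
    for line in text.splitlines():
        name, *numbers = line.split()
        rows.append((name, [int(n) for n in numbers]))
    return rows


rows = parse(ROUND_1)

# Assumed reading: the last column is the minimum of the two hot runs.
last_is_min_hot = all(n[3] == min(n[1], n[2]) for _, n in rows)

cold_total = sum(n[0] for _, n in rows)  # compare with "Total cold run time"
hot_total = sum(n[3] for _, n in rows)   # compare with "Total hot run time"
```

Under this reading the two sums match the totals the bot reports for Round 1, which supports the interpretation without proving it.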

@doris-robot

TPC-DS: Total hot run time: 181916 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 267cc5d157a167dd4f78fbbcc090906c7a46e385, data reload: false

query1	934	372	354	354
query2	6541	1919	1918	1918
query3	6705	211	219	211
query4	31708	21666	21284	21284
query5	4374	393	393	393
query6	275	178	166	166
query7	4634	293	290	290
query8	241	170	172	170
query9	9371	2310	2311	2310
query10	567	256	262	256
query11	15264	14211	14233	14211
query12	134	87	85	85
query13	1675	422	415	415
query14	10062	7947	7892	7892
query15	255	197	202	197
query16	8221	255	251	251
query17	2012	583	535	535
query18	2109	269	278	269
query19	357	150	154	150
query20	99	89	89	89
query21	204	124	125	124
query22	5039	4825	4819	4819
query23	33743	32846	32865	32846
query24	10860	2861	2887	2861
query25	578	364	389	364
query26	1180	158	160	158
query27	2964	343	354	343
query28	7602	1903	1852	1852
query29	882	626	603	603
query30	301	147	145	145
query31	972	712	770	712
query32	94	57	53	53
query33	762	254	238	238
query34	1041	502	499	499
query35	809	620	601	601
query36	1009	864	920	864
query37	114	61	64	61
query38	3503	3453	3397	3397
query39	1452	1444	1417	1417
query40	211	109	112	109
query41	54	44	44	44
query42	101	96	95	95
query43	468	440	447	440
query44	1115	713	723	713
query45	269	268	234	234
query46	1126	738	715	715
query47	1931	1830	1840	1830
query48	436	354	355	354
query49	1095	339	345	339
query50	777	372	374	372
query51	6724	6633	6555	6555
query52	116	86	90	86
query53	346	275	273	273
query54	283	250	233	233
query55	91	78	81	78
query56	239	216	219	216
query57	1205	1143	1154	1143
query58	242	210	209	209
query59	2788	2520	2682	2520
query60	267	249	247	247
query61	114	112	112	112
query62	687	449	458	449
query63	308	283	287	283
query64	5590	4161	4101	4101
query65	3135	3022	3052	3022
query66	1453	377	372	372
query67	15398	14881	14670	14670
query68	5323	514	514	514
query69	586	383	387	383
query70	1205	1208	1186	1186
query71	393	272	276	272
query72	6426	2870	2677	2677
query73	707	319	324	319
query74	7713	6388	6368	6368
query75	2937	2199	2242	2199
query76	3084	904	878	878
query77	604	265	273	265
query78	10978	10239	10182	10182
query79	9143	531	536	531
query80	2009	389	370	370
query81	552	213	216	213
query82	1619	90	84	84
query83	316	140	139	139
query84	288	80	78	78
query85	2122	324	315	315
query86	483	319	303	303
query87	3713	3520	3530	3520
query88	4883	2313	2293	2293
query89	555	369	375	369
query90	1923	183	177	177
query91	178	137	136	136
query92	61	48	48	48
query93	7023	507	485	485
query94	1256	175	176	175
query95	444	354	334	334
query96	601	272	272	272
query97	2669	2495	2531	2495
query98	229	217	204	204
query99	1205	915	936	915
Total cold run time: 306444 ms
Total hot run time: 181916 ms

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit 267cc5d157a167dd4f78fbbcc090906c7a46e385 with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       19.8 seconds inserted 10000000 Rows, about 505K ops/s
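The throughput figures can be reproduced from the byte counts and durations in the comment. The bot's exact formula is not shown; the sketch below assumes "MB" means MiB (1024 * 1024 bytes) and that the result is truncated rather than rounded, which is simply the combination that matches all three reported values. The function names are hypothetical.

```python
MIB = 1024 * 1024


def mib_per_second(total_bytes, seconds):
    """Throughput in whole MiB/s, truncated (assumed to mirror the bot's output)."""
    return total_bytes // seconds // MIB


def kilo_rows_per_second(rows, seconds):
    """Rows per second in thousands, rounded to the nearest integer."""
    return round(rows / seconds / 1000)


json_rate = mib_per_second(2358488459, 19)
orc_rate = mib_per_second(1101869774, 59)
parquet_rate = mib_per_second(861443392, 32)
insert_rate = kilo_rows_per_second(10_000_000, 19.8)
```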

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 28, 2024
@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@morningman morningman left a comment

LGTM

@Jibing-Li Jibing-Li merged commit cf1d4bb into apache:master Mar 28, 2024
Jibing-Li added a commit that referenced this pull request Mar 29, 2024
* [fix](merge cloud) Fix cloud be set be tag map (#32864)

* [chore] Add gavinchou to collaborators (#32881)

* [chore](show) support statement to show views from table (#32358)

MySQL [test]> show views;
+----------------+
| Tables_in_test |
+----------------+
| t1_view        |
| t2_view        |
+----------------+
2 rows in set (0.00 sec)

MySQL [test]> show views like '%t1%';
+----------------+
| Tables_in_test |
+----------------+
| t1_view        |
+----------------+
1 row in set (0.01 sec)

MySQL [test]> show views where create_time > '2024-03-18';
+----------------+
| Tables_in_test |
+----------------+
| t2_view        |
+----------------+
1 row in set (0.02 sec)

* [Enhancement](ranger) Disable some permission operations when Ranger or LDAP are enabled (#32538)

Disable some permission operations when Ranger or LDAP are enabled.

* [chore](ci) exclude unstable trino_connector case (#32892)

Co-authored-by: stephen <hello-stephen@qq.com>

* [fix](Nereids) NPE when create table with implicit index type (#32893)

* [improvement](mtmv) Support more join types for query rewriting by materialized view (#32685)

This rewriting pattern is supported for multi-table joins. The supported join types are as follows:

INNER JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI JOIN
RIGHT SEMI JOIN
LEFT ANTI JOIN
RIGHT ANTI JOIN

* [Serde](Variant) support arrow serialization for varint type (#32780)

* [fix](multicatalog) fix no data error when read hive table on cosn (#32815)

Currently, when reading a Hive table on COSN, Doris returns an empty result even though the table has data.
Iceberg on COSN works correctly.
The cause is misuse of COSN's file system: according to the COSN documentation, fs.cosn.impl should be org.apache.hadoop.fs.CosFileSystem.

* [fix](nereids)EliminateGroupByConstant should replace agg's output after removing constant group by keys (#32878)

* [Fix](executor)Fix regression test for test_active_queries/test_backend_active_tasks #32899

* [fix](iceberg) fix iceberg catalog bug and p2 test cases (#32898)

1. Fix iceberg catalog bug

    PR #30198 changed the logic of `IcebergHMSExternalCatalog.java`
    to get locationUrl by calling the Hive metastore's `getCatalog()` method.
    But this method exists only in Hive 3+, so it fails when using Hive 2.x.

    This logic is temporarily removed because it is used only for Iceberg table writing,
    which is still under development. We will rethink this logic later.

2. Fix test cases

    Some P2 test cases were missing `order_qt`. Also, because the output format of the floating-point
    type has changed, some results in the `out` files need to be regenerated.

* [revert](jni) revert part of #32455 (#32904)

* [fix](spill) Avoid releasing resources while spill tasks are executing (#32783)

* [chore](log) print query id before logging profile in be.INFO (#32922)

* [fix](grace-exit) Stop incorrectly of reportwork cause heap use after free #32929

* [improvement](decommission be) decommission check replica num (#32748)

* [fix](arrow-flight) Fix reach limit of connections error (#32911)

Fix the "Reach limit of connections" error.
In fe.conf, arrow_flight_token_cache_size must be less than qe_max_connection/2. Arrow Flight SQL is a stateless protocol, so connections are usually not actively disconnected; when a bearer token is evicted from the cache, its ConnectContext is unregistered.

Fix ConnectContext.command not being reset to COM_SLEEP in time, which resulted in connections being killed frequently after query timeout.

Fix the bearer token eviction log and exception.

TODO: use arrow flight session: https://mail.google.com/mail/u/0/#inbox/FMfcgzGxRdxBLQLTcvvtRpqsvmhrHpdH
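The fe.conf constraint described in this commit can be written as a simple check. This is only an illustration of the stated rule, not how the FE validates its configuration; the function name is hypothetical, while the two parameter names are the fe.conf keys mentioned in the commit message.

```python
def token_cache_size_is_valid(arrow_flight_token_cache_size, qe_max_connection):
    """Return True when the token cache size is strictly less than qe_max_connection / 2.

    Hypothetical helper: it only restates the rule from the commit message.
    """
    return arrow_flight_token_cache_size < qe_max_connection / 2


# Example values chosen for illustration only; they are not Doris defaults.
at_the_limit = token_cache_size_is_valid(512, 1024)     # 512 is not less than 512
below_the_limit = token_cache_size_is_valid(511, 1024)  # 511 is less than 512
```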

* [bugfix](cloud) few variable not initialized (#32868)

Some variables in ../../cloud/src/recycler/meta_checker.cpp were not initialized,
which can cause an uninitialized memory read.

* [fix](arrow-flight) Fix arrow flight sql compatible with JDK 17 and upgrade arrow 15.0.2 (#32796)

--add-opens=java.base/java.nio=ALL-UNNAMED, see: https://arrow.apache.org/docs/java/install.html#java-compatibility
When Groovy uses a Flight SQL connection to execute a query containing SUM(MAX(c1) OVER (PARTITION BY)), it reports the error "AGGREGATE clause must not contain analytic expressions", but executing the same query in Java with jdbc::arrow-flight-sql works without problems.
Groovy does not support printing the Arrow array type and throws IndexOutOfBoundsException.
"arrow_flight_sql" does not support two-phase read.
./run-regression-test.sh --run --clean -g arrow_flight_sql

* [fix](spill) SpillStream's writer maybe may not have been finalized (#32931)

* [improvement](spill) Disable DistinctStreamingAgg when spill is enabled (#32932)

* [Improve](inverted_index) update clucene and improve array inverted index writer  (#32436)

* [Performance](exec) replace SipHash in function by XXHash (#32919)

* [feature](agg) add aggregate function sum0 (#32541)

* [improvement](mtmv) Support to get tables in materialized view when collecting table in plan (#32797)

Support getting the tables in a materialized view when collecting the tables in a plan.

The table schema is as follows:

create materialized view mv1
BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 1 
PROPERTIES ('replication_num' = '1')
 as 
select 
  t1.c1, 
  t3.c2 
from 
  table1 t1 
  inner join table3 t3 on t1.c1 = t3.c2

If we collect the tables from the following plan, we get [table1, table3, table2]; mv1 is expanded to its base tables:

SELECT 
  mv1.*, 
  uuid() 
FROM 
  mv1 LEFT SEMI 
  JOIN table2 ON mv1.c1 = table2.c1 
WHERE 
  mv1.c1 IN (
    SELECT 
      c1 
    FROM 
      table2
  ) 
  OR mv1.c1 < 10
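The expansion described above can be sketched as a small recursive walk. This is not the Nereids implementation: the plan is reduced to a flat list of relation names in the order they appear in the query, and `MV_DEFINITIONS` and `collect_tables` are hypothetical names introduced for illustration.

```python
# Hypothetical catalog: each materialized view maps to the relations in its definition.
MV_DEFINITIONS = {"mv1": ["table1", "table3"]}


def collect_tables(plan_relations, mv_definitions):
    """Collect base tables in first-seen order, expanding materialized views."""
    collected = []

    def visit(name):
        if name in mv_definitions:
            # A materialized view is replaced by the tables it is built from.
            for base in mv_definitions[name]:
                visit(base)
        elif name not in collected:
            collected.append(name)

    for relation in plan_relations:
        visit(relation)
    return collected


# Relations referenced by the query above: mv1, then table2 (join and subquery).
query_relations = ["mv1", "table2", "table2"]
result = collect_tables(query_relations, MV_DEFINITIONS)
```

With these inputs `result` lists table1 and table3 (from mv1) before table2, matching the order given in the commit message.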

* [enhance](mtmv)support olap table partition column is null (#32698)

* [enhancement](cloud) add table version to cloud (#32738)

Add table version to cloud.

In Fe:
Get: If Fe is cloud mode, get table version from meta service.
Update: Op drop/replace temp partition, commit transaction.

In meta service:
Add: create Index. init value is 1.
Remove: by recycler.
Update: commit/drop partition rpc, commit txn rpc. Atomic++.
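The lifecycle listed above (initial value 1, atomic increment on update, removal by the recycler) can be sketched with an in-memory store. This is an assumption-laden illustration, not the meta service code: `TableVersionStore` and its methods are hypothetical, and a lock stands in for whatever atomicity the meta service's transactions provide.

```python
import threading


class TableVersionStore:
    """Hypothetical in-memory stand-in for the table version kept in the meta service."""

    def __init__(self):
        self._versions = {}
        self._lock = threading.Lock()

    def create(self, table_id):
        # "Add": the version starts at 1.
        with self._lock:
            self._versions[table_id] = 1

    def get(self, table_id):
        with self._lock:
            return self._versions[table_id]

    def bump(self, table_id):
        # "Update": increment under the lock so concurrent commits are not lost.
        with self._lock:
            self._versions[table_id] += 1
            return self._versions[table_id]

    def remove(self, table_id):
        # "Remove": done by the recycler in the real system.
        with self._lock:
            self._versions.pop(table_id, None)


store = TableVersionStore()
store.create(42)
initial_version = store.get(42)

# Eight concurrent writers, each committing 100 times.
workers = [
    threading.Thread(target=lambda: [store.bump(42) for _ in range(100)])
    for _ in range(8)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

final_version = store.get(42)
```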

* [fix](cloud) schema change from not null to null (#32913)

1. Use equals instead of == when comparing types.
2. The null bitmap is resized according to the size of the ref column.

* [feature](Nereids): add ColumnPruningPostProcessor. (#32800)

* [case](rowpolicy)fix row policy has been exist (#32880)

* [fix](pipeline) fix use error row desc when origin block clear (#32803)

* [fix](Nereids) support variant column with index when create table (#32948)

* [opt](Nereids) support create table with variant type (#32953)

* [test](insert-overwrite) Add insert overwrite auto detect concurrency cases (#32935)

* [fix](compile) fe cannot compile in idea (#32955)

* [enhancement](plsql) Support select * from routines (#32866)

Support showing PL/SQL procedures using select * from routines.

* [fix](trino-connector) fix `NoClassDefFoundError` of hudi `Utils` class (#32846)

Due to the changes in PR #32455, the `trino-connector-scanner` package cannot access the `hudi_scanner` package, so a `NoClassDefFoundError` exception occurs.

We need to write a separate Utils class.

* [exec](column) change some complex column move to noexcept (#32954)

* [Enhancement](data skew) extends show data skew (#32732)

* [chore](test) let suite compatible with Nereids (#32964)

* Support identical column name in different index. (#32792)

* Limit the max string length to 1024 while collecting column stats to control BE memory usage. (#32470)

* [fix](merge-iterator) fix NOT_IMPLEMENTED_ERROR when read next block view (#32961)

* [improvement](executor)Add tag property for workload group #32874

* [fix](auth)unified workload and resource permission logic (#32907)

- `Grant resource` can no longer grant global `usage_priv`
- Use `grant resource %` instead of `grant resource *`

Before change:
```
grant usage_priv on resource * to f;
show grants for f\G
*************************** 1. row ***************************
      UserIdentity: 'f'@'%'
           Comment: 
          Password: No
             Roles: 
       GlobalPrivs: Usage_priv 
      CatalogPrivs: NULL
     DatabasePrivs: internal.information_schema: Select_priv ; internal.mysql: Select_priv 
        TablePrivs: NULL
          ColPrivs: NULL
     ResourcePrivs: NULL
 CloudClusterPrivs: NULL
WorkloadGroupPrivs: normal: Usage_priv 
```
After change:
```
grant usage_priv on resource '%' to f;
show grants for f\G
*************************** 1. row ***************************
      UserIdentity: 'f'@'%'
           Comment: 
          Password: No
             Roles: 
       GlobalPrivs: NULL
      CatalogPrivs: NULL
     DatabasePrivs: internal.information_schema: Select_priv ; internal.mysql: Select_priv 
        TablePrivs: NULL
          ColPrivs: NULL
     ResourcePrivs: %: Usage_priv 
 CloudClusterPrivs: NULL
WorkloadGroupPrivs: normal: Usage_priv 

```

---------

Co-authored-by: yujun <yu.jun.reach@gmail.com>
Co-authored-by: Gavin Chou <gavineaglechou@gmail.com>
Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>
Co-authored-by: yongjinhou <109586248+yongjinhou@users.noreply.github.com>
Co-authored-by: Dongyang Li <hello_stephen@qq.com>
Co-authored-by: stephen <hello-stephen@qq.com>
Co-authored-by: morrySnow <101034200+morrySnow@users.noreply.github.com>
Co-authored-by: seawinde <149132972+seawinde@users.noreply.github.com>
Co-authored-by: lihangyu <15605149486@163.com>
Co-authored-by: Yulei-Yang <yulei.yang0699@gmail.com>
Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com>
Co-authored-by: wangbo <wangbo@apache.org>
Co-authored-by: Mingyu Chen <morningman@163.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
Co-authored-by: zhiqiang <seuhezhiqiang@163.com>
Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com>
Co-authored-by: Vallish Pai <vallishpai@gmail.com>
Co-authored-by: amory <wangqiannan@selectdb.com>
Co-authored-by: HappenLee <happenlee@hotmail.com>
Co-authored-by: Jensen <czjourney@163.com>
Co-authored-by: zhangdong <493738387@qq.com>
Co-authored-by: Yongqiang YANG <98214048+dataroaring@users.noreply.github.com>
Co-authored-by: jakevin <jakevingoo@gmail.com>
Co-authored-by: Mryange <59914473+Mryange@users.noreply.github.com>
Co-authored-by: zclllyybb <zhaochangle@selectdb.com>
Co-authored-by: Tiewei Fang <43782773+BePPPower@users.noreply.github.com>
Co-authored-by: Xin Liao <liaoxinbit@126.com>
@Jibing-Li Jibing-Li deleted the substring branch April 1, 2024 02:25
Jibing-Li added a commit to Jibing-Li/incubator-doris that referenced this pull request Apr 1, 2024
Jibing-Li added a commit that referenced this pull request Apr 1, 2024
yiguolei pushed a commit that referenced this pull request Apr 1, 2024
yiguolei pushed a commit that referenced this pull request Apr 10, 2024
@yiguolei yiguolei mentioned this pull request Apr 26, 2024
1 task
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024

Labels

approved (Indicates a PR has been approved by one committer.), dev/2.0.8-merged, reviewed

6 participants