Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](hist) Fix unstable result of aggregrate function hist #38608

Merged
merged 11 commits into from
Aug 5, 2024

Conversation

zhiqiang-hhhh
Copy link
Contributor

@zhiqiang-hhhh zhiqiang-hhhh commented Jul 31, 2024

  • Target

Fix unstable result of hist function when involving null value.

  • Reproduce

test result of regression-test/suites/query_p0/sql_functions/aggregate_functions/test_aggregate_all_functions2.groovy is unstable, sql SELECT histogram(k7, 5) FROM baseall will sometimes acts like the second argument is not passed in.

  • Root reason

We have short-circuit in AggregateFunctionNullVariadicInline, when this row is NULL, the value will not be added by the nested function. Implementation of histogram relies on its add method to get its seconds argument, when we have an all null value block, histogram will not get its seconds arg even if sql is like select(k7, 5), so a max_bucket_num with default value 128 is serialized. When we do merging, and happens to deserialize the above block at last, the max_bucket_num in merge stage will be assigned to 128, and this leads to the wrong result.

  • Fix by

Init value of max_bucket_num is assigned to 0, when we do merging, we will discard this aggregated data if its max_bucket_num is 0.

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh marked this pull request as ready for review July 31, 2024 09:37
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TPC-H: Total hot run time: 42245 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2c528ed1b4724b568fe0d463290a1a65a158a086, data reload: false

------ Round 1 ----------------------------------
q1	17909	4130	4101	4101
q2	2014	203	206	203
q3	10468	1368	1357	1357
q4	11542	841	923	841
q5	7652	2977	2989	2977
q6	227	147	142	142
q7	1046	610	614	610
q8	9425	1854	1940	1854
q9	8696	6664	6687	6664
q10	8757	3877	3863	3863
q11	428	256	250	250
q12	423	236	231	231
q13	17757	3038	3106	3038
q14	283	255	260	255
q15	539	488	492	488
q16	532	411	415	411
q17	1009	940	937	937
q18	9714	7884	7436	7436
q19	2206	1238	1212	1212
q20	574	335	351	335
q21	5249	4840	4759	4759
q22	346	281	290	281
Total cold run time: 116796 ms
Total hot run time: 42245 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4050	4038	4017	4017
q2	323	219	218	218
q3	3113	3166	3190	3166
q4	2013	2016	1938	1938
q5	5511	5381	5467	5381
q6	227	133	136	133
q7	2168	1817	1819	1817
q8	3328	3379	3334	3334
q9	8645	8885	8684	8684
q10	3862	3941	4011	3941
q11	556	494	468	468
q12	754	591	570	570
q13	16421	3090	3134	3090
q14	300	280	291	280
q15	522	504	489	489
q16	472	413	438	413
q17	1733	1699	1723	1699
q18	8342	7792	7734	7734
q19	5646	1754	1717	1717
q20	2152	1857	1850	1850
q21	5669	5231	5376	5231
q22	560	483	484	483
Total cold run time: 76367 ms
Total hot run time: 56653 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 169461 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2c528ed1b4724b568fe0d463290a1a65a158a086, data reload: false

query1	913	375	370	370
query2	6404	1626	1677	1626
query3	6642	232	222	222
query4	19696	17507	17370	17370
query5	3618	512	521	512
query6	276	174	171	171
query7	4608	311	296	296
query8	273	197	202	197
query9	8499	2401	2364	2364
query10	427	273	266	266
query11	10620	9953	10140	9953
query12	118	89	84	84
query13	1621	370	371	370
query14	8614	8513	7332	7332
query15	226	165	166	165
query16	6946	468	461	461
query17	950	574	545	545
query18	1901	281	278	278
query19	193	147	168	147
query20	90	85	87	85
query21	204	104	101	101
query22	4111	4089	3893	3893
query23	33744	33560	33336	33336
query24	9237	3158	3114	3114
query25	678	433	401	401
query26	1370	156	151	151
query27	3109	288	296	288
query28	7642	2033	2022	2022
query29	999	439	450	439
query30	244	160	158	158
query31	970	754	766	754
query32	103	56	57	56
query33	684	308	324	308
query34	940	504	500	500
query35	921	755	788	755
query36	1065	880	893	880
query37	216	81	85	81
query38	2994	2905	2740	2740
query39	879	836	818	818
query40	252	113	112	112
query41	47	43	44	43
query42	122	98	97	97
query43	454	422	428	422
query44	1202	733	726	726
query45	214	181	179	179
query46	1099	803	779	779
query47	1836	1732	1721	1721
query48	360	284	287	284
query49	854	435	416	416
query50	910	437	435	435
query51	6767	6696	6650	6650
query52	102	86	90	86
query53	259	177	181	177
query54	593	446	442	442
query55	77	72	75	72
query56	275	250	250	250
query57	1157	1021	1053	1021
query58	277	255	274	255
query59	2433	2436	2239	2239
query60	297	269	275	269
query61	101	94	96	94
query62	862	660	678	660
query63	212	180	187	180
query64	4596	1910	1870	1870
query65	3146	3121	3080	3080
query66	1000	341	394	341
query67	15197	14728	14671	14671
query68	4293	553	575	553
query69	458	296	309	296
query70	1106	1035	1088	1035
query71	391	278	276	276
query72	7148	2679	2574	2574
query73	765	336	332	332
query74	6000	5620	5723	5620
query75	3348	2725	2719	2719
query76	2075	1383	1417	1383
query77	425	310	297	297
query78	9352	8880	8834	8834
query79	1353	544	533	533
query80	975	511	518	511
query81	581	223	223	223
query82	1040	131	125	125
query83	239	174	176	174
query84	264	78	78	78
query85	1107	313	307	307
query86	410	315	307	307
query87	3247	3102	3090	3090
query88	2975	2403	2424	2403
query89	381	297	285	285
query90	1680	190	191	190
query91	127	98	104	98
query92	58	48	50	48
query93	1379	630	616	616
query94	740	297	306	297
query95	379	275	262	262
query96	598	287	278	278
query97	3265	3095	3061	3061
query98	216	198	198	198
query99	1614	1278	1307	1278
Total cold run time: 255435 ms
Total hot run time: 169461 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 29.8 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2c528ed1b4724b568fe0d463290a1a65a158a086, data reload: false

query1	0.04	0.04	0.03
query2	0.07	0.03	0.04
query3	0.22	0.05	0.05
query4	1.68	0.07	0.07
query5	0.48	0.49	0.49
query6	1.15	0.71	0.71
query7	0.01	0.01	0.01
query8	0.05	0.05	0.04
query9	0.59	0.52	0.50
query10	0.56	0.57	0.55
query11	0.16	0.11	0.12
query12	0.15	0.13	0.13
query13	0.62	0.60	0.60
query14	0.77	0.78	0.80
query15	0.91	0.88	0.88
query16	0.36	0.36	0.36
query17	1.00	1.00	1.06
query18	0.22	0.20	0.20
query19	1.80	1.75	1.78
query20	0.01	0.00	0.01
query21	15.40	0.78	0.65
query22	3.97	9.01	1.12
query23	17.88	1.37	1.27
query24	2.30	0.22	0.22
query25	0.19	0.08	0.07
query26	0.33	0.22	0.21
query27	0.46	0.24	0.23
query28	13.16	0.99	0.98
query29	12.62	3.32	3.29
query30	0.26	0.05	0.05
query31	2.88	0.41	0.41
query32	3.24	0.48	0.48
query33	2.95	3.00	2.93
query34	15.45	4.28	4.27
query35	4.29	4.26	4.26
query36	0.66	0.49	0.48
query37	0.18	0.16	0.16
query38	0.17	0.14	0.15
query39	0.04	0.03	0.04
query40	0.16	0.13	0.14
query41	0.10	0.04	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 107.65 s
Total hot run time: 29.8 s

@zhangstar333
Copy link
Contributor

and this change need consider upgrading?

@zhiqiang-hhhh
Copy link
Contributor Author

and this change need consider upgrading?

This pr only has impact on all-null block(in other cases, add method of hist data will be called), and in this situation, the original result could be incorrect, so we only need to avoid crash when we do upgrading.

We have two situations here:

  1. new be sink all-null block, old be merge it
  2. old be sink all-null block, new be merge it

In first case, new be will serialize block to a ColumnString, and its max_input_block is -1, old be will do deserialize and merge in its aggregate sink and do get result in its aggregate source. On old be, its merge method actually can not handle the negative max_input_block correctly, and if it happens to be the last block to be merged, we will have trouble ... So it seems we do have upgrading problem in this situation.

In the second case, old be will generate a serialized String with max_input_block is 128(default value on old BE), merge method on new be will merge this data, so the result will still be incorrect but it seems no other impact like be crash in this situation.

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

@zhiqiang-hhhh
Copy link
Contributor Author

run buildall

@github-actions github-actions bot added the doing label Aug 1, 2024
Copy link
Contributor

github-actions bot commented Aug 1, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TPC-H: Total hot run time: 41742 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9d6edeb050f6fbe8655d1630dada398cfff5eb60, data reload: false

------ Round 1 ----------------------------------
q1	17682	4193	4070	4070
q2	2036	200	203	200
q3	10578	1302	1392	1302
q4	10318	817	917	817
q5	7674	3084	3005	3005
q6	222	138	135	135
q7	1041	610	609	609
q8	9436	1882	1977	1882
q9	8506	6635	6584	6584
q10	8718	3858	3861	3858
q11	433	239	254	239
q12	411	224	223	223
q13	17775	2974	2945	2945
q14	265	244	245	244
q15	519	484	481	481
q16	528	392	385	385
q17	979	950	902	902
q18	8090	7351	7303	7303
q19	1391	1227	1220	1220
q20	543	325	346	325
q21	5396	4832	4731	4731
q22	358	282	284	282
Total cold run time: 112899 ms
Total hot run time: 41742 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4059	4007	4014	4007
q2	333	231	221	221
q3	2999	2992	3168	2992
q4	2044	2073	1987	1987
q5	5610	5465	5442	5442
q6	219	129	127	127
q7	2125	1753	1823	1753
q8	3323	3384	3355	3355
q9	8701	8697	8781	8697
q10	3985	4087	3995	3995
q11	566	467	463	463
q12	764	598	616	598
q13	14490	3092	3101	3092
q14	303	267	280	267
q15	524	510	483	483
q16	469	417	411	411
q17	1795	1764	1732	1732
q18	8267	7947	7664	7664
q19	1753	1714	1731	1714
q20	2057	1869	1850	1850
q21	5813	5364	5400	5364
q22	526	478	473	473
Total cold run time: 70725 ms
Total hot run time: 56687 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 170140 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9d6edeb050f6fbe8655d1630dada398cfff5eb60, data reload: false

query1	913	389	365	365
query2	6473	1756	1728	1728
query3	6650	219	235	219
query4	20232	17487	17562	17487
query5	3627	510	513	510
query6	272	170	181	170
query7	4591	300	289	289
query8	252	208	201	201
query9	8507	2375	2369	2369
query10	456	289	268	268
query11	10565	10224	10192	10192
query12	125	90	87	87
query13	1629	370	360	360
query14	9141	7460	7246	7246
query15	198	169	164	164
query16	6963	464	418	418
query17	950	546	541	541
query18	1825	279	283	279
query19	199	145	170	145
query20	96	84	89	84
query21	205	112	107	107
query22	4162	3922	3960	3922
query23	33983	33680	33382	33382
query24	10337	3097	3066	3066
query25	716	429	419	419
query26	1641	148	151	148
query27	2966	291	294	291
query28	7503	2020	1994	1994
query29	1261	440	443	440
query30	237	156	156	156
query31	923	749	790	749
query32	104	61	62	61
query33	672	336	325	325
query34	926	505	543	505
query35	887	750	774	750
query36	1046	900	873	873
query37	291	85	85	85
query38	2975	2770	2749	2749
query39	855	801	817	801
query40	244	114	111	111
query41	45	44	44	44
query42	125	99	101	99
query43	471	423	425	423
query44	1165	737	727	727
query45	209	184	176	176
query46	1093	815	798	798
query47	1797	1749	1717	1717
query48	369	299	292	292
query49	927	418	434	418
query50	904	423	435	423
query51	6852	6696	6765	6696
query52	98	91	97	91
query53	265	183	182	182
query54	641	463	452	452
query55	78	79	74	74
query56	276	255	249	249
query57	1127	1071	1050	1050
query58	274	285	283	283
query59	2651	2487	2530	2487
query60	294	277	294	277
query61	97	97	94	94
query62	884	639	659	639
query63	213	183	178	178
query64	5621	1917	1877	1877
query65	3152	3111	3092	3092
query66	1291	327	332	327
query67	15360	14868	14797	14797
query68	4218	583	573	573
query69	472	313	308	308
query70	1094	1063	1028	1028
query71	428	287	291	287
query72	7171	2664	2479	2479
query73	769	332	333	332
query74	6101	5754	5634	5634
query75	3379	2755	2752	2752
query76	2218	1261	1264	1261
query77	410	303	318	303
query78	9516	8980	8857	8857
query79	2418	542	535	535
query80	1144	536	505	505
query81	565	227	229	227
query82	1051	135	134	134
query83	236	169	173	169
query84	263	78	78	78
query85	1268	306	290	290
query86	445	321	306	306
query87	3312	3110	3086	3086
query88	3119	2386	2424	2386
query89	386	285	298	285
query90	1737	188	194	188
query91	129	100	107	100
query92	64	50	53	50
query93	1961	613	626	613
query94	852	298	310	298
query95	438	279	273	273
query96	613	276	277	276
query97	3194	3056	3041	3041
query98	218	197	198	197
query99	1606	1271	1302	1271
Total cold run time: 262672 ms
Total hot run time: 170140 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 29.72 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9d6edeb050f6fbe8655d1630dada398cfff5eb60, data reload: false

query1	0.04	0.04	0.04
query2	0.07	0.04	0.04
query3	0.22	0.04	0.05
query4	1.68	0.07	0.07
query5	0.50	0.48	0.49
query6	1.13	0.72	0.71
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.57	0.52	0.52
query10	0.57	0.56	0.58
query11	0.15	0.11	0.11
query12	0.14	0.12	0.12
query13	0.61	0.60	0.60
query14	0.78	0.80	0.80
query15	0.90	0.88	0.86
query16	0.36	0.35	0.35
query17	1.04	1.02	0.98
query18	0.22	0.21	0.21
query19	1.87	1.76	1.72
query20	0.01	0.00	0.00
query21	15.39	0.82	0.66
query22	4.21	8.14	1.04
query23	17.85	1.35	1.23
query24	2.26	0.22	0.21
query25	0.18	0.09	0.07
query26	0.32	0.21	0.22
query27	0.46	0.23	0.22
query28	13.16	1.00	0.97
query29	12.53	3.34	3.35
query30	0.25	0.06	0.05
query31	2.87	0.41	0.42
query32	3.23	0.48	0.49
query33	2.94	3.06	2.98
query34	15.46	4.23	4.27
query35	4.29	4.29	4.31
query36	0.67	0.47	0.47
query37	0.19	0.16	0.17
query38	0.16	0.15	0.16
query39	0.04	0.04	0.03
query40	0.16	0.14	0.12
query41	0.10	0.05	0.05
query42	0.05	0.05	0.05
query43	0.05	0.04	0.05
Total cold run time: 107.75 s
Total hot run time: 29.72 s

@zhiqiang-hhhh
Copy link
Contributor Author

run external

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 5, 2024
Copy link
Contributor

github-actions bot commented Aug 5, 2024

PR approved by at least one committer and no changes requested.

Copy link
Contributor

github-actions bot commented Aug 5, 2024

PR approved by anyone and no changes requested.

Copy link
Contributor

@HappenLee HappenLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HappenLee HappenLee merged commit 76c984a into apache:master Aug 5, 2024
29 of 31 checks passed
zhiqiang-hhhh added a commit to zhiqiang-hhhh/doris that referenced this pull request Aug 5, 2024
…8608)

* Target

Fix unstable result of hist function when involving null value.

* Reproduce

test result of
`regression-test/suites/query_p0/sql_functions/aggregate_functions/test_aggregate_all_functions2.groovy`
is unstable, sql `SELECT histogram(k7, 5) FROM baseall` will sometimes
acts like the second argument is not passed in.

* Root reason

We have short-circuit in AggregateFunctionNullVariadicInline, when this
row is NULL, the value will not be added by the nested function.
Implementation of histogram relies on its add method to get its seconds
argument, when we have an all null value block, histogram will not get
its seconds arg even if sql is like `select(k7, 5)`, so a max_bucket_num
with default value 128 is serialized. When we do merging, and happens to
deserialize the above block at last, the max_bucket_num in merge stage
will be assigned to 128, and this leads to the wrong result.

* Fix by

Init value of max_bucket_num is assigned to 0, when we do merging, we
will discard this aggregated data if its max_bucket_num is 0.
@zhiqiang-hhhh zhiqiang-hhhh deleted the fix-hist branch August 5, 2024 09:19
zhiqiang-hhhh added a commit to zhiqiang-hhhh/doris that referenced this pull request Aug 5, 2024
…8608)

* Target

Fix unstable result of hist function when involving null value.

* Reproduce

test result of
`regression-test/suites/query_p0/sql_functions/aggregate_functions/test_aggregate_all_functions2.groovy`
is unstable, sql `SELECT histogram(k7, 5) FROM baseall` will sometimes
acts like the second argument is not passed in.

* Root reason

We have short-circuit in AggregateFunctionNullVariadicInline, when this
row is NULL, the value will not be added by the nested function.
Implementation of histogram relies on its add method to get its seconds
argument, when we have an all null value block, histogram will not get
its seconds arg even if sql is like `select(k7, 5)`, so a max_bucket_num
with default value 128 is serialized. When we do merging, and happens to
deserialize the above block at last, the max_bucket_num in merge stage
will be assigned to 128, and this leads to the wrong result.

* Fix by

Init value of max_bucket_num is assigned to 0, when we do merging, we
will discard this aggregated data if its max_bucket_num is 0.
dataroaring pushed a commit that referenced this pull request Aug 5, 2024
* Target

Fix unstable result of hist function when involving null value.

* Reproduce

test result of
`regression-test/suites/query_p0/sql_functions/aggregate_functions/test_aggregate_all_functions2.groovy`
is unstable, sql `SELECT histogram(k7, 5) FROM baseall` will sometimes
acts like the second argument is not passed in.

* Root reason

We have short-circuit in AggregateFunctionNullVariadicInline, when this
row is NULL, the value will not be added by the nested function.
Implementation of histogram relies on its add method to get its seconds
argument, when we have an all null value block, histogram will not get
its seconds arg even if sql is like `select(k7, 5)`, so a max_bucket_num
with default value 128 is serialized. When we do merging, and happens to
deserialize the above block at last, the max_bucket_num in merge stage
will be assigned to 128, and this leads to the wrong result.

* Fix by

Init value of max_bucket_num is assigned to 0, when we do merging, we
will discard this aggregated data if its max_bucket_num is 0.
yiguolei pushed a commit that referenced this pull request Aug 6, 2024
@yiguolei yiguolei mentioned this pull request Sep 5, 2024
3 tasks
yiguolei pushed a commit that referenced this pull request Sep 11, 2024
nullable property of histogram on master is changed by
#37330
pick it to branch-2.1
related change on 2.1: #38608,
#38608 relies on AlwaysNotNullable property.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.x dev/2.1.6-merged dev/3.0.1-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants