@bobhan1 bobhan1 commented Feb 26, 2025

What problem does this PR solve?

Consider the following situation:

  1. A heavy schema change (SC) begins.
  2. The alter task on tablet X (converting it to tablet Y) is sent to be1.
  3. be1 shuts down for some reason.
  4. New loads on the new tablet Y are routed to be2, which skips calculating delete bitmaps in the commit phase and the publish phase because the tablet's state is NOT_READY.
  5. be1 restarts and resumes the alter task.
  6. The alter task on be1 finishes and changes the tablet's state to RUNNING in MS.
  7. Some loads on tablet Y on be2 still skip calculating the delete bitmap because be2 doesn't know the tablet's state has changed, which causes a duplicate key problem.

Like #37670, this PR lets the meta service return the tablet states along with the getDeleteBitmapUpdateLockResponse to FE. FE then sends them to BE so that BE knows whether it should call sync_rowsets() because of a tablet state change made on another BE.
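The decision described above can be modeled roughly as follows. This is only a minimal Python sketch of the idea, not the actual BE code: `CachedTablet`, `should_calc_delete_bitmap`, and the simplified `sync_rowsets` method are hypothetical names introduced for illustration, and the real logic lives in the C++ BE and the Java FE.

```python
from dataclasses import dataclass
from enum import Enum


class TabletState(Enum):
    NOT_READY = "NOT_READY"
    RUNNING = "RUNNING"


@dataclass
class CachedTablet:
    """Hypothetical stand-in for a BE's locally cached view of one tablet."""
    tablet_id: int
    state: TabletState
    sync_count: int = 0

    def sync_rowsets(self, state_in_ms: TabletState) -> None:
        # Simplified model of a sync: refresh the local view from the meta service.
        self.state = state_in_ms
        self.sync_count += 1


def should_calc_delete_bitmap(tablet: CachedTablet, state_in_ms: TabletState) -> bool:
    """Return True if a load on this tablet must calculate delete bitmaps.

    `state_in_ms` is the state that the meta service reported to FE and that
    FE forwarded to this BE (assumed to arrive with the publish request).
    """
    if tablet.state != state_in_ms:
        # Another BE changed the tablet's state, so the cached view is stale.
        tablet.sync_rowsets(state_in_ms)
    # Only a tablet that is still NOT_READY may skip the calculation.
    return tablet.state == TabletState.RUNNING
```

In this model, be2 in step 7 would receive RUNNING from FE while its cache still says NOT_READY, so it syncs first and no longer skips the delete bitmap calculation.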

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas commented Feb 26, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the fix-tablet-state-change-in-publish branch 2 times, most recently from ca25fc3 to 7ba25fc Compare February 27, 2025 08:37
@bobhan1 bobhan1 force-pushed the fix-tablet-state-change-in-publish branch from 7ba25fc to 499e9f3 Compare February 27, 2025 08:54
@bobhan1 bobhan1 marked this pull request as ready for review February 27, 2025 09:08
bobhan1 commented Feb 27, 2025

run buildall

bobhan1 commented Feb 27, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32013 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2557ed6b1b6c79650393690cf861faa7348e2a1b, data reload: false

------ Round 1 ----------------------------------
q1	17646	5571	5106	5106
q2	2053	310	175	175
q3	10389	1344	721	721
q4	10227	1031	559	559
q5	7559	2659	2358	2358
q6	192	165	132	132
q7	951	748	641	641
q8	9317	1358	1183	1183
q9	5002	4715	4818	4715
q10	6875	2323	1905	1905
q11	497	289	271	271
q12	351	358	218	218
q13	17759	3749	3114	3114
q14	233	239	211	211
q15	519	472	454	454
q16	657	629	597	597
q17	593	886	367	367
q18	6714	6259	6200	6200
q19	1922	966	588	588
q20	325	336	193	193
q21	2918	2394	1986	1986
q22	378	341	319	319
Total cold run time: 103077 ms
Total hot run time: 32013 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5247	5177	5179	5177
q2	241	337	231	231
q3	2204	2686	2315	2315
q4	1473	1958	1413	1413
q5	4264	4151	4210	4151
q6	217	164	124	124
q7	1870	1956	1817	1817
q8	2620	2615	2680	2615
q9	7307	7173	7089	7089
q10	3048	3241	2831	2831
q11	600	521	507	507
q12	694	765	617	617
q13	3571	3869	3212	3212
q14	270	314	276	276
q15	505	488	470	470
q16	647	688	657	657
q17	1191	1633	1343	1343
q18	7596	7550	7204	7204
q19	846	881	1030	881
q20	2003	2039	1936	1936
q21	5579	5034	5028	5028
q22	657	574	583	574
Total cold run time: 52650 ms
Total hot run time: 50468 ms

@doris-robot

TPC-DS: Total hot run time: 184351 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2557ed6b1b6c79650393690cf861faa7348e2a1b, data reload: false

query1	1012	389	381	381
query2	6562	1968	1897	1897
query3	6806	211	211	211
query4	26285	23273	23223	23223
query5	4367	671	504	504
query6	297	182	173	173
query7	4598	501	304	304
query8	287	240	237	237
query9	8599	2577	2581	2577
query10	479	330	256	256
query11	15387	15335	15010	15010
query12	164	112	107	107
query13	1673	530	388	388
query14	9431	6474	6254	6254
query15	208	196	178	178
query16	7171	662	478	478
query17	1190	715	557	557
query18	1966	408	314	314
query19	205	192	157	157
query20	128	120	118	118
query21	207	124	103	103
query22	4170	4295	4015	4015
query23	33849	33102	32986	32986
query24	7740	2409	2375	2375
query25	561	494	409	409
query26	1249	277	157	157
query27	2103	502	327	327
query28	3908	2450	2402	2402
query29	750	580	456	456
query30	238	242	157	157
query31	910	856	769	769
query32	78	66	67	66
query33	563	358	306	306
query34	787	855	502	502
query35	783	820	731	731
query36	966	973	902	902
query37	116	97	80	80
query38	4072	4201	4173	4173
query39	1466	1392	1420	1392
query40	259	120	104	104
query41	55	52	50	50
query42	123	105	104	104
query43	505	509	497	497
query44	1250	783	770	770
query45	178	168	163	163
query46	864	1047	633	633
query47	1768	1788	1705	1705
query48	374	414	304	304
query49	783	531	419	419
query50	693	748	418	418
query51	4236	4207	4153	4153
query52	106	104	98	98
query53	233	256	186	186
query54	526	483	399	399
query55	89	88	87	87
query56	256	287	243	243
query57	1128	1132	1073	1073
query58	251	245	241	241
query59	2713	2727	2743	2727
query60	271	270	264	264
query61	125	117	121	117
query62	810	713	658	658
query63	242	188	184	184
query64	4248	1019	709	709
query65	3273	3157	3159	3157
query66	1127	412	311	311
query67	15732	15704	15446	15446
query68	8286	883	504	504
query69	474	307	261	261
query70	1201	1116	1134	1116
query71	450	306	264	264
query72	5513	3541	3630	3541
query73	796	714	354	354
query74	9104	9137	8710	8710
query75	3758	3217	2705	2705
query76	3614	1169	753	753
query77	797	368	295	295
query78	9822	10143	9350	9350
query79	1509	831	597	597
query80	615	529	465	465
query81	491	280	244	244
query82	306	127	96	96
query83	172	176	160	160
query84	232	93	81	81
query85	786	373	313	313
query86	337	313	276	276
query87	4485	4402	4328	4328
query88	2896	2244	2235	2235
query89	395	321	284	284
query90	2064	212	199	199
query91	140	139	112	112
query92	77	66	64	64
query93	1180	1082	585	585
query94	678	419	307	307
query95	365	277	269	269
query96	492	561	269	269
query97	3258	3389	3284	3284
query98	216	201	203	201
query99	1380	1412	1277	1277
Total cold run time: 269438 ms
Total hot run time: 184351 ms

@doris-robot

ClickBench: Total hot run time: 31.27 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2557ed6b1b6c79650393690cf861faa7348e2a1b, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.03	0.03
query3	0.23	0.08	0.06
query4	1.62	0.11	0.11
query5	0.56	0.56	0.56
query6	1.17	0.73	0.73
query7	0.02	0.02	0.02
query8	0.04	0.02	0.04
query9	0.58	0.54	0.52
query10	0.57	0.57	0.56
query11	0.15	0.11	0.10
query12	0.15	0.11	0.11
query13	0.64	0.60	0.60
query14	2.78	2.81	2.79
query15	0.93	0.87	0.86
query16	0.38	0.37	0.37
query17	1.02	1.02	1.04
query18	0.21	0.19	0.19
query19	1.95	1.91	1.79
query20	0.02	0.01	0.01
query21	15.35	0.93	0.56
query22	0.75	1.16	0.63
query23	15.02	1.42	0.67
query24	6.64	1.38	1.07
query25	0.48	0.26	0.14
query26	0.62	0.16	0.13
query27	0.05	0.05	0.04
query28	10.21	0.89	0.43
query29	12.63	4.02	3.28
query30	0.26	0.09	0.08
query31	2.81	0.60	0.38
query32	3.22	0.56	0.47
query33	3.07	3.05	3.04
query34	15.63	5.19	4.56
query35	4.54	4.53	4.60
query36	0.68	0.51	0.48
query37	0.10	0.07	0.06
query38	0.05	0.04	0.03
query39	0.03	0.02	0.03
query40	0.17	0.13	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.02	0.03
Total cold run time: 105.6 s
Total hot run time: 31.27 s

@bobhan1 bobhan1 force-pushed the fix-tablet-state-change-in-publish branch from 7a66937 to b0dce74 Compare February 27, 2025 11:17
bobhan1 commented Feb 27, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.28% (1063/1292)
Line Coverage: 65.79% (17622/26785)
Region Coverage: 65.30% (8684/13298)
Branch Coverage: 55.26% (4686/8480)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b0dce742760abb6c768674a541d1169f7e0d3029_b0dce742760abb6c768674a541d1169f7e0d3029_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 31554 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b0dce742760abb6c768674a541d1169f7e0d3029, data reload: false

------ Round 1 ----------------------------------
q1	17667	5346	5078	5078
q2	2056	299	185	185
q3	10892	1238	740	740
q4	10442	1016	527	527
q5	9283	2414	2320	2320
q6	202	172	139	139
q7	918	769	605	605
q8	9318	1281	1083	1083
q9	4976	4745	4712	4712
q10	6817	2321	1889	1889
q11	478	274	252	252
q12	344	356	220	220
q13	17775	3696	3048	3048
q14	231	233	202	202
q15	504	462	441	441
q16	624	621	607	607
q17	590	847	353	353
q18	6740	6284	6178	6178
q19	1462	938	528	528
q20	318	324	191	191
q21	2736	2155	1955	1955
q22	367	328	301	301
Total cold run time: 104740 ms
Total hot run time: 31554 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5225	5136	5108	5108
q2	243	333	229	229
q3	2113	2672	2308	2308
q4	1443	1827	1390	1390
q5	4243	4139	4131	4131
q6	207	171	126	126
q7	1879	1931	1781	1781
q8	2619	2745	2660	2660
q9	7295	7249	7211	7211
q10	2990	3234	2772	2772
q11	573	525	492	492
q12	717	777	586	586
q13	3351	4005	3281	3281
q14	281	307	268	268
q15	499	468	479	468
q16	656	694	642	642
q17	1135	1566	1362	1362
q18	7521	7374	7313	7313
q19	799	841	1013	841
q20	1947	2035	1892	1892
q21	5542	4945	4728	4728
q22	668	578	535	535
Total cold run time: 51946 ms
Total hot run time: 50124 ms

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Mar 4, 2025
github-actions bot commented Mar 4, 2025

PR approved by anyone and no changes requested.

bobhan1 commented Mar 4, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.15% (1063/1294)
Line Coverage: 65.69% (17628/26834)
Region Coverage: 65.14% (8687/13335)
Branch Coverage: 55.15% (4691/8506)
Coverage Report: http://coverage.selectdb-in.cc/coverage/303bc8f1233c3d99f347b6f8d558b4178abe2f5b_303bc8f1233c3d99f347b6f8d558b4178abe2f5b_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 32038 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 303bc8f1233c3d99f347b6f8d558b4178abe2f5b, data reload: false

------ Round 1 ----------------------------------
q1	17607	5352	5201	5201
q2	2057	317	187	187
q3	10371	1337	780	780
q4	10208	1113	552	552
q5	7543	2473	2358	2358
q6	208	171	143	143
q7	947	784	617	617
q8	9313	1482	1199	1199
q9	5016	4691	4775	4691
q10	6864	2335	1924	1924
q11	475	284	260	260
q12	349	370	228	228
q13	17774	3726	3059	3059
q14	219	220	213	213
q15	543	456	447	447
q16	626	603	578	578
q17	612	927	351	351
q18	6888	6157	6202	6157
q19	1653	989	583	583
q20	327	324	187	187
q21	2889	2272	2014	2014
q22	366	335	309	309
Total cold run time: 102855 ms
Total hot run time: 32038 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5373	5320	5342	5320
q2	242	360	230	230
q3	2223	2748	2294	2294
q4	1480	1848	1426	1426
q5	4247	4135	4155	4135
q6	224	166	125	125
q7	1878	1912	1787	1787
q8	2676	2743	2603	2603
q9	7293	7219	7218	7218
q10	3057	3245	2785	2785
q11	585	515	499	499
q12	734	784	631	631
q13	3473	3927	3266	3266
q14	285	294	258	258
q15	562	467	452	452
q16	647	696	660	660
q17	1187	1681	1334	1334
q18	7640	7324	7321	7321
q19	911	867	1009	867
q20	2009	2042	1868	1868
q21	5584	5015	4780	4780
q22	668	569	556	556
Total cold run time: 52978 ms
Total hot run time: 50415 ms

@doris-robot

TPC-DS: Total hot run time: 191000 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 303bc8f1233c3d99f347b6f8d558b4178abe2f5b, data reload: false

query1	1302	985	931	931
query2	6168	1879	1832	1832
query3	11001	4587	4430	4430
query4	57740	25599	23334	23334
query5	5048	496	481	481
query6	313	190	180	180
query7	4880	496	291	291
query8	304	243	230	230
query9	5634	2570	2566	2566
query10	441	317	262	262
query11	15066	15070	15043	15043
query12	161	111	107	107
query13	1042	515	381	381
query14	11104	6517	6902	6517
query15	214	202	178	178
query16	7145	647	488	488
query17	1078	731	580	580
query18	1521	409	298	298
query19	197	194	168	168
query20	135	127	118	118
query21	206	126	107	107
query22	4793	4721	4448	4448
query23	33928	33320	33315	33315
query24	5737	2449	2425	2425
query25	478	489	402	402
query26	683	280	158	158
query27	1840	543	343	343
query28	2759	2493	2437	2437
query29	582	545	434	434
query30	212	192	157	157
query31	891	892	811	811
query32	72	69	61	61
query33	434	372	329	329
query34	777	880	515	515
query35	805	826	763	763
query36	928	996	906	906
query37	118	95	73	73
query38	4271	4209	4223	4209
query39	1525	1433	1418	1418
query40	209	113	105	105
query41	53	49	48	48
query42	124	105	108	105
query43	505	514	466	466
query44	1356	846	815	815
query45	179	177	166	166
query46	922	1069	650	650
query47	1820	1860	1810	1810
query48	410	423	305	305
query49	718	531	410	410
query50	720	761	411	411
query51	4287	4291	4260	4260
query52	103	109	98	98
query53	244	255	197	197
query54	516	500	436	436
query55	82	77	79	77
query56	277	259	270	259
query57	1146	1193	1104	1104
query58	240	254	238	238
query59	2632	2810	2732	2732
query60	290	283	293	283
query61	146	140	153	140
query62	770	726	719	719
query63	233	202	203	202
query64	1564	1139	790	790
query65	3305	3299	3207	3207
query66	749	395	284	284
query67	15874	15586	15516	15516
query68	7840	873	508	508
query69	540	306	270	270
query70	1192	1125	1037	1037
query71	476	301	280	280
query72	5963	3546	3770	3546
query73	1449	744	358	358
query74	9268	9116	8821	8821
query75	3645	3186	2687	2687
query76	4246	1192	762	762
query77	621	373	281	281
query78	9959	10099	9275	9275
query79	2466	837	601	601
query80	681	526	437	437
query81	519	280	246	246
query82	702	124	93	93
query83	188	175	157	157
query84	290	98	79	79
query85	852	348	299	299
query86	414	305	274	274
query87	4360	4379	4469	4379
query88	3850	2222	2205	2205
query89	415	318	289	289
query90	1801	194	189	189
query91	149	145	107	107
query92	71	61	60	60
query93	1898	1023	591	591
query94	660	415	258	258
query95	349	270	261	261
query96	483	553	267	267
query97	3328	3393	3369	3369
query98	224	205	207	205
query99	1473	1413	1250	1250
Total cold run time: 301379 ms
Total hot run time: 191000 ms

@doris-robot

ClickBench: Total hot run time: 31.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 303bc8f1233c3d99f347b6f8d558b4178abe2f5b, data reload: false

query1	0.04	0.03	0.03
query2	0.11	0.04	0.05
query3	0.28	0.05	0.05
query4	1.61	0.07	0.07
query5	0.58	0.54	0.53
query6	1.18	0.73	0.73
query7	0.02	0.01	0.02
query8	0.06	0.05	0.05
query9	0.62	0.52	0.53
query10	0.57	0.58	0.57
query11	0.25	0.12	0.13
query12	0.25	0.13	0.13
query13	0.63	0.62	0.61
query14	2.73	2.84	2.69
query15	0.98	0.87	0.87
query16	0.37	0.37	0.36
query17	1.04	1.08	1.04
query18	0.18	0.19	0.18
query19	1.97	1.93	1.93
query20	0.02	0.02	0.02
query21	15.71	0.96	0.65
query22	0.99	1.09	0.82
query23	14.71	1.53	0.74
query24	5.34	0.61	0.29
query25	0.16	0.09	0.08
query26	0.56	0.22	0.18
query27	0.09	0.09	0.08
query28	11.03	1.16	0.54
query29	12.54	3.99	3.35
query30	0.27	0.07	0.06
query31	2.82	0.60	0.42
query32	3.21	0.58	0.49
query33	3.04	3.04	3.07
query34	16.81	5.13	4.40
query35	4.42	4.46	4.49
query36	0.63	0.51	0.50
query37	0.20	0.17	0.16
query38	0.17	0.15	0.14
query39	0.05	0.04	0.04
query40	0.20	0.15	0.15
query41	0.11	0.05	0.05
query42	0.06	0.06	0.06
query43	0.06	0.05	0.04
Total cold run time: 106.67 s
Total hot run time: 31.29 s

@hello-stephen

BE UT Coverage Report

Increment line coverage 0.00% (0/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 45.83% (12240/26707)
Line Coverage 35.34% (103490/292854)
Region Coverage 34.51% (53036/153690)
Branch Coverage 30.22% (26872/88926)

@dataroaring dataroaring left a comment

LGTM

bobhan1 commented Mar 4, 2025

run p0

bobhan1 commented Mar 4, 2025

run p0

@zhannngchen zhannngchen merged commit 54e7e94 into apache:master Mar 5, 2025
25 of 27 checks passed
bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 5, 2025
…c_rowsets` in publish phase (apache#48400)

bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 6, 2025
…c_rowsets` in publish phase (apache#48400)

dataroaring pushed a commit to bobhan1/doris that referenced this pull request Mar 7, 2025
…c_rowsets` in publish phase (apache#48400)

bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 10, 2025
…c_rowsets` in publish phase (apache#48400)

hello-stephen pushed a commit that referenced this pull request Mar 10, 2025
…ther to skip `sync_rowsets` in publish phase (#48400) (#48667)

pick #48400
dataroaring pushed a commit that referenced this pull request Mar 18, 2025
…t tablet states for `GetDeleteBitmapUpdateLockResponse` (#49165)

### What problem does this PR solve?

Fix for #48400: when FE sends a `GetDeleteBitmapUpdateLock` RPC to a
lower-version MS that does not set the tablet states field, FE hits an
`IndexOutOfBoundsException` while reading the response.
```
2025-03-17 18:05:35,224 WARN (thrift-server-pool-77|200) [FrontendServiceImpl.loadTxnCommit():1676] catch unknown result.
java.lang.IndexOutOfBoundsException: Index:0, Size:0
        at com.google.protobuf.LongArrayList.ensureIndexInRange(LongArrayList.java:288) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.getLong(LongArrayList.java:136) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.get(LongArrayList.java:131) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.get(LongArrayList.java:45) ~[protobuf-java-3.24.3.jar:?]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.getDeleteBitmapUpdateLock(CloudGlobalTransactionMgr.java:949) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.commitTransaction(CloudGlobalTransactionMgr.java:361) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.commitAndPublishTransaction(CloudGlobalTransactionMgr.java:1203) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.service.FrontendServiceImpl.loadTxnCommitImpl(FrontendServiceImpl.java:1730) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.service.FrontendServiceImpl.loadTxnCommit(FrontendServiceImpl.java:1660) ~[doris-fe.jar:1.2-SNAPSHOT]
        at jdk.internal.reflect.GeneratedMethodAccessor121.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
        at org.apache.doris.service.FeServer.lambda$start$0(FeServer.java:60) ~[doris-fe.jar:1.2-SNAPSHOT]
        at jdk.proxy2.$Proxy45.loadTxnCommit(Unknown Source) ~[?:?]
        at org.apache.doris.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:4282) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
        at org.apache.doris.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:4262) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.16.0.jar:0.16.0]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.16.0.jar:0.16.0]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:250) ~[libthrift-0.16.0.jar:0.16.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?] 
```
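The failure mode in the trace above is an index lookup into a repeated field that an older MS left empty. The following is a minimal Python sketch of one way to guard against it, not the actual change made in #49165: `pair_tablet_states` is a hypothetical helper, and representing "state unknown" as `None` is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def pair_tablet_states(
    tablet_ids: List[int], tablet_states: List[int]
) -> Dict[int, Optional[int]]:
    """Map each tablet id to the state reported by the meta service.

    An older meta service may not set the states field at all, in which case
    the list is empty and indexing it by position would raise an error.
    """
    if len(tablet_states) != len(tablet_ids):
        # States missing or incomplete: report "unknown" instead of indexing.
        return {tablet_id: None for tablet_id in tablet_ids}
    return dict(zip(tablet_ids, tablet_states))
```

A caller that sees `None` for a tablet would fall back to whatever it did before the states were available, so a mixed-version cluster keeps working during an upgrade.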
github-actions bot pushed a commit that referenced this pull request Mar 18, 2025
…t tablet states for `GetDeleteBitmapUpdateLockResponse` (#49165)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…c_rowsets` in publish phase (apache#48400)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…t tablet states for `GetDeleteBitmapUpdateLockResponse` (apache#49165)

Labels

approved Indicates a PR has been approved by one committer. dev/3.0.5-merged meta-change p0_w reviewed
