@bobhan1
Contributor

@bobhan1 bobhan1 commented Jul 11, 2024

Proposed changes

Due to #35838, when executing a load job, the BE will not sync_rowsets() in the publish phase if a compaction job on the same tablet finished on another BE between the commit phase and the publish phase of the current load job. This PR lets the meta service return the tablet compaction stats along with the getDeleteBitmapUpdateLockResponse to the FE, and the FE sends them to the BE so that the BE knows whether it should sync_rowsets() because of compaction on other BEs.
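
A minimal sketch of the check this makes possible on the BE side, assuming the stats are simple per-tablet counters. The names `CompactionStats` and `should_sync_rowsets` and the three counter fields are hypothetical and only illustrate the comparison; they are not the actual Doris types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionStats:
    """Hypothetical per-tablet compaction counters (illustrative only)."""
    base_compaction_cnt: int
    cumulative_compaction_cnt: int
    cumulative_point: int


def should_sync_rowsets(local: CompactionStats, from_ms: CompactionStats) -> bool:
    """Return True if the meta service reports a compaction the local BE has not seen."""
    return (
        from_ms.base_compaction_cnt > local.base_compaction_cnt
        or from_ms.cumulative_compaction_cnt > local.cumulative_compaction_cnt
        or from_ms.cumulative_point > local.cumulative_point
    )


# Stats cached on this BE versus stats forwarded by the FE from the meta service.
local = CompactionStats(1, 4, 10)
unchanged = CompactionStats(1, 4, 10)
compacted_elsewhere = CompactionStats(1, 5, 12)

print(should_sync_rowsets(local, unchanged))            # False
print(should_sync_rowsets(local, compacted_elsewhere))  # True
```

The comparison is one-directional on purpose in this sketch: only counters that moved forward on the meta service side indicate that another BE finished a compaction.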

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@bobhan1
Contributor Author

bobhan1 commented Jul 11, 2024

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 40373 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 44560f6de987c7b6eba61d9f6207a83d0e863a38, data reload: false

------ Round 1 ----------------------------------
q1	18024	4531	4369	4369
q2	2022	194	188	188
q3	10507	1291	1106	1106
q4	10224	849	824	824
q5	7554	2856	2682	2682
q6	223	137	139	137
q7	965	601	608	601
q8	9219	2117	2145	2117
q9	8792	6631	6632	6631
q10	8854	3821	3797	3797
q11	459	248	234	234
q12	412	232	237	232
q13	17762	2997	2995	2995
q14	277	229	248	229
q15	538	496	483	483
q16	497	400	383	383
q17	999	643	681	643
q18	8176	7628	7407	7407
q19	6016	1499	1543	1499
q20	665	327	326	326
q21	5080	3153	3268	3153
q22	396	337	346	337
Total cold run time: 117661 ms
Total hot run time: 40373 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4466	4284	4304	4284
q2	365	278	273	273
q3	3084	2894	2958	2894
q4	2002	1719	1763	1719
q5	5573	5549	5505	5505
q6	233	135	134	134
q7	2264	1842	1803	1803
q8	3329	3450	3461	3450
q9	8840	8900	8786	8786
q10	4210	3803	3927	3803
q11	586	514	496	496
q12	831	685	636	636
q13	16281	3182	3234	3182
q14	323	284	278	278
q15	548	496	492	492
q16	492	428	441	428
q17	1821	1534	1546	1534
q18	8194	8043	7931	7931
q19	1853	1652	1742	1652
q20	2179	1866	1899	1866
q21	5260	4943	4967	4943
q22	678	582	613	582
Total cold run time: 73412 ms
Total hot run time: 56671 ms

@doris-robot

TPC-DS: Total hot run time: 175142 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 44560f6de987c7b6eba61d9f6207a83d0e863a38, data reload: false

query1	919	368	365	365
query2	6401	2371	2390	2371
query3	6633	212	219	212
query4	28504	17626	17395	17395
query5	3783	482	493	482
query6	263	164	163	163
query7	4573	298	290	290
query8	330	303	298	298
query9	8623	2488	2484	2484
query10	437	299	266	266
query11	10440	10087	10100	10087
query12	118	85	82	82
query13	1641	371	381	371
query14	9922	7657	7694	7657
query15	236	194	191	191
query16	7772	328	322	322
query17	1763	596	541	541
query18	1970	280	285	280
query19	203	159	159	159
query20	93	84	85	84
query21	211	123	128	123
query22	4378	4044	4005	4005
query23	34114	33833	33698	33698
query24	11194	2915	2995	2915
query25	623	500	390	390
query26	701	149	151	149
query27	2271	280	281	280
query28	6395	2170	2158	2158
query29	900	645	633	633
query30	246	158	155	155
query31	952	743	776	743
query32	94	52	56	52
query33	762	318	300	300
query34	984	506	522	506
query35	699	617	600	600
query36	1136	995	965	965
query37	144	80	87	80
query38	2959	2843	2830	2830
query39	904	814	813	813
query40	213	119	115	115
query41	53	49	51	49
query42	114	101	104	101
query43	568	542	524	524
query44	1232	726	737	726
query45	197	167	160	160
query46	1106	730	722	722
query47	1841	1757	1787	1757
query48	369	287	297	287
query49	834	412	425	412
query50	781	415	396	396
query51	6929	6731	6767	6731
query52	98	97	97	97
query53	364	295	301	295
query54	917	459	447	447
query55	75	72	72	72
query56	282	268	282	268
query57	1148	1048	1042	1042
query58	255	240	250	240
query59	3614	3198	3298	3198
query60	302	278	278	278
query61	96	92	93	92
query62	814	644	643	643
query63	331	303	293	293
query64	9250	2205	1659	1659
query65	3220	3112	3138	3112
query66	751	327	323	323
query67	15698	15130	14947	14947
query68	4570	537	530	530
query69	609	413	334	334
query70	1217	1142	1156	1142
query71	433	290	273	273
query72	7665	5550	5865	5550
query73	755	332	326	326
query74	6030	5535	5495	5495
query75	3499	2672	2701	2672
query76	2694	921	943	921
query77	628	300	299	299
query78	9692	9036	8937	8937
query79	3458	525	520	520
query80	2042	472	468	468
query81	592	216	222	216
query82	777	134	139	134
query83	275	180	166	166
query84	273	94	84	84
query85	1318	321	296	296
query86	430	291	284	284
query87	3345	3061	3109	3061
query88	5069	2452	2467	2452
query89	494	382	401	382
query90	1907	192	193	192
query91	133	101	107	101
query92	63	52	49	49
query93	4832	519	508	508
query94	1194	221	214	214
query95	414	325	320	320
query96	612	282	280	280
query97	3197	3051	3013	3013
query98	224	204	192	192
query99	1545	1250	1284	1250
Total cold run time: 285729 ms
Total hot run time: 175142 ms

@doris-robot

ClickBench: Total hot run time: 30.44 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 44560f6de987c7b6eba61d9f6207a83d0e863a38, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.22	0.05	0.04
query4	1.68	0.07	0.07
query5	0.50	0.50	0.48
query6	1.13	0.73	0.73
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.57	0.48	0.49
query10	0.53	0.54	0.54
query11	0.16	0.11	0.11
query12	0.14	0.12	0.13
query13	0.61	0.58	0.58
query14	0.76	0.76	0.79
query15	0.85	0.81	0.83
query16	0.37	0.37	0.34
query17	0.94	1.04	1.04
query18	0.24	0.21	0.22
query19	1.92	1.81	1.81
query20	0.01	0.01	0.01
query21	15.43	0.76	0.65
query22	4.44	7.50	1.76
query23	18.24	1.44	1.22
query24	2.16	0.24	0.21
query25	0.16	0.08	0.09
query26	0.29	0.21	0.21
query27	0.45	0.23	0.24
query28	13.26	1.01	1.00
query29	12.64	3.36	3.34
query30	0.25	0.06	0.06
query31	2.87	0.39	0.39
query32	3.26	0.47	0.46
query33	2.90	2.84	2.94
query34	17.09	4.32	4.36
query35	4.44	4.37	4.44
query36	0.65	0.48	0.47
query37	0.18	0.15	0.16
query38	0.15	0.15	0.15
query39	0.05	0.03	0.04
query40	0.14	0.12	0.12
query41	0.09	0.04	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 110.07 s
Total hot run time: 30.44 s

@bobhan1
Contributor Author

bobhan1 commented Jul 11, 2024

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39991 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f8555c4d298870c21046a70bc5780ff1e5411ba5, data reload: false

------ Round 1 ----------------------------------
q1	17612	4530	4302	4302
q2	2026	194	192	192
q3	10446	1162	1088	1088
q4	10203	784	725	725
q5	7532	2701	2665	2665
q6	227	133	136	133
q7	964	606	615	606
q8	9227	2121	2065	2065
q9	8762	6539	6579	6539
q10	8788	3845	3725	3725
q11	516	242	242	242
q12	395	238	229	229
q13	17766	2973	2969	2969
q14	288	237	246	237
q15	528	498	495	495
q16	491	395	380	380
q17	962	677	794	677
q18	8162	7542	7413	7413
q19	6363	1561	1448	1448
q20	674	332	335	332
q21	5051	3190	3277	3190
q22	403	344	339	339
Total cold run time: 117386 ms
Total hot run time: 39991 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4465	4269	4310	4269
q2	373	276	269	269
q3	2997	2908	2945	2908
q4	2019	1659	1742	1659
q5	5625	5552	5480	5480
q6	230	132	135	132
q7	2258	1875	1848	1848
q8	3270	3439	3407	3407
q9	8850	8927	8813	8813
q10	4189	3766	3863	3766
q11	602	508	500	500
q12	793	671	650	650
q13	16278	3169	3172	3169
q14	322	280	287	280
q15	538	492	492	492
q16	492	438	439	438
q17	1824	1529	1514	1514
q18	8253	7985	7803	7803
q19	1789	1528	1695	1528
q20	2132	1829	1846	1829
q21	5188	4758	4828	4758
q22	633	547	551	547
Total cold run time: 73120 ms
Total hot run time: 56059 ms

@doris-robot

TPC-DS: Total hot run time: 175580 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f8555c4d298870c21046a70bc5780ff1e5411ba5, data reload: false

query1	928	371	369	369
query2	6434	2523	2477	2477
query3	6636	213	224	213
query4	28285	17442	17195	17195
query5	3752	497	479	479
query6	258	185	157	157
query7	4576	296	294	294
query8	311	292	301	292
query9	8521	2480	2471	2471
query10	440	286	271	271
query11	11180	9962	10022	9962
query12	117	86	84	84
query13	1639	375	383	375
query14	10483	8035	7825	7825
query15	250	191	187	187
query16	7202	338	330	330
query17	1751	582	551	551
query18	1766	291	284	284
query19	208	159	153	153
query20	90	81	87	81
query21	213	131	129	129
query22	4281	4162	4004	4004
query23	34077	34651	33889	33889
query24	11574	3005	2994	2994
query25	518	389	376	376
query26	1095	153	157	153
query27	2671	285	283	283
query28	7233	2178	2197	2178
query29	903	641	637	637
query30	252	156	152	152
query31	963	764	764	764
query32	98	52	55	52
query33	751	305	319	305
query34	995	516	512	512
query35	690	584	587	584
query36	1150	1012	994	994
query37	145	82	87	82
query38	2947	2871	2786	2786
query39	899	865	825	825
query40	265	120	126	120
query41	56	54	50	50
query42	124	99	99	99
query43	581	549	563	549
query44	1182	737	728	728
query45	198	166	165	165
query46	1086	755	728	728
query47	1867	1771	1773	1771
query48	395	313	304	304
query49	871	410	414	410
query50	786	417	395	395
query51	6876	6744	6622	6622
query52	107	93	96	93
query53	367	294	296	294
query54	904	449	444	444
query55	76	74	76	74
query56	287	261	266	261
query57	1148	1021	1088	1021
query58	261	255	247	247
query59	3423	3355	3219	3219
query60	294	277	274	274
query61	99	90	121	90
query62	813	639	630	630
query63	327	292	292	292
query64	10269	2245	1680	1680
query65	3185	3087	3097	3087
query66	977	326	337	326
query67	15555	15183	15107	15107
query68	4622	534	542	534
query69	576	374	338	338
query70	1130	1142	1048	1048
query71	447	277	274	274
query72	7952	5945	5771	5771
query73	745	331	335	331
query74	6053	5516	5541	5516
query75	4099	2673	2703	2673
query76	3190	943	875	875
query77	681	307	300	300
query78	9773	8965	8890	8890
query79	2864	526	521	521
query80	2231	478	477	477
query81	614	220	223	220
query82	905	137	144	137
query83	308	173	167	167
query84	269	93	87	87
query85	1304	368	297	297
query86	446	325	330	325
query87	3260	3084	3047	3047
query88	3867	2465	2476	2465
query89	486	383	386	383
query90	1831	199	197	197
query91	133	101	103	101
query92	67	50	49	49
query93	2611	513	513	513
query94	1162	212	217	212
query95	417	321	314	314
query96	610	285	273	273
query97	3174	2989	2989	2989
query98	225	202	195	195
query99	1494	1232	1270	1232
Total cold run time: 286207 ms
Total hot run time: 175580 ms

@doris-robot

ClickBench: Total hot run time: 31.06 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f8555c4d298870c21046a70bc5780ff1e5411ba5, data reload: false

query1	0.03	0.03	0.04
query2	0.07	0.03	0.04
query3	0.22	0.05	0.05
query4	1.68	0.07	0.07
query5	0.50	0.49	0.49
query6	1.13	0.72	0.73
query7	0.02	0.01	0.01
query8	0.06	0.05	0.04
query9	0.55	0.48	0.50
query10	0.54	0.56	0.56
query11	0.15	0.11	0.11
query12	0.14	0.12	0.12
query13	0.58	0.58	0.58
query14	0.78	0.77	0.82
query15	0.85	0.81	0.81
query16	0.36	0.38	0.37
query17	1.05	1.06	1.07
query18	0.24	0.22	0.22
query19	1.84	1.81	1.69
query20	0.01	0.01	0.01
query21	15.41	0.78	0.68
query22	4.41	7.00	2.27
query23	18.23	1.30	1.31
query24	2.09	0.23	0.22
query25	0.15	0.10	0.08
query26	0.31	0.22	0.21
query27	0.46	0.23	0.23
query28	13.33	1.02	1.01
query29	12.63	3.32	3.27
query30	0.26	0.06	0.05
query31	2.87	0.40	0.38
query32	3.29	0.47	0.46
query33	2.93	2.96	2.91
query34	17.08	4.33	4.36
query35	4.42	4.40	4.47
query36	0.66	0.46	0.46
query37	0.18	0.16	0.16
query38	0.16	0.15	0.15
query39	0.04	0.04	0.03
query40	0.15	0.12	0.11
query41	0.10	0.04	0.05
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 110.06 s
Total hot run time: 31.06 s

Contributor

@zhannngchen zhannngchen left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 12, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 5f51a85 into apache:master Jul 13, 2024
dataroaring pushed a commit that referenced this pull request Jul 13, 2024
…phase if compaction on other BE finished during this load (#37670)

## Proposed changes
Due to #35838, when executing a load
job, the BE will not `sync_rowsets()` in the publish phase if a compaction
job on the same tablet finished on another BE between the commit phase
and the publish phase of the current load job. This PR lets the meta
service return the tablet compaction stats along with the
getDeleteBitmapUpdateLockResponse to the FE, and the FE sends them to the
BE so that the BE knows whether it should `sync_rowsets()` because of
compaction on other BEs.
seawinde pushed a commit to seawinde/doris that referenced this pull request Jul 17, 2024
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 18, 2024
…publish phase if compaction on other BE finished during this load (apache#37670)"

This reverts commit 5f51a85.
dataroaring pushed a commit that referenced this pull request Aug 7, 2024
## Proposed changes

fix a typo in #37670
dataroaring pushed a commit that referenced this pull request Aug 7, 2024
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Aug 14, 2024
@gavinchou gavinchou mentioned this pull request Aug 19, 2024
dataroaring pushed a commit that referenced this pull request Aug 28, 2024
…ad in calcDeleteBitmapForMow (#39791)

## Proposed changes

Issue Number: close #xxx

#37670 lets FE call get_delete_bitmap_update_lock in
calcDeleteBitmapForMow to also get the latest compaction stats of the
corresponding partition, so that in the downstream calculate delete
bitmap task the BE can determine whether there is a concurrent
compaction conflict.

However, that PR uses a snapshot read when fetching the partition stats,
which makes it possible to fetch outdated stats, so the BE can miss the
conflict handling for a concurrent compaction and generate duplicate
keys.
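
To illustrate why a snapshot read can hide a concurrent compaction, here is a toy optimistic-transaction model. It assumes "snapshot read" has the FoundationDB-style meaning of a read that is not tracked for conflicts at commit time. `ToyStore`, `ToyTxn`, `take_lock`, and the key names are all hypothetical; this is not the meta service code.

```python
class ConflictError(Exception):
    """Raised when a tracked key changed between the read and the commit."""


class ToyStore:
    """Toy versioned KV store with optimistic transactions (not FoundationDB itself)."""

    def __init__(self):
        self.data = {}
        self.last_write = {}  # key -> version of its last committed write
        self.version = 0

    def put(self, key, value):
        self.version += 1
        self.data[key] = value
        self.last_write[key] = self.version

    def begin(self):
        return ToyTxn(self)


class ToyTxn:
    def __init__(self, store):
        self.store = store
        self.read_version = store.version
        self.read_keys = set()  # keys checked for conflicts at commit
        self.writes = {}

    def get(self, key, snapshot=False):
        # A snapshot read skips conflict tracking for this key.
        if not snapshot:
            self.read_keys.add(key)
        return self.store.data.get(key)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.read_keys:
            if self.store.last_write.get(key, 0) > self.read_version:
                raise ConflictError(key)
        for key, value in self.writes.items():
            self.store.put(key, value)


def take_lock(store, snapshot):
    """Read the compaction count, let a compaction land concurrently, then commit."""
    txn = store.begin()
    seen = txn.get("tablet/1/compaction_cnt", snapshot=snapshot)
    store.put("tablet/1/compaction_cnt", seen + 1)  # concurrent compaction commits
    txn.set("lock/1", "load-42")
    txn.commit()
    return seen


store = ToyStore()
store.put("tablet/1/compaction_cnt", 3)
stale = take_lock(store, snapshot=True)
print(stale, store.data["tablet/1/compaction_cnt"])  # 3 4

store = ToyStore()
store.put("tablet/1/compaction_cnt", 3)
try:
    take_lock(store, snapshot=False)
except ConflictError:
    print("conflict detected, retry with fresh stats")
```

With the snapshot read, the lock is committed while the load still holds the old count of 3. With the tracked read, the commit fails and the caller gets a chance to re-read the stats.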
dataroaring pushed a commit that referenced this pull request Aug 28, 2024
gavinchou pushed a commit that referenced this pull request Dec 17, 2024
…ck be able to be in different fdb txns (#45206)

#37670 lets the meta service return
the tablet compaction stats along with the
`getDeleteBitmapUpdateLockResponse` to FE so that the BE knows whether it
should `sync_rowsets()` because of a successful compaction on other BEs on
the same tablet. That PR reads the tablets' stats and writes the delete
bitmap update lock KV in one fdb txn to achieve atomic semantics.
However, when a load involves a large number of tablets, reading the
tablets' stats may take longer than the fdb txn's 5-second limit and
cause a `TXN_TOO_OLD` error.
This PR rearranges the process so that the read of the tablet stats no
longer has to be in the same fdb txn as the one that updates
lock_info.lock_id. In detail, we do the following:
1. acquire the delete bitmap update lock in MS (write the delete bitmap
update lock KV)
2. read the tablets' stats to get the compaction counts.
3. check whether the delete bitmap update lock is still held by the
current load.
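
The three steps above can be sketched roughly as follows. This is an in-memory stand-in, not the meta service implementation: `ToyMetaService`, `get_lock_and_stats`, the `batch_size` parameter, and the `between_steps` hook are hypothetical names introduced only to show the ordering and the final ownership check.

```python
class LockLost(Exception):
    """The lock was taken over by another load while the stats were being read."""


class ToyMetaService:
    """In-memory stand-in for the meta service KV store (illustrative only)."""

    def __init__(self, compaction_cnts):
        self.lock_owner = {}                          # table_id -> lock_id
        self.compaction_cnts = dict(compaction_cnts)  # tablet_id -> compaction count

    def write_lock(self, table_id, lock_id):
        # Step 1: a short transaction that only writes the lock KV.
        self.lock_owner[table_id] = lock_id

    def read_stats(self, tablet_ids, batch_size=2):
        # Step 2: read in small batches; each batch could be its own transaction,
        # so no single transaction has to cover every tablet of the load.
        stats = {}
        for start in range(0, len(tablet_ids), batch_size):
            for tablet_id in tablet_ids[start:start + batch_size]:
                stats[tablet_id] = self.compaction_cnts[tablet_id]
        return stats

    def lock_still_held(self, table_id, lock_id):
        # Step 3: re-read the lock KV and compare the owner.
        return self.lock_owner.get(table_id) == lock_id


def get_lock_and_stats(ms, table_id, lock_id, tablet_ids, between_steps=None):
    ms.write_lock(table_id, lock_id)
    stats = ms.read_stats(tablet_ids)
    if between_steps is not None:
        between_steps()  # test hook: simulate another load grabbing the lock
    if not ms.lock_still_held(table_id, lock_id):
        raise LockLost(lock_id)
    return stats


ms = ToyMetaService({101: 3, 102: 0, 103: 7})
print(get_lock_and_stats(ms, table_id=1, lock_id=42, tablet_ids=[101, 102, 103]))
# {101: 3, 102: 0, 103: 7}

try:
    get_lock_and_stats(ms, 1, 42, [101, 102, 103],
                       between_steps=lambda: ms.write_lock(1, 99))
except LockLost:
    print("lock lost, stats must not be trusted")
```

The final check is what replaces the atomicity of the single transaction: stats read while the lock was continuously held are usable, and anything else is rejected.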
github-actions bot pushed a commit that referenced this pull request Dec 17, 2024
zhannngchen pushed a commit that referenced this pull request Mar 5, 2025
…c_rowsets` in publish phase (#48400)

### What problem does this PR solve?

Consider the following situation:

1. a heavy schema change (SC) begins
2. an alter task on tablet X (to tablet Y) is sent to be1
3. be1 shuts down for some reason
4. new loads on the new tablet Y are routed to be2 (which skips
calculating delete bitmaps in the commit phase and the publish phase
because the tablet's state is `NOT_READY`)
5. be1 restarts and resumes the alter task
6. the alter task on be1 finishes and changes the tablet's state to
`RUNNING` in MS
7. some load on tablet Y on be2 skips calculating the delete bitmap
because it doesn't know the tablet's state has changed, which causes a
duplicate key problem

Like #37670, this PR lets the meta
service return the tablet states along with the
getDeleteBitmapUpdateLockResponse to FE, and FE sends them to BE so that
the BE knows whether it should sync_rowsets() because of a tablet state
change on other BEs.
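
As a rough illustration of step 7 and of the fix, the sketch below models what be2 might do in the publish phase with and without the state forwarded from MS. `TabletState`, `publish_actions`, and the action strings are hypothetical and exist only for this example.

```python
from enum import Enum


class TabletState(Enum):
    NOT_READY = "NOT_READY"
    RUNNING = "RUNNING"


def publish_actions(cached_state, state_from_ms=None):
    """Return the (hypothetical) actions a BE takes in the publish phase."""
    actions = []
    state = cached_state
    # If MS reports a different state, refresh local metadata first.
    if state_from_ms is not None and state_from_ms != cached_state:
        actions.append("sync_rowsets")
        state = state_from_ms
    if state == TabletState.RUNNING:
        actions.append("calc_delete_bitmap")
    else:
        actions.append("skip_delete_bitmap")
    return actions


# Before the fix: be2 only knows its stale cached state.
print(publish_actions(TabletState.NOT_READY))
# ['skip_delete_bitmap']

# After the fix: the state from MS is forwarded with the lock response.
print(publish_actions(TabletState.NOT_READY, TabletState.RUNNING))
# ['sync_rowsets', 'calc_delete_bitmap']
```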
bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 5, 2025
bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 6, 2025
dataroaring pushed a commit to bobhan1/doris that referenced this pull request Mar 7, 2025
bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 10, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
Labels

approved Indicates a PR has been approved by one committer. dev/3.0.0-merged meta-change reviewed

4 participants