@github-actions
Contributor

Cherry-picked from #55092

…ped segments (#55092)

### What problem does this PR solve?

Fix the segment number mismatch caused by erroneously skipped segments
during concurrent incremental open on an auto-partitioned table.

#### Problem
During concurrent incremental open on an auto-partitioned table, one
sink may incorrectly assume that streams opened by another sink have
already been opened and begin writing data while those streams are
still being opened. This leads to some segments being silently skipped
and results in a segment number mismatch. For example (two instances, 4
BEs: a, b, c, d):

| Time | Event |
| ---- | ----- |
| t0 | `sink1` and `sink2` start incremental open for BEs **a, b, c, d**. |
| t1 | `sink1` adds **a, b, c** to `_load_stream_map` and initiates open. |
| t2 | `sink2` adds **d** to `_load_stream_map` and initiates open. |
| t3 | `sink1` completes open for **a** and **b**; **c** is still in progress. |
| t4 | `sink2` successfully opens **d**, assumes **a, b, c** are **all** ready, and starts writing. Because **c** is not yet fully open, its segments are skipped, causing the mismatch. |
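
The timeline above can be replayed deterministically with a small model. This is an illustrative sketch only, not Doris code: the `streams` dict merely stands in for `_load_stream_map`, the helper names are hypothetical, and the readiness check in `buggy_write` is an assumed simplification of the behavior described in the table.

```python
# Illustrative model only; every name here is hypothetical, not Doris code.
OPENING, OPENED = "opening", "opened"

streams = {}  # stands in for _load_stream_map: BE name -> open state


def start_open(bes):
    """A sink registers the streams it is responsible for and starts opening them."""
    for be in bes:
        streams[be] = OPENING


def finish_open(bes):
    """The open for these streams has completed."""
    for be in bes:
        streams[be] = OPENED


def buggy_write(own_bes, all_bes):
    """Assumed buggy rule: the sink only waits for the streams it opened itself,
    then writes to whatever happens to be open and silently skips the rest."""
    assert all(streams[be] == OPENED for be in own_bes)
    written = [be for be in all_bes if streams[be] == OPENED]
    skipped = [be for be in all_bes if streams[be] != OPENED]
    return written, skipped


all_bes = ["a", "b", "c", "d"]
start_open(["a", "b", "c"])  # t1: sink1
start_open(["d"])            # t2: sink2
finish_open(["a", "b"])      # t3: c is still in progress
finish_open(["d"])           # t4: sink2 finishes its own open
written, skipped = buggy_write(["d"], all_bes)
print("written:", written, "skipped:", skipped)
```

In this model `sink2` writes to **a**, **b**, and **d** while **c** is skipped, which corresponds to the mismatch at t4.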

#### Expected behavior
A sink must wait until all streams it depends on are fully opened before
starting any write.

#### Proposed fix
All sinks open the full set of streams (a, b, c, d) instead of a partial
subset. A lock on each stream guarantees that:
- Duplicate open attempts are prevented: only the first sink performs the
actual open; subsequent sinks wait until the open is complete.
- Expected behavior is preserved: every sink waits until all streams are
fully opened before starting any write, eliminating skipped segments and
the resulting segment-number mismatch.
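
The idea behind the fix can be sketched as follows. This is a minimal model under stated assumptions, not the actual patch: the `Stream` class, `open_once`, and `run_sink` are hypothetical names, and a short `time.sleep` stands in for a slow open.

```python
import threading
import time


class Stream:
    """Hypothetical stand-in for a load stream to one BE."""

    def __init__(self, be):
        self.be = be
        self.lock = threading.Lock()
        self.opened = False
        self.open_calls = 0  # how many times the real open actually ran

    def open_once(self):
        # The per-stream lock serializes sinks: the first one performs the open,
        # later ones block here until it has finished and then see opened == True.
        with self.lock:
            if not self.opened:
                time.sleep(0.01)  # simulate a slow open
                self.open_calls += 1
                self.opened = True


def run_sink(streams, observed):
    # Every sink opens the full set of streams, not a partial subset.
    for stream in streams:
        stream.open_once()
    # Only now may the sink start writing; record what it sees at that point.
    observed.append(all(stream.opened for stream in streams))


streams = [Stream(be) for be in "abcd"]
observed = []
sinks = [threading.Thread(target=run_sink, args=(streams, observed)) for _ in range(2)]
for sink in sinks:
    sink.start()
for sink in sinks:
    sink.join()
print(observed, [stream.open_calls for stream in streams])
```

In this model each stream is opened exactly once, and both sinks observe all four streams as open before they would start writing.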

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
@github-actions github-actions bot requested a review from morrySnow as a code owner August 29, 2025 07:36
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Aug 29, 2025
@hello-stephen
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 32730 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit efc78a20d56ac846d6d9ac4e685b72c8d63b41d6, data reload: false

------ Round 1 ----------------------------------
q1	17584	5473	5661	5473
q2	2029	394	290	290
q3	12111	1257	775	775
q4	10336	870	465	465
q5	9514	2394	2168	2168
q6	189	163	132	132
q7	916	742	632	632
q8	9346	1455	1175	1175
q9	5203	4968	4871	4871
q10	6765	2285	1799	1799
q11	482	288	274	274
q12	346	357	221	221
q13	17797	3560	3005	3005
q14	223	221	216	216
q15	530	473	464	464
q16	427	431	373	373
q17	630	857	380	380
q18	6901	6416	6282	6282
q19	1212	945	558	558
q20	326	351	210	210
q21	3088	2240	1994	1994
q22	1059	1011	973	973
Total cold run time: 107014 ms
Total hot run time: 32730 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5489	5479	5480	5479
q2	233	323	231	231
q3	2240	2608	2352	2352
q4	1351	1818	1319	1319
q5	4563	5075	5054	5054
q6	177	167	134	134
q7	2103	2034	1861	1861
q8	2689	2826	2745	2745
q9	7265	7253	7198	7198
q10	3060	3294	2738	2738
q11	595	531	486	486
q12	686	781	583	583
q13	3381	3798	3208	3208
q14	278	301	281	281
q15	518	476	457	457
q16	432	496	435	435
q17	1249	1694	1293	1293
q18	7620	7440	7264	7264
q19	844	1077	1136	1077
q20	2029	2058	1919	1919
q21	5319	4887	4505	4505
q22	1079	1040	983	983
Total cold run time: 53200 ms
Total hot run time: 51602 ms

@doris-robot

TPC-DS: Total hot run time: 192710 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit efc78a20d56ac846d6d9ac4e685b72c8d63b41d6, data reload: false

query1	949	384	385	384
query2	6242	1884	1861	1861
query3	8700	201	197	197
query4	33525	23708	23448	23448
query5	3746	608	447	447
query6	291	184	179	179
query7	4196	517	321	321
query8	322	253	243	243
query9	9486	2605	2596	2596
query10	487	333	284	284
query11	18352	15403	15255	15255
query12	161	112	106	106
query13	1555	560	411	411
query14	9100	6795	7364	6795
query15	253	195	185	185
query16	8582	670	493	493
query17	1533	768	605	605
query18	2313	433	331	331
query19	227	206	168	168
query20	129	121	118	118
query21	207	130	114	114
query22	4597	4438	4350	4350
query23	35018	34415	34726	34415
query24	7237	2780	2781	2780
query25	506	501	433	433
query26	756	260	176	176
query27	2063	503	358	358
query28	5185	2207	2192	2192
query29	645	552	473	473
query30	241	192	168	168
query31	1024	920	815	815
query32	86	58	63	58
query33	497	374	322	322
query34	810	872	518	518
query35	788	860	752	752
query36	1026	1060	933	933
query37	104	92	69	69
query38	4112	4060	3970	3970
query39	1561	1498	1497	1497
query40	209	134	115	115
query41	58	50	47	47
query42	119	107	103	103
query43	507	511	482	482
query44	1375	833	839	833
query45	183	189	177	177
query46	919	1052	688	688
query47	2006	1989	1927	1927
query48	411	426	340	340
query49	735	513	429	429
query50	713	719	445	445
query51	7352	7299	7134	7134
query52	102	109	96	96
query53	245	266	199	199
query54	556	539	483	483
query55	84	79	79	79
query56	270	284	281	281
query57	1281	1269	1235	1235
query58	239	221	219	219
query59	3095	3101	3054	3054
query60	310	294	286	286
query61	116	113	122	113
query62	802	742	689	689
query63	233	214	214	214
query64	4209	1051	649	649
query65	3332	3307	3250	3250
query66	832	456	309	309
query67	16066	15803	15637	15637
query68	7797	835	547	547
query69	497	313	270	270
query70	1168	1122	1082	1082
query71	378	304	267	267
query72	5671	3809	3853	3809
query73	656	768	353	353
query74	10296	9346	9180	9180
query75	3259	3141	2675	2675
query76	3147	1182	766	766
query77	712	380	288	288
query78	10434	10478	9679	9679
query79	4470	868	592	592
query80	788	543	447	447
query81	506	254	214	214
query82	734	124	93	93
query83	174	165	146	146
query84	279	98	81	81
query85	783	365	303	303
query86	395	318	313	313
query87	4319	4364	4194	4194
query88	5262	2398	2480	2398
query89	414	332	304	304
query90	1806	189	190	189
query91	143	153	107	107
query92	63	62	54	54
query93	2904	895	535	535
query94	676	401	310	310
query95	346	279	274	274
query96	490	611	280	280
query97	3228	3284	3104	3104
query98	230	211	202	202
query99	1572	1403	1321	1321
Total cold run time: 296568 ms
Total hot run time: 192710 ms

@doris-robot

ClickBench: Total hot run time: 28.28 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit efc78a20d56ac846d6d9ac4e685b72c8d63b41d6, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.04
query3	0.23	0.07	0.06
query4	1.62	0.12	0.11
query5	0.54	0.52	0.51
query6	1.13	0.73	0.73
query7	0.02	0.01	0.02
query8	0.04	0.03	0.03
query9	0.57	0.52	0.50
query10	0.57	0.55	0.56
query11	0.15	0.10	0.11
query12	0.14	0.10	0.11
query13	0.62	0.60	0.59
query14	0.78	0.80	0.79
query15	0.84	0.84	0.83
query16	0.38	0.40	0.40
query17	1.08	1.03	1.00
query18	0.25	0.22	0.23
query19	1.96	1.78	1.80
query20	0.01	0.01	0.01
query21	15.39	0.93	0.58
query22	0.76	0.70	0.72
query23	15.16	1.49	0.52
query24	3.38	0.34	1.14
query25	0.27	0.09	0.11
query26	0.38	0.15	0.13
query27	0.05	0.07	0.04
query28	13.15	1.07	0.44
query29	12.59	3.89	3.25
query30	0.25	0.09	0.06
query31	2.82	0.61	0.37
query32	3.22	0.53	0.45
query33	3.02	3.01	3.06
query34	16.58	5.24	4.52
query35	4.56	4.60	4.63
query36	0.63	0.51	0.50
query37	0.08	0.06	0.06
query38	0.04	0.03	0.03
query39	0.04	0.02	0.02
query40	0.16	0.13	0.12
query41	0.07	0.02	0.03
query42	0.03	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 103.69 s
Total hot run time: 28.28 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 45.48% (12722/27973)
Line Coverage 36.36% (113428/311929)
Region Coverage 33.99% (64932/191008)
Branch Coverage 31.04% (34089/109812)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 76.46% (21030/27503)
Line Coverage 69.82% (217064/310872)
Region Coverage 67.77% (129938/191742)
Branch Coverage 61.33% (67648/110304)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 76.46% (21030/27503)
Line Coverage 69.82% (217050/310872)
Region Coverage 67.77% (129935/191742)
Branch Coverage 61.33% (67645/110304)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 76.46% (21030/27503)
Line Coverage 69.83% (217071/310872)
Region Coverage 67.77% (129947/191742)
Branch Coverage 61.34% (67655/110304)

@morrySnow morrySnow merged commit fac9ba3 into branch-3.1 Sep 4, 2025
21 of 23 checks passed
@github-actions github-actions bot deleted the auto-pick-55092-branch-3.1 branch September 4, 2025 01:58
@morrySnow morrySnow mentioned this pull request Sep 22, 2025