@github-actions
Contributor

Cherry-picked from #55092

…ped segments (#55092)

### What problem does this PR solve?

Fix segment number mismatch caused by erroneously skipped segments
during concurrent incremental open on an auto-partitioned table:

#### Problem
During concurrent incremental open on an auto-partitioned table, one
sink may incorrectly assume that streams opened by another sink have
already been opened and begin writing data while those streams are
still being opened. This leads to some segments being silently skipped
and results in a segment number mismatch. For example (two instances, 4
BEs: a, b, c, d):
| Time | Event |
| ---- | ----- |
| t0 | `sink1` and `sink2` start incremental open for BEs **a, b, c, d**. |
| t1 | `sink1` adds **a, b, c** to `_load_stream_map` and initiates open. |
| t2 | `sink2` adds **d** to `_load_stream_map` and initiates open. |
| t3 | `sink1` completes open for **a** and **b**; **c** is still in progress. |
| t4 | `sink2` successfully opens **d**, assumes **a, b, c** are **all** ready, and starts writing. Because **c** is not yet fully open, its segments are skipped, causing the mismatch. |
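
The timeline above can be replayed in a few lines of Python. This is only a simplified model for illustration, not Doris code: the dictionary stands in for `_load_stream_map`, and the helper names `replay_timeline`, `buggy_ready`, and `all_ready` are hypothetical.

```python
def replay_timeline():
    """Replay t1..t4 and return the modeled stream map (backend -> state)."""
    stream_map = {}
    # t1: sink1 registers a, b, c and starts opening them.
    for be in ("a", "b", "c"):
        stream_map[be] = "opening"
    # t2: sink2 registers only d and starts opening it.
    stream_map["d"] = "opening"
    # t3: sink1 finishes a and b; c is still in progress.
    stream_map["a"] = "open"
    stream_map["b"] = "open"
    # t4: sink2 finishes d.
    stream_map["d"] = "open"
    return stream_map


def buggy_ready(stream_map, opened_by_me):
    """Faulty check: only looks at the streams this sink opened itself."""
    return all(stream_map[be] == "open" for be in opened_by_me)


def all_ready(stream_map):
    """Intended check: every stream the sink writes to must be open."""
    return all(state == "open" for state in stream_map.values())


state_at_t4 = replay_timeline()
sink2_starts_writing = buggy_ready(state_at_t4, ["d"])
really_ready = all_ready(state_at_t4)
not_open_yet = [be for be, st in state_at_t4.items() if st != "open"]
```

In this model `sink2` decides to write at t4 even though `not_open_yet` still contains **c**, which corresponds to the skipped segments described above.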

#### Expected behavior
A sink must wait until all streams it depends on are fully opened before
starting any write.

#### Proposed fix
All sinks open the full set of streams (a, b, c, d) instead of a partial
subset. A lock on each stream guarantees that:
- Duplicate open attempts are prevented: only the first sink performs the
actual open; subsequent sinks wait until the open is complete.
- Expected behavior is preserved: every sink waits until all streams are
fully opened before starting any write, eliminating skipped segments and
the resulting segment-number mismatch.
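
A minimal sketch of the locking idea, assuming a per-stream mutex and an "already opened" flag. The actual change lives in the Doris BE and may differ in detail; `LoadStream`, `Sink`, `open_once`, and `run_demo` are hypothetical names used only for this illustration.

```python
import threading
import time


class LoadStream:
    """Hypothetical stand-in for one per-BE load stream."""

    def __init__(self, backend):
        self.backend = backend
        self._lock = threading.Lock()
        self.is_open = False
        self.open_calls = 0  # counts how often the real open ran

    def open_once(self):
        # The per-stream lock serializes open attempts: the first caller
        # performs the open, later callers block here until it is done.
        with self._lock:
            if self.is_open:
                return
            time.sleep(0.01)  # simulate a slow open
            self.open_calls += 1
            self.is_open = True


class Sink:
    def __init__(self, name, stream_map, map_lock):
        self.name = name
        self.stream_map = stream_map
        self.map_lock = map_lock
        self.saw_all_open = False

    def incremental_open(self, backends):
        streams = []
        for be in backends:
            with self.map_lock:
                # Reuse the stream if another sink already registered it.
                stream = self.stream_map.setdefault(be, LoadStream(be))
            streams.append(stream)
        # Every sink opens the full set, not only the streams it created.
        for stream in streams:
            stream.open_once()
        # Writing would start only after this point.
        self.saw_all_open = all(s.is_open for s in streams)


def run_demo():
    stream_map, map_lock = {}, threading.Lock()
    backends = ["a", "b", "c", "d"]
    sinks = [Sink(n, stream_map, map_lock) for n in ("sink1", "sink2")]
    threads = [threading.Thread(target=s.incremental_open, args=(backends,))
               for s in sinks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sinks, stream_map
```

Calling `run_demo()` runs both sinks concurrently. Each stream's `open_calls` counter shows whether the open was duplicated, and each sink's `saw_all_open` flag shows whether every stream was open before that sink would begin writing.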

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@github-actions github-actions bot requested a review from dataroaring as a code owner August 29, 2025 07:35
@Thearas
Contributor

Thearas commented Aug 29, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Aug 29, 2025
@Thearas
Contributor

Thearas commented Aug 29, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 39668 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3f970cf5c65e3b059ac811a4a79f1bfd218c499a, data reload: false

------ Round 1 ----------------------------------
q1	17900	6889	6629	6629
q2	2044	179	164	164
q3	10651	1099	1201	1099
q4	10417	702	735	702
q5	7719	2803	2770	2770
q6	211	129	129	129
q7	980	627	604	604
q8	9350	1975	1960	1960
q9	6635	6408	6442	6408
q10	7048	2285	2317	2285
q11	465	255	264	255
q12	403	211	208	208
q13	17783	2993	2969	2969
q14	234	212	207	207
q15	514	469	453	453
q16	481	380	382	380
q17	966	587	575	575
q18	7259	6718	6653	6653
q19	1406	1075	999	999
q20	489	204	204	204
q21	4048	3018	3149	3018
q22	1103	997	1020	997
Total cold run time: 108106 ms
Total hot run time: 39668 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6689	6632	6544	6544
q2	336	233	230	230
q3	2877	2920	2924	2920
q4	2023	1901	1810	1810
q5	5731	5752	5740	5740
q6	211	128	125	125
q7	2206	1785	1818	1785
q8	3348	3507	3585	3507
q9	8842	9028	8950	8950
q10	3615	3542	3592	3542
q11	592	511	500	500
q12	852	624	609	609
q13	7991	3172	3181	3172
q14	314	285	276	276
q15	504	469	459	459
q16	499	442	451	442
q17	1870	1651	1610	1610
q18	8302	7717	7715	7715
q19	1654	1556	1573	1556
q20	2124	1897	1888	1888
q21	5152	5129	5042	5042
q22	1149	1069	1020	1020
Total cold run time: 66881 ms
Total hot run time: 59442 ms

@doris-robot

TPC-DS: Total hot run time: 193244 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3f970cf5c65e3b059ac811a4a79f1bfd218c499a, data reload: false

query1	947	373	374	373
query2	6314	1867	1872	1867
query3	8689	207	206	206
query4	33924	23500	23586	23500
query5	4079	459	455	455
query6	291	177	199	177
query7	4206	311	336	311
query8	287	215	220	215
query9	9347	2606	2596	2596
query10	477	264	261	261
query11	18104	15718	15272	15272
query12	155	105	101	101
query13	1555	444	424	424
query14	9138	6676	6940	6676
query15	246	172	176	172
query16	7894	499	515	499
query17	1578	630	607	607
query18	2130	323	317	317
query19	221	176	170	170
query20	120	117	115	115
query21	200	108	110	108
query22	4622	4343	4351	4343
query23	35232	34686	34716	34686
query24	11856	2953	3017	2953
query25	659	431	439	431
query26	741	172	173	172
query27	2620	366	371	366
query28	6622	2161	2162	2161
query29	789	462	452	452
query30	264	156	156	156
query31	1085	814	825	814
query32	105	56	60	56
query33	776	314	332	314
query34	975	528	548	528
query35	904	727	740	727
query36	1090	967	945	945
query37	119	66	71	66
query38	4104	3921	3973	3921
query39	1519	1456	1521	1456
query40	210	97	97	97
query41	50	49	48	48
query42	113	112	103	103
query43	530	490	496	490
query44	1322	836	832	832
query45	191	169	173	169
query46	1188	783	758	758
query47	1980	1927	1944	1927
query48	474	379	391	379
query49	978	398	407	398
query50	854	443	435	435
query51	7390	7271	7315	7271
query52	103	93	90	90
query53	267	185	190	185
query54	1060	479	489	479
query55	79	79	87	79
query56	281	270	249	249
query57	1314	1202	1178	1178
query58	217	217	234	217
query59	3160	3071	3025	3025
query60	297	253	262	253
query61	107	112	106	106
query62	855	701	689	689
query63	228	195	195	195
query64	4065	676	647	647
query65	3361	3292	3293	3292
query66	762	302	297	297
query67	16332	15784	15558	15558
query68	4379	586	575	575
query69	428	255	265	255
query70	1113	1099	1142	1099
query71	357	265	257	257
query72	6306	4103	4085	4085
query73	755	347	353	347
query74	10155	9217	9226	9217
query75	3373	2631	2655	2631
query76	2820	1014	1067	1014
query77	396	287	283	283
query78	10621	9639	9726	9639
query79	2531	598	643	598
query80	1095	422	462	422
query81	545	219	217	217
query82	610	94	87	87
query83	228	142	138	138
query84	239	84	86	84
query85	1300	311	291	291
query86	456	288	298	288
query87	4390	4272	4237	4237
query88	4292	2408	2401	2401
query89	404	292	290	290
query90	1968	188	189	188
query91	181	150	150	150
query92	59	50	50	50
query93	2489	563	562	562
query94	792	297	263	263
query95	365	267	257	257
query96	632	288	282	282
query97	3307	3127	3151	3127
query98	233	206	193	193
query99	1517	1302	1305	1302
Total cold run time: 300088 ms
Total hot run time: 193244 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 42.08% (11242/26713)
Line Coverage 32.60% (96228/295141)
Region Coverage 30.54% (55255/180917)
Branch Coverage 26.86% (27344/101804)

@doris-robot

ClickBench: Total hot run time: 29.55 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3f970cf5c65e3b059ac811a4a79f1bfd218c499a, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.03	0.03
query3	0.22	0.07	0.07
query4	1.63	0.10	0.10
query5	0.55	0.53	0.52
query6	1.14	0.73	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.02
query9	0.58	0.50	0.49
query10	0.56	0.55	0.55
query11	0.15	0.11	0.10
query12	0.14	0.11	0.10
query13	0.61	0.60	0.59
query14	0.78	0.81	0.80
query15	0.85	0.83	0.82
query16	0.41	0.37	0.39
query17	1.05	1.05	1.05
query18	0.24	0.23	0.23
query19	1.94	1.86	1.87
query20	0.02	0.01	0.01
query21	15.38	0.58	0.58
query22	2.31	2.54	1.82
query23	17.01	0.96	0.82
query24	3.69	0.57	0.13
query25	0.17	0.10	0.04
query26	0.30	0.14	0.14
query27	0.07	0.05	0.05
query28	11.49	0.52	0.51
query29	12.65	3.25	3.27
query30	0.25	0.07	0.06
query31	2.85	0.39	0.38
query32	3.25	0.47	0.46
query33	2.98	3.04	3.01
query34	17.18	4.62	4.48
query35	4.54	4.60	4.55
query36	0.69	0.48	0.49
query37	0.09	0.06	0.06
query38	0.05	0.03	0.04
query39	0.03	0.02	0.02
query40	0.15	0.13	0.13
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.02	0.02
Total cold run time: 106.31 s
Total hot run time: 29.55 s

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 75.17% (19700/26206)
Line Coverage 68.44% (201080/293826)
Region Coverage 66.51% (120394/181021)
Branch Coverage 59.85% (61123/102128)

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 6f02bc9 into branch-3.0 Sep 5, 2025
23 of 26 checks passed
@github-actions github-actions bot deleted the auto-pick-55092-branch-3.0 branch September 5, 2025 01:43