Skip to content

Conversation

@bobhan1
Copy link
Contributor

@bobhan1 bobhan1 commented Sep 3, 2025

What problem does this PR solve?

Currently, MetaServiceImpl::get_rowset use calc_sync_versions to eliminate unnecessary version ranges when BE sync rowset metas. One of the optimizations is as the following:

std::vector<std::pair<int64_t, int64_t>> calc_sync_versions(int64_t req_bc_cnt, int64_t bc_cnt,
                                                            int64_t req_cc_cnt, int64_t cc_cnt,
                                                            int64_t req_cp, int64_t cp,
                                                            int64_t req_start, int64_t req_end) {
    // ...
    if (req_cc_cnt < cc_cnt) {
        Version cc_version;
        if (req_cp < cp && req_cc_cnt + 1 == cc_cnt) {
            // * only one CC happened and CP changed
            // BE  [=][=][=][=][=====][=][=]
            //                  ^~~~~ req_cp
            // MS  [=][=][=][=][xxxxxxxxxxxxxx][=======][=][=]
            //                                  ^~~~~~~ ms_cp
            //                  ^____________^ related_versions: [req_cp, ms_cp - 1]
            //
            cc_version = {req_cp, cp - 1};
        } else {
    // ...
}

This optimization replies on the assumption that only cumulative compaction will change the cumulative point. However, full compaction can also change the cumulative point, which breaks the above replied assumption. This will cause data correctness problem in multi-cluster environment because it will make the tablet failed to sync some rowset metas forever.

A data correctness problem has been observed in the following situaitions:

  1. For a certain tablet, base_compaction_cnt=14, cumulative_compaction_cnt=804, cumu_point=7458.
    On node A of the write cluster (cluster 0), a full compaction of [2-7464] and a cumulative compaction of [7465-7486] were performed. The stats then became base_compaction_cnt=15, cumulative_compaction_cnt=805, cumu_point=7465.
  2. On node B of the read cluster (cluster 1), during sync_rowset, we have:
    req_base_compaction_cnt=14, base_compaction_cnt=15,
    req_cumulative_compaction_cnt=804, cumulative_compaction_cnt=805,
    req_cp=7458, cp=7465,
    req_start=7487, req_end=int_max.
  3. calc_sync_version computes that the rowsets to be pulled are [0-7464] and [7487-int_max], but it misses the rowset [7465-7486] produced by cumulative compaction.
  4. Moreover, since the max_version of the tablet on cluster 1 node B has been updated, subsequent sync_rowset operations will also not pull the rowset [7465-7486].
  5. This causes duplicate keys problem on MOW table because new rowset will generate delete bitmap marks on [7465-7486].

This PR forbids the above optimization when full compaction cnt is changed.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Sep 3, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 changed the title [Fix](cloud) calc_sync_version should consider full compaction [Fix](cloud) calc_sync_versions should consider full compaction Sep 3, 2025
@bobhan1 bobhan1 force-pushed the fix-calc_sync_versions-by-full-comp branch 4 times, most recently from c0d4807 to a89fabd Compare September 3, 2025 11:14
@bobhan1 bobhan1 force-pushed the fix-calc_sync_versions-by-full-comp branch from a89fabd to f840a83 Compare September 3, 2025 11:24
@bobhan1 bobhan1 marked this pull request as ready for review September 3, 2025 11:24
@bobhan1
Copy link
Contributor Author

bobhan1 commented Sep 3, 2025

run buildall

@bobhan1
Copy link
Contributor Author

bobhan1 commented Sep 3, 2025

run buildall

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@doris-robot
Copy link

TPC-H: Total hot run time: 34425 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7a478eeea431660b8653c60198389aa90842e194, data reload: false

------ Round 1 ----------------------------------
q1	17032	5248	5127	5127
q2	1988	358	215	215
q3	10015	1339	747	747
q4	10229	1015	533	533
q5	7598	2491	2374	2374
q6	185	168	140	140
q7	941	770	655	655
q8	9343	1356	1212	1212
q9	7022	5146	5224	5146
q10	6948	2406	1971	1971
q11	506	303	277	277
q12	359	362	242	242
q13	17797	3719	3054	3054
q14	232	245	223	223
q15	582	510	490	490
q16	424	430	381	381
q17	621	896	371	371
q18	7747	7071	7237	7071
q19	1099	985	597	597
q20	355	343	235	235
q21	4025	3276	2374	2374
q22	1054	1066	990	990
Total cold run time: 106102 ms
Total hot run time: 34425 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5223	5176	5198	5176
q2	250	333	242	242
q3	2155	2720	2309	2309
q4	1381	1858	1342	1342
q5	4244	4467	4655	4467
q6	227	168	129	129
q7	2038	1965	1891	1891
q8	2784	2678	2590	2590
q9	7481	7321	7269	7269
q10	3219	3333	2907	2907
q11	571	516	501	501
q12	679	765	629	629
q13	3587	3936	3424	3424
q14	311	310	285	285
q15	536	492	481	481
q16	554	542	451	451
q17	1169	1519	1397	1397
q18	7879	7725	7660	7660
q19	866	917	1005	917
q20	2224	2046	1899	1899
q21	5007	4395	4398	4395
q22	1111	1053	1021	1021
Total cold run time: 53496 ms
Total hot run time: 51382 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 187810 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7a478eeea431660b8653c60198389aa90842e194, data reload: false

query1	1051	484	419	419
query2	6557	1752	1781	1752
query3	6751	226	225	225
query4	25825	23216	23038	23038
query5	4375	665	510	510
query6	355	233	233	233
query7	4651	522	307	307
query8	300	251	251	251
query9	8654	2951	2906	2906
query10	479	371	341	341
query11	16041	15008	14807	14807
query12	172	116	124	116
query13	1667	577	416	416
query14	8755	6015	5949	5949
query15	216	211	174	174
query16	7188	699	537	537
query17	1270	773	666	666
query18	2025	485	334	334
query19	208	197	180	180
query20	136	126	124	124
query21	207	133	113	113
query22	4074	4337	4063	4063
query23	34064	33128	33350	33128
query24	8113	2418	2467	2418
query25	579	526	467	467
query26	1270	300	172	172
query27	2714	555	380	380
query28	4322	2295	2279	2279
query29	821	620	538	538
query30	298	230	195	195
query31	899	827	733	733
query32	91	84	82	82
query33	576	406	357	357
query34	818	863	545	545
query35	823	851	769	769
query36	972	1063	917	917
query37	126	115	90	90
query38	4141	4159	4043	4043
query39	1489	1457	1435	1435
query40	229	138	139	138
query41	66	63	67	63
query42	135	131	125	125
query43	521	526	475	475
query44	1402	889	871	871
query45	184	183	173	173
query46	871	1029	672	672
query47	1778	1803	1747	1747
query48	392	424	316	316
query49	753	548	453	453
query50	691	687	419	419
query51	4175	4184	4097	4097
query52	126	130	109	109
query53	254	286	209	209
query54	627	627	558	558
query55	99	90	91	90
query56	345	342	337	337
query57	1195	1226	1124	1124
query58	315	300	307	300
query59	2584	2745	2615	2615
query60	379	387	369	369
query61	197	192	194	192
query62	816	734	687	687
query63	250	212	207	207
query64	4615	1257	924	924
query65	4313	4195	4242	4195
query66	1173	459	350	350
query67	15485	15162	15337	15162
query68	9039	950	583	583
query69	516	404	301	301
query70	1241	1179	1188	1179
query71	565	365	327	327
query72	5637	5030	5206	5030
query73	774	678	369	369
query74	9071	9067	8799	8799
query75	4308	3150	2655	2655
query76	3818	1177	779	779
query77	812	442	332	332
query78	9553	9850	8947	8947
query79	2078	828	619	619
query80	677	623	516	516
query81	484	269	298	269
query82	312	145	120	120
query83	299	275	256	256
query84	302	119	93	93
query85	845	464	430	430
query86	348	336	314	314
query87	4268	4290	4247	4247
query88	2870	2261	2241	2241
query89	409	338	299	299
query90	2063	237	236	236
query91	157	185	139	139
query92	93	80	74	74
query93	1293	1009	649	649
query94	697	415	341	341
query95	461	351	330	330
query96	497	602	288	288
query97	2652	2734	2566	2566
query98	248	228	224	224
query99	1393	1393	1272	1272
Total cold run time: 274823 ms
Total hot run time: 187810 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 33.71 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7a478eeea431660b8653c60198389aa90842e194, data reload: false

query1	0.06	0.05	0.05
query2	0.11	0.06	0.06
query3	0.30	0.08	0.07
query4	1.60	0.09	0.10
query5	0.44	0.43	0.42
query6	1.18	0.66	0.66
query7	0.03	0.03	0.02
query8	0.07	0.05	0.05
query9	0.66	0.53	0.55
query10	0.60	0.60	0.60
query11	0.26	0.12	0.12
query12	0.25	0.14	0.13
query13	0.66	0.65	0.63
query14	0.82	0.85	0.84
query15	0.96	0.88	0.89
query16	0.40	0.40	0.39
query17	1.11	1.12	1.07
query18	0.23	0.22	0.23
query19	1.94	1.85	1.87
query20	0.01	0.01	0.02
query21	15.41	1.01	0.71
query22	0.93	1.16	0.90
query23	14.70	1.56	0.82
query24	5.26	0.62	0.35
query25	0.17	0.10	0.10
query26	0.56	0.22	0.17
query27	0.10	0.09	0.09
query28	11.07	1.20	0.57
query29	12.56	4.01	3.36
query30	3.12	3.07	2.98
query31	2.83	0.63	0.44
query32	3.24	0.60	0.50
query33	3.12	3.23	3.15
query34	16.62	5.50	4.77
query35	4.88	4.85	4.87
query36	0.66	0.54	0.52
query37	0.22	0.20	0.18
query38	0.19	0.17	0.16
query39	0.05	0.05	0.05
query40	0.21	0.19	0.18
query41	0.11	0.07	0.06
query42	0.06	0.06	0.06
query43	0.06	0.05	0.05
Total cold run time: 107.82 s
Total hot run time: 33.71 s

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 42.31% (11/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.83% (17204/33191)
Line Coverage 37.27% (157136/421627)
Region Coverage 31.92% (120012/376000)
Branch Coverage 33.29% (52668/158219)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (26/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.64% (23024/32592)
Line Coverage 56.99% (240134/421345)
Region Coverage 52.41% (199848/381351)
Branch Coverage 54.06% (86013/159103)

@bobhan1
Copy link
Contributor Author

bobhan1 commented Sep 4, 2025

run buildall

@bobhan1 bobhan1 requested a review from gavinchou September 4, 2025 07:45
@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@doris-robot
Copy link

TPC-H: Total hot run time: 34423 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit da04f6340501fc9ddaf5e6cad1f574b03435c1c7, data reload: false

------ Round 1 ----------------------------------
q1	17011	5349	5132	5132
q2	2015	368	207	207
q3	9967	1285	722	722
q4	10222	1037	535	535
q5	7578	2522	2317	2317
q6	192	174	141	141
q7	958	799	634	634
q8	9348	1341	1154	1154
q9	7066	5099	5174	5099
q10	6966	2376	1978	1978
q11	503	308	292	292
q12	370	358	248	248
q13	17804	3701	3200	3200
q14	250	238	221	221
q15	598	517	493	493
q16	437	447	395	395
q17	617	874	365	365
q18	7604	7043	7184	7043
q19	1106	959	579	579
q20	367	342	251	251
q21	4122	3249	2438	2438
q22	1093	1048	979	979
Total cold run time: 106194 ms
Total hot run time: 34423 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5332	5129	5136	5129
q2	254	330	237	237
q3	2204	2721	2296	2296
q4	1341	1807	1297	1297
q5	4214	4432	4531	4432
q6	227	172	133	133
q7	2078	1931	1811	1811
q8	2702	2547	2610	2547
q9	7457	7358	7372	7358
q10	3208	3366	2921	2921
q11	593	525	493	493
q12	700	787	633	633
q13	3584	3897	3311	3311
q14	303	317	292	292
q15	535	488	489	488
q16	469	502	451	451
q17	1214	1619	1383	1383
q18	7720	7689	7766	7689
q19	828	807	816	807
q20	1910	2018	1820	1820
q21	4747	4355	4320	4320
q22	1095	1083	1029	1029
Total cold run time: 52715 ms
Total hot run time: 50877 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 187117 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit da04f6340501fc9ddaf5e6cad1f574b03435c1c7, data reload: false

query1	1031	420	403	403
query2	6576	1807	1769	1769
query3	6785	236	235	235
query4	26002	23501	23100	23100
query5	4427	634	526	526
query6	336	244	225	225
query7	4664	502	312	312
query8	312	248	252	248
query9	8645	2928	2897	2897
query10	502	357	306	306
query11	15987	14954	14825	14825
query12	202	122	116	116
query13	1669	556	424	424
query14	9347	5805	5822	5805
query15	209	190	215	190
query16	7402	638	495	495
query17	1223	747	653	653
query18	2007	430	345	345
query19	211	194	171	171
query20	132	123	120	120
query21	225	133	124	124
query22	4179	4128	4072	4072
query23	34053	32973	32916	32916
query24	8158	2365	2395	2365
query25	586	556	517	517
query26	1246	279	178	178
query27	2720	507	363	363
query28	4298	2266	2241	2241
query29	796	613	502	502
query30	290	249	189	189
query31	914	824	743	743
query32	95	77	86	77
query33	589	389	359	359
query34	798	864	534	534
query35	816	823	782	782
query36	938	1025	908	908
query37	125	111	103	103
query38	4016	4032	3945	3945
query39	1487	1432	1443	1432
query40	234	140	132	132
query41	64	62	66	62
query42	133	122	121	121
query43	532	503	489	489
query44	1352	863	853	853
query45	186	187	177	177
query46	865	1033	653	653
query47	1811	1859	1754	1754
query48	389	427	327	327
query49	775	545	439	439
query50	657	692	404	404
query51	4101	4187	4146	4146
query52	122	118	103	103
query53	251	276	206	206
query54	613	616	546	546
query55	95	92	93	92
query56	340	342	329	329
query57	1233	1214	1151	1151
query58	294	295	280	280
query59	2596	2695	2651	2651
query60	363	367	355	355
query61	165	162	162	162
query62	824	718	725	718
query63	243	198	197	197
query64	4542	1303	935	935
query65	4274	4245	4215	4215
query66	1169	469	370	370
query67	15400	15316	15166	15166
query68	8087	933	595	595
query69	506	355	317	317
query70	1189	1193	1180	1180
query71	577	431	333	333
query72	5854	4992	5044	4992
query73	685	633	360	360
query74	9146	9206	8839	8839
query75	3858	3269	2619	2619
query76	3711	1152	724	724
query77	808	449	340	340
query78	9586	9844	8795	8795
query79	2632	862	603	603
query80	714	584	502	502
query81	497	267	227	227
query82	413	143	119	119
query83	302	268	250	250
query84	310	114	96	96
query85	886	515	431	431
query86	355	321	315	315
query87	4354	4306	4232	4232
query88	3338	2234	2215	2215
query89	399	336	313	313
query90	1954	232	230	230
query91	170	165	136	136
query92	92	78	78	78
query93	1561	1009	659	659
query94	697	399	331	331
query95	412	335	325	325
query96	491	575	279	279
query97	2668	2746	2615	2615
query98	258	219	224	219
query99	1432	1439	1300	1300
Total cold run time: 275563 ms
Total hot run time: 187117 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 33.53 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit da04f6340501fc9ddaf5e6cad1f574b03435c1c7, data reload: false

query1	0.05	0.05	0.04
query2	0.11	0.06	0.06
query3	0.30	0.07	0.08
query4	1.60	0.10	0.09
query5	0.42	0.42	0.42
query6	1.17	0.64	0.64
query7	0.03	0.02	0.02
query8	0.07	0.05	0.05
query9	0.64	0.52	0.52
query10	0.60	0.59	0.59
query11	0.25	0.15	0.12
query12	0.26	0.14	0.14
query13	0.65	0.65	0.64
query14	0.83	0.84	0.84
query15	0.97	0.88	0.87
query16	0.39	0.39	0.39
query17	1.07	1.05	1.09
query18	0.23	0.22	0.23
query19	1.95	1.85	1.81
query20	0.02	0.01	0.02
query21	15.44	1.00	0.71
query22	0.92	1.14	0.86
query23	14.70	1.49	0.77
query24	5.22	0.63	0.36
query25	0.16	0.10	0.09
query26	0.56	0.21	0.18
query27	0.09	0.09	0.09
query28	11.21	1.16	0.57
query29	12.55	3.96	3.33
query30	3.17	3.04	3.06
query31	2.81	0.62	0.43
query32	3.26	0.59	0.51
query33	3.05	3.08	3.16
query34	16.59	5.50	4.78
query35	4.86	4.90	4.92
query36	0.64	0.52	0.51
query37	0.22	0.19	0.18
query38	0.19	0.16	0.16
query39	0.06	0.05	0.05
query40	0.21	0.18	0.17
query41	0.11	0.06	0.06
query42	0.08	0.06	0.06
query43	0.06	0.05	0.06
Total cold run time: 107.77 s
Total hot run time: 33.53 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 42.31% (11/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.91% (17246/33226)
Line Coverage 37.31% (157417/421909)
Region Coverage 31.97% (120232/376135)
Branch Coverage 33.31% (52718/158262)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (26/26) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.69% (23062/32626)
Line Coverage 57.02% (240416/421632)
Region Coverage 52.39% (199870/381522)
Branch Coverage 54.06% (86040/159162)

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Sep 4, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2025

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2025

PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit d98c82d into apache:master Sep 4, 2025
26 of 28 checks passed
bobhan1 added a commit to bobhan1/doris that referenced this pull request Sep 5, 2025
…ache#55630)

Currently, `MetaServiceImpl::get_rowset` use `calc_sync_versions` to
eliminate unnecessary version ranges when BE sync rowset metas. One of
the optimizations is as the following:
```cpp
std::vector<std::pair<int64_t, int64_t>> calc_sync_versions(int64_t req_bc_cnt, int64_t bc_cnt,
                                                            int64_t req_cc_cnt, int64_t cc_cnt,
                                                            int64_t req_cp, int64_t cp,
                                                            int64_t req_start, int64_t req_end) {
    // ...
    if (req_cc_cnt < cc_cnt) {
        Version cc_version;
        if (req_cp < cp && req_cc_cnt + 1 == cc_cnt) {
            // * only one CC happened and CP changed
            // BE  [=][=][=][=][=====][=][=]
            //                  ^~~~~ req_cp
            // MS  [=][=][=][=][xxxxxxxxxxxxxx][=======][=][=]
            //                                  ^~~~~~~ ms_cp
            //                  ^____________^ related_versions: [req_cp, ms_cp - 1]
            //
            cc_version = {req_cp, cp - 1};
        } else {
    // ...
}
```
This optimization replies on the assumption that only cumulative
compaction will change the cumulative point. However, full compaction
can also change the cumulative point, which breaks the above replied
assumption. This will cause data correctness problem in multi-cluster
environment because it will make the tablet failed to sync some rowset
metas forever.

A data correctness problem has been observed in the following
situaitions:

1. For a certain tablet, base_compaction_cnt=14,
cumulative_compaction_cnt=804, cumu_point=7458.
On node A of the write cluster (cluster 0), a full compaction of
[2-7464] and a cumulative compaction of [7465-7486] were performed. The
stats then became base_compaction_cnt=15, cumulative_compaction_cnt=805,
cumu_point=7465.
2. On node B of the read cluster (cluster 1), during sync_rowset, we
have:
req_base_compaction_cnt=14, base_compaction_cnt=15,
req_cumulative_compaction_cnt=804, cumulative_compaction_cnt=805,
req_cp=7458, cp=7465,
req_start=7487, req_end=int_max.
3. calc_sync_version computes that the rowsets to be pulled are [0-7464]
and [7487-int_max], but it misses the rowset [7465-7486] produced by
cumulative compaction.
4. Moreover, since the max_version of the tablet on cluster 1 node B has
been updated, subsequent sync_rowset operations will also not pull the
rowset [7465-7486].
5. This causes duplicate keys problem on MOW table because new rowset
will generate delete bitmap marks on [7465-7486].

---
This PR forbids the above optimization when full compaction cnt is
changed.

None

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [x] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
bobhan1 added a commit to bobhan1/doris that referenced this pull request Sep 5, 2025
…ache#55630)

Currently, `MetaServiceImpl::get_rowset` use `calc_sync_versions` to
eliminate unnecessary version ranges when BE sync rowset metas. One of
the optimizations is as the following:
```cpp
std::vector<std::pair<int64_t, int64_t>> calc_sync_versions(int64_t req_bc_cnt, int64_t bc_cnt,
                                                            int64_t req_cc_cnt, int64_t cc_cnt,
                                                            int64_t req_cp, int64_t cp,
                                                            int64_t req_start, int64_t req_end) {
    // ...
    if (req_cc_cnt < cc_cnt) {
        Version cc_version;
        if (req_cp < cp && req_cc_cnt + 1 == cc_cnt) {
            // * only one CC happened and CP changed
            // BE  [=][=][=][=][=====][=][=]
            //                  ^~~~~ req_cp
            // MS  [=][=][=][=][xxxxxxxxxxxxxx][=======][=][=]
            //                                  ^~~~~~~ ms_cp
            //                  ^____________^ related_versions: [req_cp, ms_cp - 1]
            //
            cc_version = {req_cp, cp - 1};
        } else {
    // ...
}
```
This optimization replies on the assumption that only cumulative
compaction will change the cumulative point. However, full compaction
can also change the cumulative point, which breaks the above replied
assumption. This will cause data correctness problem in multi-cluster
environment because it will make the tablet failed to sync some rowset
metas forever.

A data correctness problem has been observed in the following
situaitions:

1. For a certain tablet, base_compaction_cnt=14,
cumulative_compaction_cnt=804, cumu_point=7458.
On node A of the write cluster (cluster 0), a full compaction of
[2-7464] and a cumulative compaction of [7465-7486] were performed. The
stats then became base_compaction_cnt=15, cumulative_compaction_cnt=805,
cumu_point=7465.
2. On node B of the read cluster (cluster 1), during sync_rowset, we
have:
req_base_compaction_cnt=14, base_compaction_cnt=15,
req_cumulative_compaction_cnt=804, cumulative_compaction_cnt=805,
req_cp=7458, cp=7465,
req_start=7487, req_end=int_max.
3. calc_sync_version computes that the rowsets to be pulled are [0-7464]
and [7487-int_max], but it misses the rowset [7465-7486] produced by
cumulative compaction.
4. Moreover, since the max_version of the tablet on cluster 1 node B has
been updated, subsequent sync_rowset operations will also not pull the
rowset [7465-7486].
5. This causes duplicate keys problem on MOW table because new rowset
will generate delete bitmap marks on [7465-7486].

---
This PR forbids the above optimization when full compaction cnt is
changed.

None

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [x] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
dataroaring pushed a commit that referenced this pull request Sep 5, 2025
wenzhenghu pushed a commit to wenzhenghu/doris that referenced this pull request Sep 8, 2025
…ache#55630)

### What problem does this PR solve?

Currently, `MetaServiceImpl::get_rowset` use `calc_sync_versions` to
eliminate unnecessary version ranges when BE sync rowset metas. One of
the optimizations is as the following:
```cpp
std::vector<std::pair<int64_t, int64_t>> calc_sync_versions(int64_t req_bc_cnt, int64_t bc_cnt,
                                                            int64_t req_cc_cnt, int64_t cc_cnt,
                                                            int64_t req_cp, int64_t cp,
                                                            int64_t req_start, int64_t req_end) {
    // ...
    if (req_cc_cnt < cc_cnt) {
        Version cc_version;
        if (req_cp < cp && req_cc_cnt + 1 == cc_cnt) {
            // * only one CC happened and CP changed
            // BE  [=][=][=][=][=====][=][=]
            //                  ^~~~~ req_cp
            // MS  [=][=][=][=][xxxxxxxxxxxxxx][=======][=][=]
            //                                  ^~~~~~~ ms_cp
            //                  ^____________^ related_versions: [req_cp, ms_cp - 1]
            //
            cc_version = {req_cp, cp - 1};
        } else {
    // ...
}
```
This optimization replies on the assumption that only cumulative
compaction will change the cumulative point. However, full compaction
can also change the cumulative point, which breaks the above replied
assumption. This will cause data correctness problem in multi-cluster
environment because it will make the tablet failed to sync some rowset
metas forever.

A data correctness problem has been observed in the following
situaitions:

1. For a certain tablet, base_compaction_cnt=14,
cumulative_compaction_cnt=804, cumu_point=7458.
On node A of the write cluster (cluster 0), a full compaction of
[2-7464] and a cumulative compaction of [7465-7486] were performed. The
stats then became base_compaction_cnt=15, cumulative_compaction_cnt=805,
cumu_point=7465.
2. On node B of the read cluster (cluster 1), during sync_rowset, we
have:
req_base_compaction_cnt=14, base_compaction_cnt=15,
req_cumulative_compaction_cnt=804, cumulative_compaction_cnt=805,
req_cp=7458, cp=7465,
req_start=7487, req_end=int_max.
3. calc_sync_version computes that the rowsets to be pulled are [0-7464]
and [7487-int_max], but it misses the rowset [7465-7486] produced by
cumulative compaction.
4. Moreover, since the max_version of the tablet on cluster 1 node B has
been updated, subsequent sync_rowset operations will also not pull the
rowset [7465-7486].
5. This causes duplicate keys problem on MOW table because new rowset
will generate delete bitmap marks on [7465-7486].

---
This PR forbids the above optimization when full compaction cnt is
changed.

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [x] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@morrySnow morrySnow mentioned this pull request Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/3.0.9-merged dev/3.1.1-merged p0_w reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants