@liaoxin01 liaoxin01 commented Sep 2, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #54395

Problem Summary:

The _rs_metas and _rs_version_map information in the tmp_tablet meta is inconsistent, so an attempt to fetch a rowset by version fails and returns a null pointer, which leads to the crash shown below. The tmp_tablet meta is copied from the new tablet, and its rowset information is not needed because the real rowset data is obtained later through sync rowset. The sync rowset operation fails to remove the old rowsets, which produces the inconsistency. The fix is to clean up the obsolete rowsets in the tmp_tablet meta first.
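
As a rough illustration of the failure mode described above, the sketch below models a tablet meta that keeps a list of rowset metas plus a version-indexed map. It is not the Doris implementation: ToyRowsetMeta, ToyTabletMeta, add, find_by_version, clear_rowsets, count_missing and make_inconsistent_copy are all hypothetical names, and which of the two containers actually went stale in the real bug is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the Doris types.
using Version = std::pair<int64_t, int64_t>;  // [start, end]

struct ToyRowsetMeta {
    Version version;
    int64_t num_rows;
};
using ToyRowsetMetaPtr = std::shared_ptr<ToyRowsetMeta>;

struct ToyTabletMeta {
    std::vector<ToyRowsetMetaPtr> rs_metas;              // list of rowset metas
    std::map<Version, ToyRowsetMetaPtr> rs_version_map;  // version -> rowset meta

    // Keeps both containers in step when a rowset is added.
    void add(const ToyRowsetMetaPtr& rs) {
        assert(rs != nullptr);
        rs_metas.push_back(rs);
        rs_version_map[rs->version] = rs;
    }

    // Returns nullptr when the version is not indexed; callers must check.
    ToyRowsetMetaPtr find_by_version(const Version& v) const {
        auto it = rs_version_map.find(v);
        if (it == rs_version_map.end()) {
            return nullptr;
        }
        return it->second;
    }

    // The cleanup step: drop every copied rowset from both containers so a
    // later sync starts from a consistent (empty) state.
    void clear_rowsets() {
        rs_metas.clear();
        rs_version_map.clear();
    }
};

// Counts rowsets that are listed but cannot be found by version. Any such
// entry would yield a null pointer in code that trusts the lookup.
int64_t count_missing(const ToyTabletMeta& meta) {
    int64_t missing = 0;
    for (const auto& rs : meta.rs_metas) {
        if (meta.find_by_version(rs->version) == nullptr) {
            ++missing;
        }
    }
    return missing;
}

// Builds a copy whose map lost an entry that the list still holds
// (assumed shape of the inconsistency, chosen for the example).
ToyTabletMeta make_inconsistent_copy() {
    ToyTabletMeta meta;
    meta.add(std::make_shared<ToyRowsetMeta>(ToyRowsetMeta{{0, 1}, 0}));
    meta.add(std::make_shared<ToyRowsetMeta>(ToyRowsetMeta{{2, 2}, 10}));
    meta.rs_version_map.erase(Version{2, 2});  // partial removal
    return meta;
}
```

In this toy model, calling clear_rowsets() on the copied meta before re-adding the synced rowsets leaves no listed rowset without a map entry, which is the property the cleanup in this PR is meant to restore.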

*** SIGSEGV address not mapped to object (@0x38) received by PID 2824014 (TID 2824488 OR 0x7f59e8eff640) from PID 56; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F5B3AC5F520 in /lib/x86_64-linux-gnu/libc.so.6
4# doris::cloud::CloudMetaMgr::fill_version_holes(doris::CloudTablet*, long, std::unique_lock<std::shared_mutex>&) at /home/zcp/repo_center/doris_master/doris/be/src/cloud/cloud_meta_mgr.cpp:1650
5# doris::cloud::CloudMetaMgr::sync_tablet_rowsets_unlocked(doris::CloudTablet*, std::unique_lock<bthread::Mutex>&, doris::SyncOptions const&, doris::SyncRowsetStats*) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
6# doris::cloud::CloudMetaMgr::sync_tablet_rowsets(doris::CloudTablet*, doris::SyncOptions const&, doris::SyncRowsetStats*) at /home/zcp/repo_center/doris_master/doris/be/src/cloud/cloud_meta_mgr.cpp:477
7# doris::CloudSchemaChangeJob::_process_delete_bitmap(long, long, long) at /home/zcp/repo_center/doris_master/doris/be/src/cloud/cloud_schema_change_job.cpp:519
8# doris::CloudSchemaChangeJob::_convert_historical_rowsets(doris::SchemaChangeParams const&, doris::cloud::TabletJobInfoPB&) at /home/zcp/repo_center/doris_master/doris/be/src/cloud/cloud_schema_change_job.cpp:424
9# doris::CloudSchemaChangeJob::process_alter_tablet(doris::TAlterTabletReqV2 const&) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
10# doris::alter_cloud_tablet_callback(doris::CloudStorageEngine&, doris::TAgentTaskRequest const&) at /home/zcp/repo_center/doris_master/doris/be/src/agent/task_worker_pool.cpp:2176
11# std::_Function_handler<void (), doris::TaskWorkerPool::submit_task(doris::TAgentTaskRequest const&)::$_0::operator()<doris::TAgentTaskRequest const&>(doris::TAgentTaskRequest const&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
12# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_master/doris/be/src/util/threadpool.cpp:621
13# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_master/doris/be/src/util/thread.cpp:461

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01

run buildall

@doris-robot

TPC-H: Total hot run time: 33993 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5c2e1be317e831d7282aa4d440f39d01c7da185d, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17596	5246	5069	5069
q2	2020	323	209	209
q3	10215	1316	710	710
q4	10250	1096	542	542
q5	7606	2453	2361	2361
q6	194	169	136	136
q7	948	747	635	635
q8	9378	1351	1088	1088
q9	7165	5072	5164	5072
q10	6963	2403	1994	1994
q11	482	302	273	273
q12	373	368	225	225
q13	17786	3661	3101	3101
q14	243	243	225	225
q15	563	504	483	483
q16	433	437	402	402
q17	587	875	357	357
q18	7695	7200	6957	6957
q19	1520	956	582	582
q20	345	327	236	236
q21	3806	3187	2376	2376
q22	1067	1024	960	960
Total cold run time: 107235 ms
Total hot run time: 33993 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5269	5118	5096	5096
q2	254	326	234	234
q3	2179	2678	2327	2327
q4	1384	1828	1332	1332
q5	4210	4448	4571	4448
q6	219	172	135	135
q7	2044	2025	1856	1856
q8	2693	2663	2663	2663
q9	7406	7416	7326	7326
q10	3101	3387	2903	2903
q11	573	527	497	497
q12	724	824	602	602
q13	3603	4170	3205	3205
q14	279	308	279	279
q15	530	492	512	492
q16	463	506	451	451
q17	1187	1614	1341	1341
q18	8041	7738	7521	7521
q19	810	797	869	797
q20	2081	2054	1937	1937
q21	5059	4562	4400	4400
q22	1084	1048	996	996
Total cold run time: 53193 ms
Total hot run time: 50838 ms

@doris-robot

TPC-DS: Total hot run time: 186684 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5c2e1be317e831d7282aa4d440f39d01c7da185d, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1070	430	409	409
query2	6578	1782	1747	1747
query3	6757	224	219	219
query4	26548	23582	22856	22856
query5	4434	694	536	536
query6	327	248	236	236
query7	4676	515	301	301
query8	306	270	248	248
query9	8642	2930	2945	2930
query10	499	378	299	299
query11	16042	15036	14848	14848
query12	180	126	125	125
query13	1684	569	454	454
query14	9581	5838	5769	5769
query15	223	183	168	168
query16	7683	672	517	517
query17	1208	750	684	684
query18	2055	439	338	338
query19	210	199	174	174
query20	136	127	119	119
query21	244	129	113	113
query22	3991	4221	3962	3962
query23	34010	32809	32953	32809
query24	8094	2366	2445	2366
query25	594	516	447	447
query26	1249	292	167	167
query27	2715	513	356	356
query28	4367	2305	2297	2297
query29	781	637	482	482
query30	290	226	207	207
query31	953	815	719	719
query32	95	78	87	78
query33	575	387	359	359
query34	795	876	524	524
query35	820	842	744	744
query36	1012	1016	927	927
query37	127	113	91	91
query38	4075	4022	4042	4022
query39	1482	1425	1438	1425
query40	227	134	126	126
query41	63	62	65	62
query42	133	115	114	114
query43	514	526	480	480
query44	1360	858	856	856
query45	184	183	171	171
query46	875	1021	646	646
query47	1751	1799	1722	1722
query48	396	454	330	330
query49	768	514	413	413
query50	671	692	407	407
query51	4120	4178	4083	4083
query52	124	117	110	110
query53	254	275	204	204
query54	619	600	542	542
query55	99	91	95	91
query56	339	341	341	341
query57	1178	1211	1119	1119
query58	295	284	281	281
query59	2615	2776	2703	2703
query60	380	372	378	372
query61	193	191	192	191
query62	825	747	702	702
query63	239	201	200	200
query64	4625	1280	828	828
query65	4358	4234	4250	4234
query66	1108	446	350	350
query67	15622	15438	15144	15144
query68	8439	924	581	581
query69	491	331	294	294
query70	1291	1155	1144	1144
query71	574	349	325	325
query72	6056	5052	5182	5052
query73	774	691	370	370
query74	9275	8936	9079	8936
query75	3846	3103	2628	2628
query76	3701	1191	742	742
query77	798	405	323	323
query78	9570	9723	8927	8927
query79	2497	841	600	600
query80	715	587	519	519
query81	479	257	224	224
query82	466	145	110	110
query83	288	258	249	249
query84	296	117	104	104
query85	915	473	482	473
query86	339	329	302	302
query87	4277	4365	4239	4239
query88	2816	2194	2191	2191
query89	414	325	301	301
query90	1932	223	228	223
query91	189	167	133	133
query92	89	73	81	73
query93	1372	981	652	652
query94	691	441	316	316
query95	418	343	330	330
query96	488	604	287	287
query97	2618	2698	2614	2614
query98	250	219	219	219
query99	1439	1431	1309	1309
Total cold run time: 276846 ms
Total hot run time: 186684 ms

@doris-robot

ClickBench: Total hot run time: 33.04 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5c2e1be317e831d7282aa4d440f39d01c7da185d, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.04	0.04
query2	0.09	0.05	0.05
query3	0.26	0.08	0.08
query4	1.63	0.11	0.11
query5	0.43	0.43	0.42
query6	1.18	0.63	0.67
query7	0.04	0.03	0.03
query8	0.06	0.05	0.05
query9	0.60	0.53	0.52
query10	0.58	0.58	0.56
query11	0.17	0.11	0.11
query12	0.15	0.12	0.12
query13	0.63	0.62	0.62
query14	0.80	0.83	0.86
query15	0.87	0.85	0.86
query16	0.40	0.41	0.39
query17	1.07	1.06	1.09
query18	0.22	0.20	0.20
query19	1.91	1.85	1.84
query20	0.01	0.01	0.02
query21	15.43	0.95	0.59
query22	0.75	1.21	0.97
query23	14.69	1.42	0.64
query24	6.59	0.81	0.86
query25	0.51	0.19	0.11
query26	0.60	0.17	0.13
query27	0.07	0.06	0.05
query28	9.97	0.93	0.42
query29	12.55	3.87	3.22
query30	3.08	3.02	2.97
query31	2.83	0.59	0.39
query32	3.24	0.56	0.47
query33	3.12	3.10	3.23
query34	16.16	5.52	4.86
query35	4.87	4.90	4.91
query36	0.69	0.51	0.51
query37	0.10	0.07	0.08
query38	0.07	0.05	0.04
query39	0.04	0.03	0.03
query40	0.19	0.15	0.15
query41	0.08	0.03	0.04
query42	0.04	0.03	0.03
query43	0.06	0.03	0.04
Total cold run time: 106.89 s
Total hot run time: 33.04 s

@hello-stephen

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.72% (17147/33155)
Line Coverage 37.23% (156724/420955)
Region Coverage 31.88% (119588/375144)
Branch Coverage 33.24% (52536/158067)

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.61% (22999/32571)
Line Coverage 56.96% (239716/420827)
Region Coverage 52.45% (199646/380661)
Branch Coverage 54.12% (86070/159048)


@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Sep 3, 2025
github-actions bot commented Sep 3, 2025

PR approved by at least one committer and no changes requested.

github-actions bot commented Sep 3, 2025

PR approved by anyone and no changes requested.


@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 42861e6 into apache:master Sep 3, 2025
31 of 34 checks passed
uchenily pushed a commit to uchenily/doris that referenced this pull request Sep 5, 2025
…rowset_metadata (apache#55604)

liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Sep 12, 2025
…rowset_metadata (apache#55604)

@morrySnow morrySnow mentioned this pull request Sep 22, 2025

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.1-merged, reviewed

6 participants