@qidaye
Contributor

@qidaye qidaye commented Nov 1, 2024

What problem does this PR solve?

When all documents containing a term were deleted, we still recorded an entry for that term with a doc frequency of zero. These entries are redundant and inflate the index file; if many terms were deleted, the index file could be much larger than normal.
This PR stops writing entries for terms whose doc frequency is 0.
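
The idea can be illustrated with a minimal sketch. This is not the code changed in this PR: `TermDict`, `TermEntry`, and `compact` are hypothetical in-memory stand-ins, assumed here only to show a term being skipped once all of its documents are deleted.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the Doris or CLucene types.
using DocId = std::uint32_t;
using Postings = std::vector<DocId>;
using TermDict = std::map<std::string, Postings>;

struct TermEntry {
    std::string term;
    std::uint32_t doc_freq;
    Postings docs;
};

// Rewrites the term dictionary without the deleted documents.
// A term whose surviving doc frequency is 0 is skipped entirely
// instead of being written as an empty entry.
std::vector<TermEntry> compact(const TermDict& src, const std::set<DocId>& deleted) {
    std::vector<TermEntry> out;
    for (const auto& [term, docs] : src) {
        Postings live;
        for (DocId d : docs) {
            if (deleted.count(d) == 0) {
                live.push_back(d);
            }
        }
        if (live.empty()) {
            continue; // doc frequency 0: write nothing for this term
        }
        const auto doc_freq = static_cast<std::uint32_t>(live.size());
        out.push_back(TermEntry{term, doc_freq, std::move(live)});
    }
    return out;
}
```

With terms `apple -> {1, 2}`, `banana -> {2}`, `cherry -> {3}` and document 2 deleted, this sketch emits only `apple` (doc frequency 1) and `cherry`; `banana` produces no entry at all.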

Problem Summary:

Check List (For Committer)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.
  • Release note

    bugfix: Skip writing terms with a doc frequency of 0

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@qidaye
Contributor Author

qidaye commented Nov 1, 2024

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

std::string _curreent_dir;
};

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
Contributor

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
^
Additional context

be/test/olap/rowset/segment_v2/inverted_index/compaction/index_compaction_with_deleted_term.cpp:560: 107 lines including whitespace and comments (threshold 80)

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
^

@qidaye qidaye force-pushed the index_compaction_with_deleted_term branch from 8715517 to 9456193 on November 1, 2024 14:12
@qidaye
Contributor Author

qidaye commented Nov 1, 2024

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

std::string _curreent_dir;
};

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
Contributor

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
^
Additional context

be/test/olap/rowset/segment_v2/inverted_index/compaction/index_compaction_with_deleted_term.cpp:561: 107 lines including whitespace and comments (threshold 80)

TEST_F(IndexCompactionDeleteTest, delete_index_test) {
^

@doris-robot

TPC-H: Total hot run time: 41525 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9456193a1494a2ab26d5ec7d468eae51c69bb217, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17610	7684	7303	7303
q2	2053	179	163	163
q3	10557	1080	1149	1080
q4	10250	862	835	835
q5	7770	3073	3109	3073
q6	235	150	150	150
q7	1021	600	598	598
q8	9346	1926	2029	1926
q9	6605	6423	6495	6423
q10	7081	2422	2417	2417
q11	458	259	263	259
q12	411	220	205	205
q13	17766	2995	3014	2995
q14	244	222	209	209
q15	579	531	516	516
q16	627	591	587	587
q17	1001	572	553	553
q18	7419	6770	6758	6758
q19	1324	1007	1088	1007
q20	496	182	181	181
q21	4104	3427	3251	3251
q22	1149	1036	1041	1036
Total cold run time: 108106 ms
Total hot run time: 41525 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	7343	7300	7307	7300
q2	324	236	224	224
q3	3068	2957	2998	2957
q4	2111	1848	1795	1795
q5	5752	5804	5839	5804
q6	229	144	141	141
q7	2281	1836	1804	1804
q8	3411	3451	3544	3451
q9	8979	8945	8924	8924
q10	3613	3575	3566	3566
q11	608	521	509	509
q12	821	622	636	622
q13	11795	3203	3265	3203
q14	298	280	280	280
q15	573	518	518	518
q16	698	627	652	627
q17	1871	1625	1638	1625
q18	8411	7818	7736	7736
q19	1730	1623	1569	1569
q20	2090	1865	1884	1865
q21	5600	5571	5448	5448
q22	1220	1099	1077	1077
Total cold run time: 72826 ms
Total hot run time: 61045 ms

@doris-robot

TPC-DS: Total hot run time: 197939 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9456193a1494a2ab26d5ec7d468eae51c69bb217, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1211	897	876	876
query2	6237	2215	2176	2176
query3	10911	4018	4176	4018
query4	68220	29363	23752	23752
query5	4973	469	459	459
query6	419	193	173	173
query7	6363	300	296	296
query8	315	238	235	235
query9	9331	2753	2759	2753
query10	471	282	258	258
query11	17811	15853	16020	15853
query12	161	117	104	104
query13	1563	450	443	443
query14	11383	7289	7405	7289
query15	199	179	175	175
query16	7286	488	499	488
query17	1068	588	582	582
query18	1803	322	309	309
query19	205	168	158	158
query20	123	116	115	115
query21	218	107	107	107
query22	5119	4801	4790	4790
query23	35607	34038	34109	34038
query24	5947	2734	2795	2734
query25	520	407	397	397
query26	658	161	158	158
query27	1700	287	290	287
query28	4023	2442	2417	2417
query29	677	434	425	425
query30	235	155	158	155
query31	986	782	783	782
query32	67	59	58	58
query33	451	276	282	276
query34	907	514	506	506
query35	851	754	730	730
query36	1070	972	930	930
query37	116	75	77	75
query38	4355	4249	4226	4226
query39	1507	1455	1442	1442
query40	203	101	104	101
query41	50	47	46	46
query42	114	98	98	98
query43	549	508	515	508
query44	1187	820	822	820
query45	190	166	176	166
query46	1121	689	690	689
query47	1931	1855	1853	1853
query48	424	331	334	331
query49	743	393	416	393
query50	807	403	411	403
query51	7369	7140	7186	7140
query52	101	91	88	88
query53	257	184	187	184
query54	532	421	418	418
query55	74	79	74	74
query56	254	239	242	239
query57	1297	1184	1162	1162
query58	216	207	206	206
query59	3215	3129	2974	2974
query60	272	254	246	246
query61	108	101	107	101
query62	795	696	670	670
query63	218	193	195	193
query64	1366	633	597	597
query65	3311	3213	3176	3176
query66	705	354	309	309
query67	16006	15760	15863	15760
query68	4491	551	582	551
query69	419	252	259	252
query70	1215	1149	1149	1149
query71	366	243	251	243
query72	6223	4063	3960	3960
query73	756	362	363	362
query74	10387	9314	8995	8995
query75	3388	2675	2713	2675
query76	1793	1034	1038	1034
query77	474	278	274	274
query78	10570	9440	9332	9332
query79	1099	583	586	583
query80	768	421	452	421
query81	512	243	239	239
query82	985	120	122	120
query83	220	146	144	144
query84	282	73	70	70
query85	850	290	295	290
query86	319	293	255	255
query87	5005	4670	4705	4670
query88	3490	2172	2170	2170
query89	408	291	281	281
query90	2065	189	189	189
query91	136	97	107	97
query92	59	52	47	47
query93	1071	526	531	526
query94	776	281	288	281
query95	333	266	255	255
query96	615	276	294	276
query97	2922	2694	2743	2694
query98	218	205	191	191
query99	1718	1338	1323	1323
Total cold run time: 322415 ms
Total hot run time: 197939 ms

@doris-robot

ClickBench: Total hot run time: 32.73 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9456193a1494a2ab26d5ec7d468eae51c69bb217, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.03	0.03
query2	0.07	0.04	0.03
query3	0.23	0.06	0.06
query4	1.64	0.10	0.10
query5	0.41	0.41	0.41
query6	1.17	0.65	0.65
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.56	0.50	0.50
query10	0.55	0.54	0.55
query11	0.14	0.10	0.11
query12	0.14	0.12	0.11
query13	0.60	0.60	0.60
query14	2.71	2.78	2.78
query15	0.90	0.82	0.82
query16	0.38	0.37	0.38
query17	1.08	1.03	1.04
query18	0.20	0.20	0.19
query19	1.87	1.82	1.96
query20	0.02	0.01	0.00
query21	15.35	0.58	0.57
query22	2.72	2.04	2.59
query23	17.07	1.16	0.77
query24	2.71	1.53	0.94
query25	0.25	0.14	0.13
query26	0.44	0.14	0.13
query27	0.05	0.04	0.03
query28	10.51	1.11	1.07
query29	12.58	3.24	3.26
query30	0.24	0.06	0.05
query31	2.87	0.38	0.37
query32	3.27	0.46	0.45
query33	3.03	3.02	3.01
query34	16.96	4.45	4.48
query35	4.54	4.51	4.54
query36	0.65	0.48	0.48
query37	0.09	0.07	0.06
query38	0.04	0.03	0.03
query39	0.03	0.02	0.02
query40	0.16	0.13	0.12
query41	0.08	0.02	0.02
query42	0.04	0.02	0.02
query43	0.03	0.04	0.03
Total cold run time: 106.48 s
Total hot run time: 32.73 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.87% (9835/25969)
Line Coverage: 29.04% (81764/281568)
Region Coverage: 28.31% (42222/149155)
Branch Coverage: 24.88% (21424/86106)
Coverage Report: http://coverage.selectdb-in.cc/coverage/9456193a1494a2ab26d5ec7d468eae51c69bb217_9456193a1494a2ab26d5ec7d468eae51c69bb217/report/index.html

Member

@airborne12 airborne12 left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 5, 2024
@github-actions
Contributor

github-actions bot commented Nov 5, 2024

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Nov 5, 2024

PR approved by anyone and no changes requested.

Member

@eldenmoon eldenmoon left a comment

LGTM

5 participants