-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve nop-match simplification #68481
Conversation
r? @zackmdavis (rust_highfive has picked a reviewer for you, use r? to override) |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't understood some parts, maybe you can clear up some of my confusion.
/// or | ||
/// | ||
/// ```rust | ||
/// StorageLive(_LOCAL_TMP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on why we should be doing this optimization in debug mode (and why we shouldn't instead run the storage and temporary elimination)? Will this make debug
mode compile faster in all cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply!
AFAIK mir-opt-level
is always 1
unless explicitly specified, regardless of release
or debug
build. mir-opt-level
is independent from optimization level, and cargo
does not set it. (-Zmir-optlevel
is an unstable flag and setting it to 2
or larger is prone to ICE/miscompile anyway.)
You can confirm this by building MIR of this playground in release
/ debug
build. Temporaries are not eliminated even in release
build, and SimplifyArmIdentity
is not working.
That said, the purpose of this PR is to improve the codegen of identical match
in relase
build.
diff
of cargo rustc --release -- --emit=llvm-ir
, using the same code as the playground above, before/after this PR
--- old.ll 2020-01-27 07:04:08.034979599 +0900
+++ new.ll 2020-01-27 07:04:21.682132255 +0900
@@ -1,5 +1,5 @@
-; ModuleID = 'foo.1lzogcxs-cgu.0'
-source_filename = "foo.1lzogcxs-cgu.0"
+; ModuleID = 'foo.2my2volq-cgu.0'
+source_filename = "foo.2my2volq-cgu.0"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -9,80 +9,35 @@
%"unwind::libunwind::_Unwind_Context" = type { [0 x i8] }
; foo::try_identity
-; Function Attrs: nofree norecurse nounwind nonlazybind uwtable
-define void @_ZN3foo12try_identity17hf7a3f301f9e3a548E(%"core::result::Result<u64, i32>"* noalias nocapture sret dereferenceable(16), %"core::result::Result<u64, i32>"* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
+; Function Attrs: nounwind nonlazybind uwtable
+define void @_ZN3foo12try_identity17h273ad7d1ef012f0eE(%"core::result::Result<u64, i32>"* noalias nocapture sret dereferenceable(16), %"core::result::Result<u64, i32>"* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
start:
- %1 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %x, i64 0, i32 0, i64 0
- %2 = load i32, i32* %1, align 8, !range !2
- %switch = icmp eq i32 %2, 1
- br i1 %switch, label %bb3, label %bb2
-
-bb2: ; preds = %start
- %3 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %x, i64 0, i32 2, i64 1
- %4 = bitcast i32* %3 to i64*
- %x2 = load i64, i64* %4, align 8
- %5 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %0, i64 0, i32 2, i64 1
- %6 = bitcast i32* %5 to i64*
- store i64 %x2, i64* %6, align 8
- br label %bb4
-
-bb3: ; preds = %start
- %7 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %x, i64 0, i32 2, i64 0
- %x1 = load i32, i32* %7, align 4
- %8 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %0, i64 0, i32 2, i64 0
- store i32 %x1, i32* %8, align 4
- br label %bb4
-
-bb4: ; preds = %bb2, %bb3
- %.sink = phi i32 [ 0, %bb2 ], [ 1, %bb3 ]
- %9 = getelementptr inbounds %"core::result::Result<u64, i32>", %"core::result::Result<u64, i32>"* %0, i64 0, i32 0, i64 0
- store i32 %.sink, i32* %9, align 8
+ %1 = bitcast %"core::result::Result<u64, i32>"* %0 to i8*
+ %2 = bitcast %"core::result::Result<u64, i32>"* %x to i8*
+ tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %1, i8* nonnull align 8 %2, i64 16, i1 false)
ret void
}
; foo::try_identity_drop
; Function Attrs: nounwind nonlazybind uwtable
-define void @_ZN3foo17try_identity_drop17h5d9ea5a8e7421c58E(%"core::result::Result<alloc::string::String, i32>"* noalias nocapture sret dereferenceable(32), %"core::result::Result<alloc::string::String, i32>"* noalias nocapture readonly dereferenceable(32) %x) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
+define void @_ZN3foo17try_identity_drop17h7d0574702d5fadd5E(%"core::result::Result<alloc::string::String, i32>"* noalias nocapture sret dereferenceable(32), %"core::result::Result<alloc::string::String, i32>"* noalias nocapture readonly dereferenceable(32) %x) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
- %1 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %x, i64 0, i32 0, i64 0
- %2 = load i32, i32* %1, align 8, !range !2
- %switch = icmp eq i32 %2, 1
- br i1 %switch, label %bb8, label %bb5
-
-bb4: ; preds = %bb5, %bb8
- %.sink = phi i32 [ 0, %bb5 ], [ 1, %bb8 ]
- %3 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %0, i64 0, i32 0, i64 0
- store i32 %.sink, i32* %3, align 8
+ %1 = bitcast %"core::result::Result<alloc::string::String, i32>"* %0 to i8*
+ %2 = bitcast %"core::result::Result<alloc::string::String, i32>"* %x to i8*
+ tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %1, i8* nonnull align 8 %2, i64 32, i1 false)
ret void
-
-bb5: ; preds = %start
- %x1.sroa.0.0..sroa_idx3 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %x, i64 0, i32 2, i64 1
- %x1.sroa.0.0..sroa_cast = bitcast i32* %x1.sroa.0.0..sroa_idx3 to i8*
- %_4.sroa.0.0..sroa_idx10 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %0, i64 0, i32 2, i64 1
- %_4.sroa.0.0..sroa_cast = bitcast i32* %_4.sroa.0.0..sroa_idx10 to i8*
- call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %_4.sroa.0.0..sroa_cast, i8* nonnull align 8 %x1.sroa.0.0..sroa_cast, i64 24, i1 false)
- br label %bb4
-
-bb8: ; preds = %start
- %4 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %x, i64 0, i32 2, i64 0
- %x2 = load i32, i32* %4, align 4
- %5 = getelementptr inbounds %"core::result::Result<alloc::string::String, i32>", %"core::result::Result<alloc::string::String, i32>"* %0, i64 0, i32 2, i64 0
- store i32 %x2, i32* %5, align 4
- br label %bb4
}
; Function Attrs: nounwind nonlazybind uwtable
-declare i32 @rust_eh_personality(i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*) unnamed_addr #1
+declare i32 @rust_eh_personality(i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*) unnamed_addr #0
; Function Attrs: argmemonly nounwind
-declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1 immarg) #1
-attributes #0 = { nofree norecurse nounwind nonlazybind uwtable "probe-stack"="__rust_probestack" "target-cpu"="x86-64" }
-attributes #1 = { nounwind nonlazybind uwtable "probe-stack"="__rust_probestack" "target-cpu"="x86-64" }
-attributes #2 = { argmemonly nounwind }
+attributes #0 = { nounwind nonlazybind uwtable "probe-stack"="__rust_probestack" "target-cpu"="x86-64" }
+attributes #1 = { argmemonly nounwind }
!llvm.module.flags = !{!0, !1}
!0 = !{i32 7, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
-!2 = !{i32 0, i32 2}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the compiler performance: I locally run rustc-perf, and it looks perf-neutral. The 2nd point in #66234 (comment) is still required for these optimizations to be useful in real cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there's two things I'm worried about here:
-
we're slowly adding more optimizations to debug mode, which by it's very definition should be easy to debug with a debugger. Such optimizations can easily end up messing up the source. This one doesn't do it yet, but still.
-
You essentially had to implement an optimization twice, once for mir-opt-level 1 and once for mir-opt-level 2. I feel like that needlessly complicates our optimizations.
Instead of implementing this optimization twice, how about moving the ICEing optimizations to mir-opt-level 3 and making mir-opt-level 2 the standard for --release
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're slowly adding more optimizations to debug mode, which by it's very definition should be easy to debug with a debugger. Such optimizations can easily end up messing up the source.
I realized this PR is already debuginfo-breaking. If the debugger was stepping on match
, it would point to whichever arm the optimizer reduced all other arms into.
Instead of implementing this optimization twice, how about moving the ICEing optimizations to mir-opt-level 3 and making mir-opt-level 2 the standard for --release?
Your plan sounds right. Efforts should be concentrated on fixing/improving advanced MIR optimizations rather than adding ad-hoc optimization passes.
I'll close this PR. Thank you for your time, @oli-obk!
@@ -169,19 +194,23 @@ fn match_arm(stmts: &mut [Statement<'_>], local_decls: &mut IndexVec<Local, Loca | |||
return; | |||
} | |||
|
|||
if vf_1 != vf_0 // The field-and-variant information match up. | |||
if drop_flag == Some(local_0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what these checks do. The type of local_0
is never bool
as far as I can tell, so how can local_0
and drop_flag
refer to the same local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this one is definitely unnecessary. I added it by mistake - sorry for confusion.
let mut matched_stmts = 0; | ||
let basic_blocks = body.basic_blocks_mut(); | ||
|
||
loop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... this loop goes backwards in all preds
statements and gives you the number of statements that are the same? Basically extracting a common postfix of statements from all previous basic blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
|| drop_flag == Some(local_tmp_2) | ||
|| local_decls[local_tmp_2].is_user_variable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I still don't understand what the above code is doing, I can't tell if removing the == Some(local_0)
is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
== Some(local_0)
surely didn't make sense as you pointed out. Regarding other checks... I'm not sure. I added checks to make sure the code being matched is exactly what I intended, but probably they can all be removed.
SimplifyArmIdentity
not working in presence of temporaries and storage markes, which are not eliminated inmir-opt-level=1
. This does not fix the issue of?
not being eliminated inmir-opt-level < 2
.SinkCommonCodeFromPredecessors
pass. This allows optimizing identical matches on types that have destructors.This code ...
... gives you MIR with drop flag manipulation:
In this case
SimplifyBranchSame
does not work becausebb2
andbb3
is still unequal afterSimplifyArmIdentity
pass. It seems that LLVM is able to optimize thematch
intomemcpy
from this state, though.SinkCommonCodeFromPredecessors
is able to sink the common statement,_0 = move _1
, tobb8
.I stole the idea from LLVM, but this PR implements only (1) in the comment there.
#66234