MIR too miraculous on ARM? #34177
Comments
Confirmation of NEON instructions being (not-)used, or even the assembly file of the benchmark verbatim, would be very nice. |
Turns out there are two issues here. Firstly NEON is being used every time and Secondly, and that means old trans was the problem after all:

Fast version:

Profiling raytrace-a8fc715418690b23 with callgrind...
Total Instructions...342,469,841
252,629,865 (73.8%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
89,839,976 (26.2%) ???:std::time::Instant::now
-----------------------------------------------------------------------

and the slow version:

Profiling raytrace-a8fc715418690b23 with callgrind...
Total Instructions...673,450,376
373,645,740 (55.5%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
299,804,636 (44.5%) memset.S:memset
----------------------------------------------------------------------- |
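To make the two profiles easier to compare, here is a minimal sketch that recomputes the per-function shares from the instruction counts quoted above. The helper name `share_percent` is made up for illustration; only the numbers come from the callgrind output.

```rust
/// Percentage of `part` within `total`, rounded to one decimal place.
/// (Hypothetical helper, not part of the raytrace benchmark.)
fn share_percent(part: u64, total: u64) -> f64 {
    let pct = part as f64 / total as f64 * 100.0;
    (pct * 10.0).round() / 10.0
}

fn main() {
    // Instruction counts copied from the two callgrind profiles above.
    let fast_total: u64 = 342_469_841;
    let fast_hit: u64 = 252_629_865;
    let slow_total: u64 = 673_450_376;
    let slow_hit: u64 = 373_645_740;
    let slow_memset: u64 = 299_804_636;

    println!("fast: hit    = {}%", share_percent(fast_hit, fast_total));
    println!("slow: hit    = {}%", share_percent(slow_hit, slow_total));
    println!("slow: memset = {}%", share_percent(slow_memset, slow_total));

    // How much of the extra work in the slow build is memset alone?
    let extra = slow_total - fast_total;
    println!(
        "memset accounts for {}% of the {} extra instructions",
        share_percent(slow_memset, extra),
        extra
    );
}
```

The last line is the interesting one: most of the difference between the two builds is the `memset` entry that only shows up in the slow profile, rather than extra work inside `hit` itself.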
Drop filling, almost guaranteed. Such things are very likely to make LLVM unable to autovectorize code that changes very little, for various reasons. We have static drop in MIR now, which likely makes things easier for LLVM to see through and thus succeed in vectorising in both cases. Cursory googling seems to suggest that both |
And even if |
@petevine also, did you check that neon instructions are in fact emitted for |
That's
raytrace-a8fc715418690b23.s contains:

vadd.f32 s5, s5, s12
vadd.f32 s1, s1, s1
vadd.f32 s3, s3, s3
vadd.f32 s5, s5, s5
vadd.f32 s1, s1, s12
vadd.f32 s3, s3, s12
vadd.f32 s5, s5, s12
vmul.f32 s7, s1, s1
vmul.f32 s9, s3, s3
vmul.f32 s11, s5, s5
vadd.f32 s7, s7, s9
vadd.f32 s7, s7, s11

Same for the slow version. How do you explain flipping the neon switch only affects old trans but does nothing for MIR? |
Most likely LLVM takes different decisions while optimising some sort of code depending on whether fpu is
No it doesn’t. It would however still use NEON instructions if the libstd was compiled with
Does changing order to Reopening, because we do not honour the request to not use NEON (even if it's most likely LLVM’s fault) |
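One cheap sanity check for the flag side of this question is to ask the compiler which target features it thinks are enabled for the crate being built. The sketch below uses the standard `cfg!(target_feature = "neon")` check; the function names are invented for illustration. Note the limits: it only reflects the flags applied to this crate, says nothing about how libstd was compiled, and does not prove which instructions LLVM ends up emitting.

```rust
/// True when the crate is being compiled with the `neon` target feature
/// enabled (for example via the target defaults or `-C target-feature=+neon`).
/// Hypothetical helper name; the `cfg!` check itself is standard Rust.
fn neon_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "neon")
}

/// Human-readable summary of the compile-time setting.
fn neon_report() -> String {
    if neon_enabled_at_compile_time() {
        String::from("neon: enabled for this crate")
    } else {
        String::from("neon: not enabled for this crate")
    }
}

fn main() {
    // This is evaluated at compile time, so it has to be rebuilt for each
    // flag combination that is being compared.
    println!("{}", neon_report());
}
```

If the report says the feature is disabled and vector instructions still appear in the binary, that would point at precompiled code (such as libstd) or at the backend rather than at the flags passed for this crate.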
Yeah, just clutching at straws. OK, looking at llvm-ir from old trans, the only crates that differ are |
llvm-ir does not include metadata and monomorphisations of generic code are done on-demand. Comparing llvm-ir of libraries, as opposed to llvm-ir of final binary, is not very interesting. |
The main chunk corresponding to the binary name is identical too - any suggestions? |
I gave up trying to cross-compile the thing and see what’s up. I still strongly suspect an LLVM bug, that’s pretty much all I can say ATM. |
I tried building master of that library to no avail. On 2016-06-09 14:12:24-0700, petevine wrote:
|
These are not NEON instructions, they are standard scalar VFP instructions. |
Yeah, I had a brainfart from looking at some British humour of calling scalar fp instructions The title of this issue was meant to convey no NEON could have been used but then I actually |
I badly wanted to see stuff like this:

vmul.f32 q11, q10, d4[0]
vadd.f32 q11, q11, q12

but it's nowhere to be found, and besides, the
Definitely an LLVM issue as choosing a better fpu option seems to impede performance. |
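The distinction drawn above (scalar VFP on `s` registers versus NEON on `q` registers) can be checked mechanically over a `.s` file. The following is a rough sketch under a stated assumption: it only looks at register names, treating any `q` operand as a NEON vector operation and an instruction whose operands are all `s` registers as scalar VFP. Everything else, including `d`-register forms that can belong to either unit, is reported as unknown. The type and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum FpKind {
    ScalarVfp,
    NeonVector,
    Unknown,
}

/// Heuristic classifier for one line of ARM assembly (an assumption of this
/// sketch, not an instruction decoder).
fn classify(line: &str) -> FpKind {
    let mut parts = line.trim().splitn(2, char::is_whitespace);
    let mnemonic = parts.next().unwrap_or("");
    let operands = parts.next().unwrap_or("");
    if !mnemonic.starts_with('v') {
        return FpKind::Unknown;
    }
    let regs: Vec<&str> = operands
        .split(',')
        .map(|r| r.trim())
        .filter(|r| !r.is_empty())
        .collect();
    if regs.is_empty() {
        return FpKind::Unknown;
    }
    // A register operand is the prefix letter followed by a digit, e.g. `s12`.
    let is_reg = |r: &str, prefix: char| {
        r.starts_with(prefix) && r[1..].chars().next().map_or(false, |c| c.is_ascii_digit())
    };
    if regs.iter().any(|r| is_reg(r, 'q')) {
        FpKind::NeonVector
    } else if regs.iter().all(|r| is_reg(r, 's')) {
        FpKind::ScalarVfp
    } else {
        FpKind::Unknown
    }
}

fn main() {
    // Lines taken from the listings quoted in this thread.
    let listing = [
        "vadd.f32 s5, s5, s12",
        "vmul.f32 s7, s1, s1",
        "vmul.f32 q11, q10, d4[0]",
        "vadd.f32 q11, q11, q12",
    ];
    for line in listing.iter() {
        println!("{:<28} {:?}", line, classify(line));
    }
}
```

Counting the `NeonVector` results over the whole assembly file for each flag combination would give a quick answer to whether any vector code was generated at all.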
Removing the I-wrong label, because it seems like we aren’t emitting instructions we aren’t supposed to emit. I do not see any reason to keep this issue open anymore either. It seems to me like the only thing left in this issue is about LLVM optimisations not producing the desired instruction sequence (see previous comment)…
@petevine are there any objections to closing this issue? |
No problem; with a looming switch to MIR by default, an investigation of old trans performance hit is probably not worth the effort. |
I've just found some rather strange raytrace benchmark results:

Old trans:
-neon 578,445,347 +neon 314,162,592 ns/iter

so far so good, NEON is expected to provide about that much of a benefit on my particular machine. But using -Z orbit:
-neon 264,841,360 +neon 264,789,659 ns/iter

Without looking at the assembly, it probably doesn't mean NEON is never used and old trans is a very sick puppy but rather -Zorbit plays loose with the combination of feature and cpu flags (-neon and cortex-a5, respectively).
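For reference, the ratios implied by these timings can be worked out with a few lines of arithmetic. This is only a sketch over the four ns/iter figures quoted above; the `speedup` helper is a made-up name.

```rust
/// Ratio of the slower timing to the faster one (hypothetical helper).
fn speedup(slower_ns: u64, faster_ns: u64) -> f64 {
    slower_ns as f64 / faster_ns as f64
}

fn main() {
    // ns/iter figures from the benchmark results above.
    let old_no_neon: u64 = 578_445_347;
    let old_neon: u64 = 314_162_592;
    let mir_no_neon: u64 = 264_841_360;
    let mir_neon: u64 = 264_789_659;

    // Effect of the neon switch within each backend.
    println!("old trans, -neon vs +neon: {:.3}x", speedup(old_no_neon, old_neon));
    println!("-Z orbit,  -neon vs +neon: {:.3}x", speedup(mir_no_neon, mir_neon));

    // Old trans at its best compared with the MIR build.
    println!("old trans +neon vs -Z orbit +neon: {:.3}x", speedup(old_neon, mir_neon));
}
```

The switch roughly halves the time under old trans, makes essentially no difference under `-Z orbit`, and the MIR build is faster than either old trans result.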