MIR too miraculous on ARM? #34177

Closed
MagaTailor opened this issue Jun 9, 2016 · 17 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state

Comments

@MagaTailor

MagaTailor commented Jun 9, 2016

I've just found some rather strange raytrace benchmark results:

Old trans:
-neon: 578,445,347 ns/iter    +neon: 314,162,592 ns/iter

So far so good; NEON is expected to provide about that much of a benefit on my particular machine. But using -Z orbit:

-neon: 264,841,360 ns/iter    +neon: 264,789,659 ns/iter

Without having looked at the assembly yet, this probably doesn't mean NEON is never used and old trans is simply a very sick puppy; more likely -Z orbit plays loose with the combination of feature and cpu flags (-neon and cortex-a5, respectively).
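
For reference, the runs were invoked roughly like this (the flags shown are illustrative of my setup, not copied verbatim):

    # old trans, NEON off vs on
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=-neon' cargo bench
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=+neon' cargo bench
    # same pair again through the new MIR trans
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=-neon -Z orbit' cargo bench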

@nagisa
Member

nagisa commented Jun 9, 2016

Confirmation of whether NEON instructions are actually being used, or better yet the verbatim assembly file of the benchmark, would be very nice.

@nagisa nagisa added the A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html label Jun 9, 2016
@MagaTailor
Author

MagaTailor commented Jun 9, 2016

Turns out there are two issues here. Firstly, NEON is being used every time: RUSTFLAGS='-C target-feature=-neon' is ineffective when coupled with a cpu that supports vfpv4 (probably via +vfp4 and/or target-cpu); in other words, it's not possible to use vfpv4 without NEON. (The asm headers, however, do show the difference: .fpu vfpv4 vs .fpu neon-vfpv4.)
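
For the record, checking that header is just a matter of emitting asm and grepping for the .fpu directive (the output path below assumes cargo's usual layout):

    RUSTFLAGS='-C target-feature=-neon --emit asm' cargo bench
    grep '\.fpu' target/release/deps/raytrace-*.s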

Secondly, it turns out old trans was the problem after all:

Fast version:

Profiling raytrace-a8fc715418690b23 with callgrind...

Total Instructions...342,469,841

252,629,865 (73.8%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
89,839,976 (26.2%) ???:std::time::Instant::now
-----------------------------------------------------------------------

and the slow version:

Profiling raytrace-a8fc715418690b23 with callgrind...

Total Instructions...673,450,376

373,645,740 (55.5%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
299,804,636 (44.5%) memset.S:memset
-----------------------------------------------------------------------

@nagisa
Member

nagisa commented Jun 9, 2016

memset.S:memset

Drop filling, almost guaranteed. For various reasons, such things are very likely to make LLVM unable to autovectorise code that otherwise changes very little. We have static drop in MIR now, which likely makes the code easier for LLVM to see through and thus lets it vectorise in both cases.
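
Roughly the kind of pattern I mean, as a hypothetical sketch (this is not the benchmark's actual code, just an illustration of drop filling):

    // A type that needs Drop (because of the String) with a large payload.
    struct Scene {
        samples: [f32; 256],
        name: String,
    }

    fn consume(_s: Scene) {}

    fn render(scene: Scene) {
        // Old trans: after this move, `scene`'s whole stack slot is
        // overwritten with a fill pattern so the scope-exit drop glue can
        // tell the value has already been moved out. For a large value
        // that fill lowers to a memset, and a memset in a hot path easily
        // defeats autovectorisation of the surrounding arithmetic.
        consume(scene);
        // MIR with static drop knows statically that `scene` is gone
        // here, so neither the fill nor the runtime check is emitted.
    }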

Cursory googling seems to suggest that both the vfpv4 and neon-vfpv4 FPU configurations come with NEON, whereas only vfpv4-d16 does not require it. Thus closing as “working as intended”. I might be wrong, though, so ask for a reopen if that’s the case.

@nagisa nagisa closed this as completed Jun 9, 2016
@nagisa
Member

nagisa commented Jun 9, 2016

And even if vfpv4 is, in fact, not supposed to use NEON, it’s most likely an LLVM bug and should be reported to them.

@nagisa
Member

nagisa commented Jun 9, 2016

@petevine also, did you check that NEON instructions are in fact emitted for vfpv4? I haven’t seen you explicitly state that. Perhaps there was no autovectorisation in either case and MIR is just producing better code?

@MagaTailor
Author

That's +vfp4,+d16 in LLVM speak. However, NEON is just an extension on top of vfpv4, which itself only adds FMA instructions, so it should be possible to separate the two. I wonder if the way libstd was bootstrapped has any bearing. And even if I do:

RUSTFLAGS='-C target-feature=-neon,+vfp4,+d16 -Z orbit --emit asm' cargo bench

raytrace-a8fc715418690b23.s contains:

        vadd.f32        s5, s5, s12
        vadd.f32        s1, s1, s1
        vadd.f32        s3, s3, s3
        vadd.f32        s5, s5, s5
        vadd.f32        s1, s1, s12
        vadd.f32        s3, s3, s12
        vadd.f32        s5, s5, s12
        vmul.f32        s7, s1, s1
        vmul.f32        s9, s3, s3
        vmul.f32        s11, s5, s5
        vadd.f32        s7, s7, s9
        vadd.f32        s7, s7, s11

Same for the slow version. How do you explain that flipping the NEON switch affects only old trans but does nothing for MIR?

@nagisa
Member

nagisa commented Jun 9, 2016

How do you explain that flipping the NEON switch affects only old trans but does nothing for MIR?

Most likely LLVM makes different decisions while optimising some pattern of code depending on whether the FPU is neon-vfpv4 or vfpv4, whereas that pattern does not occur in the MIR-generated IR and thus there’s no such decision to make.

I wonder if the way libstd was bootstrapped has any bearing.

No it doesn’t. It would however still use NEON instructions if the libstd was compiled with +neon and your program with -neon.

RUSTFLAGS='-C target-feature=-neon,+vfp4,+d16 -Z orbit --emit asm' cargo bench

Does changing the order to +vfp4,+d16,-neon help at all?
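
i.e. something like this (untested; if the feature list is applied left to right, the trailing -neon should win):

RUSTFLAGS='-C target-feature=+vfp4,+d16,-neon -Z orbit --emit asm' cargo bench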

Reopening, because we do not honour the request to not use NEON (even if it’s most likely LLVM’s fault).

@nagisa nagisa reopened this Jun 9, 2016
@nagisa nagisa added I-wrong A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html labels Jun 9, 2016
@MagaTailor
Author

I wonder if the way libstd was bootstrapped has any bearing.

No it doesn’t. It would however still use NEON instructions if the libstd was compiled with +neon and your program with -neon.

Yeah, just clutching at straws. OK, looking at llvm-ir from old trans, the only crates that differ are libc and rand. The rest are identical line for line, so it's probably libc I should be attaching?

@nagisa
Member

nagisa commented Jun 9, 2016

llvm-ir does not include metadata, and monomorphisations of generic code are done on demand. Comparing the llvm-ir of libraries, as opposed to the llvm-ir of the final binary, is not very interesting.

@MagaTailor
Author

The main chunk corresponding to the binary name is identical too; any suggestions?

@nagisa
Member

nagisa commented Jun 9, 2016

I gave up trying to cross-compile the thing and see what’s up. I still strongly suspect an LLVM bug; that’s pretty much all I can say at the moment.

@pnkfelix pnkfelix added the A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html label Jun 9, 2016
@nagisa
Member

nagisa commented Jun 9, 2016

I tried building master of that library, to no avail.

On 2016-06-09 14:12:24-0700, petevine wrote:

@nagisa Hope you noticed Cargo.toml needed updating? I'll delete this in a few minutes.

@nagisa nagisa added the O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Jun 11, 2016
@Amanieu
Member

Amanieu commented Jun 12, 2016

    vadd.f32        s5, s5, s12
    vadd.f32        s1, s1, s1
    vadd.f32        s3, s3, s3
    vadd.f32        s5, s5, s5
    vadd.f32        s1, s1, s12
    vadd.f32        s3, s3, s12
    vadd.f32        s5, s5, s12
    vmul.f32        s7, s1, s1
    vmul.f32        s9, s3, s3
    vmul.f32        s11, s5, s5
    vadd.f32        s7, s7, s9
    vadd.f32        s7, s7, s11

These are not NEON instructions; they are standard scalar VFP instructions.
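
A rough way to spot genuine NEON arithmetic is to look for d/q register operands; VFP .f32 instructions only ever use s registers, so something along these lines should match only vector code:

    grep -E 'v(add|sub|mul|mla)\.f32[[:space:]]+[dq]' raytrace-a8fc715418690b23.s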

@MagaTailor
Author

MagaTailor commented Jun 12, 2016

Yeah, I had a brain fart brought on by the British humour of calling scalar FP instructions “Vector Floating Point” :)

The title of this issue was meant to convey that no NEON could have been used, but then I actually managed to find some, hence the needless distraction from the main issue. My bad!

@MagaTailor
Author

I badly wanted to see stuff like this:

    vmul.f32        q11, q10, d4[0]
    vadd.f32        q11, q11, q12

but it's nowhere to be found, and besides, the ARMv6-vfp2 target produces the fastest code so far:

test bench ... bench: 239,727,289 ns/iter (+/- 274,873)

Definitely an LLVM issue, as choosing a better FPU option seems to impede performance.
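
For completeness, that fastest run targeted the ARMv6 hard-float triple, i.e. something along the lines of:

    cargo bench --target arm-unknown-linux-gnueabihf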

@nagisa nagisa removed the I-wrong label Jun 12, 2016
@nagisa
Member

nagisa commented Jun 12, 2016

Removing the I-wrong label, because it seems like we aren’t emitting instructions we aren’t supposed to emit. I do not see any reason to keep this issue open anymore either. It seems to me like the only thing left in this issue is about LLVM optimisations not producing the desired instruction sequence (see previous comment)…

  1. which is not an uncommon problem with optimising compilers in general; and
  2. which is something we cannot do very much about.

@petevine are there any objections to closing this issue?

@MagaTailor
Author

No problem; with the looming switch to MIR by default, investigating the old trans performance hit is probably not worth the effort.

@nagisa nagisa closed this as completed Jun 13, 2016