MIR too miraculous on ARM? #34177

Closed
MagaTailor opened this issue Jun 9, 2016 · 17 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state

Comments

@MagaTailor

MagaTailor commented Jun 9, 2016

I've just found some rather strange raytrace benchmark results:

Old trans:
-neon: 578,445,347 ns/iter    +neon: 314,162,592 ns/iter

So far so good; NEON is expected to provide about that much of a benefit on my particular machine. But using -Z orbit:

-neon: 264,841,360 ns/iter    +neon: 264,789,659 ns/iter

Without having looked at the assembly yet, this probably doesn't mean NEON is never used and old trans is simply a very sick puppy; more likely -Z orbit plays loose with the combination of feature and cpu flags (-neon and cortex-a5, respectively).
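
For reference, the runs were invoked roughly like this (the flags shown are illustrative of my setup, not copied verbatim):

    # old trans, NEON off vs on
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=-neon' cargo bench
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=+neon' cargo bench
    # same pair again through the new MIR trans
    RUSTFLAGS='-C target-cpu=cortex-a5 -C target-feature=-neon -Z orbit' cargo bench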

@nagisa
Member

nagisa commented Jun 9, 2016

Confirmation of whether NEON instructions are actually being used, or better yet the verbatim assembly file of the benchmark, would be very nice.

@nagisa nagisa added the A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html label Jun 9, 2016
@MagaTailor
Author

MagaTailor commented Jun 9, 2016

Turns out there are two issues here. Firstly, NEON is being used every time: RUSTFLAGS='-C target-feature=-neon' is ineffective when coupled with a cpu that supports vfpv4 (probably via +vfp4 and/or target-cpu); in other words, it's not possible to use vfpv4 without NEON. (The asm headers, however, do show the difference: .fpu vfpv4 vs .fpu neon-vfpv4.)
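
For the record, checking that header is just a matter of emitting asm and grepping for the .fpu directive (the output path below assumes cargo's usual layout):

    RUSTFLAGS='-C target-feature=-neon --emit asm' cargo bench
    grep '\.fpu' target/release/deps/raytrace-*.s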

Secondly, it turns out old trans was the problem after all:

Fast version:

Profiling raytrace-a8fc715418690b23 with callgrind...

Total Instructions...342,469,841

252,629,865 (73.8%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
89,839,976 (26.2%) ???:std::time::Instant::now
-----------------------------------------------------------------------

and the slow version:

Profiling raytrace-a8fc715418690b23 with callgrind...

Total Instructions...673,450,376

373,645,740 (55.5%) ???:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
299,804,636 (44.5%) memset.S:memset
-----------------------------------------------------------------------

@nagisa
Member

nagisa commented Jun 9, 2016

memset.S:memset

Drop filling, almost guaranteed. For various reasons, such things are very likely to make LLVM unable to autovectorise code that otherwise changes very little. We have static drop in MIR now, which likely makes the code easier for LLVM to see through and thus lets it vectorise in both cases.
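
Roughly the kind of pattern I mean, as a hypothetical sketch (this is not the benchmark's actual code, just an illustration of drop filling):

    // A type that needs Drop (because of the String) with a large payload.
    struct Scene {
        samples: [f32; 256],
        name: String,
    }

    fn consume(_s: Scene) {}

    fn render(scene: Scene) {
        // Old trans: after this move, `scene`'s whole stack slot is
        // overwritten with a fill pattern so the scope-exit drop glue can
        // tell the value has already been moved out. For a large value
        // that fill lowers to a memset, and a memset in a hot path easily
        // defeats autovectorisation of the surrounding arithmetic.
        consume(scene);
        // MIR with static drop knows statically that `scene` is gone
        // here, so neither the fill nor the runtime check is emitted.
    }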

Cursory googling seems to suggest that both the vfpv4 and neon-vfpv4 FPU configurations come with NEON, whereas only vfpv4-d16 does not require it. Thus closing as “working as intended”. I might be wrong, though, so ask for a reopen if that’s the case.

@nagisa nagisa closed this as completed Jun 9, 2016
@nagisa
Member

nagisa commented Jun 9, 2016

And even if vfpv4 is, in fact, not supposed to use NEON, it’s most likely an LLVM bug and should be reported to them.

@nagisa
Member

nagisa commented Jun 9, 2016

@petevine also, did you check that NEON instructions are in fact emitted for vfpv4? I haven’t seen you explicitly state that. Perhaps there was no autovectorisation in either case and MIR is just producing better code?

@MagaTailor
Author

That's +vfp4,+d16 in LLVM speak. However, NEON is just an extension on top of vfpv4, which itself only adds FMA instructions, so it should be possible to separate the two. I wonder if the way libstd was bootstrapped has any bearing. And even if I do:

RUSTFLAGS='-C target-feature=-neon,+vfp4,+d16 -Z orbit --emit asm' cargo bench

raytrace-a8fc715418690b23.s contains:

        vadd.f32        s5, s5, s12
        vadd.f32        s1, s1, s1
        vadd.f32        s3, s3, s3
        vadd.f32        s5, s5, s5
        vadd.f32        s1, s1, s12
        vadd.f32        s3, s3, s12
        vadd.f32        s5, s5, s12
        vmul.f32        s7, s1, s1
        vmul.f32        s9, s3, s3
        vmul.f32        s11, s5, s5
        vadd.f32        s7, s7, s9
        vadd.f32        s7, s7, s11

Same for the slow version. How do you explain that flipping the NEON switch affects only old trans but does nothing for MIR?

@nagisa
Member

nagisa commented Jun 9, 2016

How do you explain that flipping the NEON switch affects only old trans but does nothing for MIR?

Most likely LLVM makes different decisions while optimising some pattern of code depending on whether the FPU is neon-vfpv4 or vfpv4, whereas that pattern does not occur in the MIR-generated IR and thus there’s no such decision to make.

I wonder if the way libstd was bootstrapped has any bearing.

No it doesn’t. It would however still use NEON instructions if the libstd was compiled with +neon and your program with -neon.

RUSTFLAGS='-C target-feature=-neon,+vfp4,+d16 -Z orbit --emit asm' cargo bench

Does changing the order to +vfp4,+d16,-neon help at all?
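
i.e. something like this (untested; if the feature list is applied left to right, the trailing -neon should win):

RUSTFLAGS='-C target-feature=+vfp4,+d16,-neon -Z orbit --emit asm' cargo bench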

Reopening, because we do not honour the request to not use NEON (even if it’s most likely LLVM’s fault).

@nagisa nagisa reopened this Jun 9, 2016
@nagisa nagisa added I-wrong A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html labels Jun 9, 2016
@MagaTailor
Author

I wonder if the way libstd was bootstrapped has any bearing.

No it doesn’t. It would however still use NEON instructions if the libstd was compiled with +neon and your program with -neon.

Yeah, just clutching at straws. OK, looking at llvm-ir from old trans, the only crates that differ are libc and rand. The rest are identical line for line, so it's probably libc I should be attaching?

@nagisa
Member

nagisa commented Jun 9, 2016

llvm-ir does not include metadata, and monomorphisations of generic code are done on demand. Comparing the llvm-ir of libraries, as opposed to the llvm-ir of the final binary, is not very interesting.

@MagaTailor
Author

The main chunk corresponding to the binary name is identical too; any suggestions?

@nagisa
Member

nagisa commented Jun 9, 2016

I gave up trying to cross-compile the thing and see what’s up. I still strongly suspect an LLVM bug; that’s pretty much all I can say at the moment.

@pnkfelix pnkfelix added the A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html label Jun 9, 2016
@nagisa
Member

nagisa commented Jun 9, 2016

I tried building master of that library, to no avail.

On 2016-06-09 14:12:24-0700, petevine wrote:

@nagisa Hope you noticed Cargo.toml needed updating? I'll delete this in a few minutes.

@nagisa nagisa added the O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Jun 11, 2016
@Amanieu
Member

Amanieu commented Jun 12, 2016

    vadd.f32        s5, s5, s12
    vadd.f32        s1, s1, s1
    vadd.f32        s3, s3, s3
    vadd.f32        s5, s5, s5
    vadd.f32        s1, s1, s12
    vadd.f32        s3, s3, s12
    vadd.f32        s5, s5, s12
    vmul.f32        s7, s1, s1
    vmul.f32        s9, s3, s3
    vmul.f32        s11, s5, s5
    vadd.f32        s7, s7, s9
    vadd.f32        s7, s7, s11

These are not NEON instructions; they are standard scalar VFP instructions.
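
A rough way to spot genuine NEON arithmetic is to look for d/q register operands; VFP .f32 instructions only ever use s registers, so something along these lines should match only vector code:

    grep -E 'v(add|sub|mul|mla)\.f32[[:space:]]+[dq]' raytrace-a8fc715418690b23.s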

@MagaTailor
Author

MagaTailor commented Jun 12, 2016

Yeah, I had a brain fart brought on by the British humour of calling scalar FP instructions “Vector Floating Point” :)

The title of this issue was meant to convey that no NEON could have been used, but then I actually managed to find some, hence the needless distraction from the main issue. My bad!

@MagaTailor
Author

I badly wanted to see stuff like this:

    vmul.f32        q11, q10, d4[0]
    vadd.f32        q11, q11, q12

but it's nowhere to be found, and besides, the ARMv6-vfp2 target produces the fastest code so far:

test bench ... bench: 239,727,289 ns/iter (+/- 274,873)

Definitely an LLVM issue, as choosing a better FPU option seems to impede performance.
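
For completeness, that fastest run targeted the ARMv6 hard-float triple, i.e. something along the lines of:

    cargo bench --target arm-unknown-linux-gnueabihf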

@nagisa nagisa removed the I-wrong label Jun 12, 2016
@nagisa
Member

nagisa commented Jun 12, 2016

Removing the I-wrong label, because it seems like we aren’t emitting instructions we aren’t supposed to emit. I do not see any reason to keep this issue open anymore either. It seems to me like the only thing left in this issue is about LLVM optimisations not producing the desired instruction sequence (see previous comment)…

  1. which is not an uncommon problem with optimising compilers in general; and
  2. which is something we cannot do very much about.

@petevine are there any objections to closing this issue?

@MagaTailor
Author

No problem; with the looming switch to MIR by default, investigating the old trans performance hit is probably not worth the effort.

@nagisa nagisa closed this as completed Jun 13, 2016