
Hungry for more SIMD vectorization #7687

Closed
timholy opened this issue Jul 21, 2014 · 18 comments

@timholy
Member

timholy commented Jul 21, 2014

The first of these vectorizes (given Matrix{Float32} inputs), but the second doesn't.

function myblur_y!(blur_y, A)
    size(blur_y) == size(A) || error("mismatch")
    interior_x, interior_y = 2:size(A,1)-1, 2:size(A,2)-1
    @inbounds begin
        for y = interior_y, x = 1:size(A,1)
            blur_y[x,y] = A[x,y-1]+A[x,y]+A[x,y+1]
        end
    end
end

function myblur_x!(out, blur_y)
    size(out) == size(blur_y) || error("mismatch")
    interior_x, interior_y = 2:size(out,1)-1, 2:size(out,2)-1
    @inbounds begin
        for y = interior_y, x = interior_x
            out[x,y] = (1.0/9)*(blur_y[x-1,y] + blur_y[x,y] + blur_y[x+1,y])
        end
    end
end

However, the following vectorizes with a Vector{Float32} input:

function testvec2(lead)
    lag = similar(lead, length(lead)+1)
    @inbounds begin
        for i = 1:length(lead)
            lag[i+1] = 2*lead[i]
        end
    end
    lag
end

This makes me wonder whether there might be something relatively simple needed to turn on vectorization for myblur_x!?

Also, it would be great to be able to vectorize for Float64, although I recognize that the incremental benefit will be smaller.

CC @ArchRobison.

@timholy
Member Author

timholy commented Jul 21, 2014

Hah. It turns out the lack of vectorization in myblur_x! was entirely due to the 1.0/9 term. Changing it to one(eltype(out))/9 fixes it.

@ArchRobison, should I leave this open for the Float64 issue, or just close?
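To make the fix concrete, here is a sketch of the corrected kernel (the function name myblur_x_fixed! is mine, for illustration; the body is the original myblur_x! with only the coefficient changed). The point is that the Float64 literal 1.0/9 forces a Float32 → Float64 widening (fpext) inside the loop, while one(eltype(out))/9 keeps the arithmetic in the element type:

```julia
# Sketch: hoist the coefficient and compute it in the element type of `out`,
# so a Matrix{Float32} input stays in Float32 throughout the loop.
function myblur_x_fixed!(out, blur_y)
    size(out) == size(blur_y) || error("mismatch")
    interior_x, interior_y = 2:size(out,1)-1, 2:size(out,2)-1
    c = one(eltype(out))/9   # Float32 when out is a Matrix{Float32}
    @inbounds for y = interior_y, x = interior_x
        out[x,y] = c*(blur_y[x-1,y] + blur_y[x,y] + blur_y[x+1,y])
    end
    out
end
```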

@ArchRobison
Contributor

Let's leave it open. I tried a similar C++ example with the Intel compiler and it vectorized the code, so presumably in principle Julia could vectorize it too. I'll try it with LLVM trunk to see if it does any better.

@ArchRobison
Contributor

The LLVM trunk vectorizer rejects the code because Julia codegen is generating volatile loads/stores in the loop. Here is a trivial example:

julia> TimesOne( a ) =  a*1.0
TimesOne (generic function with 1 method)

julia> code_llvm(TimesOne,(Float32,))

; Function Attrs: sspreq
define double @julia_TimesOne18932(float) #0 {
top:
  store volatile float %0, float* bitcast (i128* @jl_float_temp to float*), align 8, !dbg !8
  %1 = load volatile float* bitcast (i128* @jl_float_temp to float*), align 8, !dbg !8
  %2 = fpext float %1 to double, !dbg !8
  ret double %2, !dbg !8
}

The problem is in src/intrinsics.cpp:

    HANDLE(fpext,2) {
        // when extending a float32 to a float64, we need to force
        // rounding to single precision first. the reason is that it's
        // fine to keep working in extended precision as long as it's
        // understood that everything is implicitly rounded to 23 bits,
        // but if we start looking at more bits we need to actually do the
        // rounding first instead of carrying around incorrect low bits.
        Value *x = auto_unbox(args[2],ctx);
        builder.CreateStore(FP(x), builder.CreateBitCast(prepare_global(jlfloattemp_var),FT(x->getType())->getPointerTo()), true);
        return builder.CreateFPExt(builder.CreateLoad(builder.CreateBitCast(prepare_global(jlfloattemp_var),FT(x->getType())->getPointerTo()), true),
                                   FTnbits(try_to_determine_bitstype_nbits(args[1],ctx)));
    }

This approach for rounding was introduced to fix issue #41. I suspect there is a cleaner way to enforce the rounding in LLVM. I'll go see what Clang does in similar circumstances. For example, C99 has a mode where extra precision is not allowed.
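As an aside, here is a small Julia illustration of the invariant the volatile store/load is meant to enforce (it does not reproduce the x87 register issue itself, which only shows up on 32-bit x86): widening must reproduce the Float32 value bit-for-bit, not "recover" extra precision.

```julia
# The Float32 nearest to 0.1 is not the Float64 nearest to 0.1, so a
# correct fpext yields exactly the value stored in the Float32.
x32 = 0.1f0
x64 = Float64(x32)
@assert x64 != 0.1           # widening does not give the Float64 nearest 0.1
@assert Float32(x64) == x32  # the round-trip preserves the Float32 exactly
```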

@lindahua
Contributor

What about emitting more efficient code when the source is tagged with @simd?

@ArchRobison
Contributor

I'd like to fix the problem in general since it impacts optimization of scalar code too. For example, the volatile sequence foils constant folding.

@JeffBezanson
Member

Issue #41 seemed to only happen on a 32-bit system; indeed I can't reproduce it now on a 64-bit system. Also @Keno pointed out that bitcast behaves as if the value were stored to memory, so we should try using bitcast instead of the volatile store/load:

http://llvm.org/docs/LangRef.html#bitcast-to-instruction

@ArchRobison
Contributor

64-bit systems tend to forsake the x87 unit except in rare circumstances. I'd like to be able to reproduce the original problem before trying the bitcast fix. Is there an easy way to configure Julia on a 64-bit system to build as a 32-bit executable?

@timholy
Member Author

timholy commented Jul 21, 2014

While it would be great to fix these mixed-type operations, I bet in practice many people will use computations that are all Float64 or a mix of Float64 and Int. AFAICT those are still not vectorizing: compare code_llvm on testvec2 with Vector{Float64} vs Vector{Float32}. Unless I'm really misunderstanding something, I'm guessing this must be a separate issue from the rounding behavior.

@ArchRobison
Contributor

The failure of LLVM to vectorize Float64 is a separate issue. I've been playing with LLVM trunk on a machine with AVX, and the problem seems to be in LLVM's cost model. It considers a scalar load of Float64 to have a cost of 1, but a 4-way SIMD load of Float64 to have a cost of 51. That seems way out of proportion.

@staticfloat
Member

@ArchRobison to build julia as a 32-bit executable, I think you'll just need to pass -m32 as a CFLAG to julia and all dependencies. If you'd like to install dependencies as 32-bit so that you don't have to bother with recompiling everything, you can usually request it. Here's how you'd do it on Ubuntu/Debian: sudo apt-get install llvm-3.3:i386.

@rickhg12hs
Contributor

As a related aside ...
@ArchRobison @staticfloat Building everything with -m32 on Travis might also catch some of the occasional 32-bit issues.

@ArchRobison
Contributor

It's passing -m32 to all the dependencies that's the problem. There are lots of dependencies. I tried setting CFLAGS = -m32 in Make.user, but the compilation of deps/libunwind-1.1 stopped from some kind of confusion between 32-bit and 64-bit. I'll poke around to see if I can find a 32-bit environment here.

@staticfloat
Member

You'll probably have to make distclean all your dependencies. You may also just need to export CFLAGS=-m32 in your shell, since variables are not automatically passed down from Make.user to dependencies, and we may have missed a few spots.

@JeffBezanson
Member

Starting a 32-bit VM is probably easier.

@staticfloat
Member

@ArchRobison I have some 32-bit VMs already up and running, if you need them. I can email you ssh usernames and passwords if you want.

@ArchRobison
Contributor

@staticfloat Thanks for the offer. I may take it up if the few 32-bit machines/VMs that I found here don't work out.

@staticfloat
Member

No problem. I've been setting up VMs specifically to build Julia all week, so the pathways have already been set up over here.

Incidentally, as I was trying to find your github account, I discovered that Google knows where your true loyalties lie, and even sorts them appropriately: :P
[screenshot: "Google Knows All"]

ArchRobison pushed a commit to ArchRobison/julia that referenced this issue Jul 22, 2014
@vtjnash
Member

vtjnash commented Jul 23, 2014

If you set ARCH=i686 (on a clean checkout), Julia will configure and build for 32-bit i686, assuming you have all the 32-bit dependencies. This is a very recent addition, and I think it is mentioned in the readme. You can optionally also set MARCH to further refine the target selection.
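Assuming the Make.user mechanism mentioned earlier in the thread, a minimal 32-bit build configuration might look like the following (the MARCH value is just an illustrative guess; check Make.inc for the supported settings):

```make
# Make.user — build a 32-bit Julia on a 64-bit host.
# Requires 32-bit versions of the dependencies to be installed.
ARCH = i686
# Optionally refine the target microarchitecture, e.g.:
# MARCH = pentium4
```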

ArchRobison pushed a commit to ArchRobison/julia that referenced this issue Jul 24, 2014
…g#41

to use volatile store/load on 32-bit x86 only.
JeffBezanson added a commit that referenced this issue Jul 24, 2014
Fix issue #7687, by revising original fix for issue #41