@fastmath maximum segfaults for Float16 on master #49907
Comments
Can reproduce. Note that it can be triggered by the last call in the session below:
julia> versioninfo()
Julia Version 1.10.0-DEV.1351
Commit a6ad9ea099f (2023-05-21 08:01 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 5 on 12 virtual cores
Environment:
LD_LIBRARY_PATH = /usr/local/cuda/lib64
JULIA_NUM_THREADS = 4
julia> @fastmath max(Float16(1), Float16(2))
Float16(2.0)
julia> @fastmath reduce(max, Float16[1,2,3])
Float16(3.0)
julia> @fastmath reduce(max, Float16[1,2,3]; init = Float16(0))
LLVM ERROR: Cannot select: 0x204fef8: v16f16 = X86ISD::FMAX nnan ninf nsz arcp contract afn reassoc 0x2031f30, 0x1eec5d0, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
0x2031f30: v16f16,ch = CopyFromReg 0x1d83098, Register:v16f16 %9, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]

Also triggered by some other calls:
julia> foldl(Base.FastMath.max_fast, Float16[1, 2, 3])
LLVM ERROR: Cannot select: 0x248ba98: v16f16 = X86ISD::FMAX nnan ninf nsz arcp contract afn reassoc 0x24abec0, 0x24819c8, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]

On the same machine, a version from before #48153 does not have the problem:
julia> @fastmath reduce(max, Float16[1,2,3]; init = Float16(0))
Float16(3.0)
julia> versioninfo()
Julia Version 1.10.0-DEV.220
Commit 9ded051e9f8 (2022-12-29 10:05 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)

On an M1 Mac, the problem does not seem to occur:
julia> @fastmath reduce(max, Float16[1,2,3]; init = Float16(0))
Float16(3.0)
julia> versioninfo()
Julia Version 1.10.0-DEV.1351
Commit a6ad9ea099 (2023-05-21 08:01 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.6.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 5 on 4 virtual cores

This is probably an issue with the demote float16 pass. It would be cool to see the LLVM IR generated for the function that crashes.

julia> @code_llvm Base.FastMath.maximum_fast(Float16[1, 2, 3]; init = Float16(0))
; @ fastmath.jl:380 within `maximum_fast`
define half @julia_maximum_fast_289([1 x half]* nocapture noundef nonnull readonly align 2 dereferenceable(2) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
%thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #9
%ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
%ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
%pgcstack = load {}***, {}**** %ppgcstack, align 8
%ptls_field16 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2
%2 = bitcast {}*** %ptls_field16 to i64***
%ptls_load1718 = load i64**, i64*** %2, align 8
%3 = getelementptr inbounds i64*, i64** %ptls_load1718, i64 2
%safepoint = load i64*, i64** %3, align 8
fence syncscope("singlethread") seq_cst
%4 = load volatile i64, i64* %safepoint, align 8
fence syncscope("singlethread") seq_cst
; ┌ @ fastmath.jl:380 within `#maximum_fast#1`
; │┌ @ reducedim.jl:406 within `reduce`
; ││┌ @ reducedim.jl:406 within `#reduce#811`
; │││┌ @ reducedim.jl:357 within `mapreduce`
%5 = getelementptr inbounds [1 x half], [1 x half]* %0, i64 0, i64 0
; ││││┌ @ reducedim.jl:357 within `#mapreduce#809`
; │││││┌ @ reducedim.jl:362 within `_mapreduce_dim`
; ││││││┌ @ reduce.jl:44 within `mapfoldl_impl`
; │││││││┌ @ reduce.jl:48 within `foldl_impl`
; ││││││││┌ @ reduce.jl:56 within `_foldl_impl`
; │││││││││┌ @ array.jl:938 within `iterate` @ array.jl:938
; ││││││││││┌ @ essentials.jl:10 within `length`
%6 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
%7 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %6, i64 0, i32 1
%8 = load i64, i64* %7, align 8
; ││││││││││└
; ││││││││││┌ @ int.jl:520 within `<` @ int.jl:513
%.not = icmp eq i64 %8, 0
; ││││││││││└
br i1 %.not, label %L19, label %L20
L19: ; preds = %top
%9 = load half, half* %5, align 2
br label %L55
L20: ; preds = %top
; ││││││││││┌ @ essentials.jl:13 within `getindex`
%10 = bitcast {}* %1 to half**
%11 = load half*, half** %10, align 8
%12 = load half, half* %11, align 2
; │││││││││└└
; │││││││││ @ reduce.jl:58 within `_foldl_impl`
; │││││││││┌ @ reduce.jl:86 within `BottomRF`
; ││││││││││┌ @ fastmath.jl:251 within `max_fast`
; │││││││││││┌ @ fastmath.jl:191 within `gt_fast`
; ││││││││││││┌ @ fastmath.jl:189 within `lt_fast`
%13 = load half, half* %5, align 2
; │││││││││││└└
; │││││││││││┌ @ essentials.jl:621 within `ifelse`
%.inv = fcmp fast olt half %13, %12
%14 = select fast i1 %.inv, half %12, half %13
; │││││││││└└└
; │││││││││ @ reduce.jl:60 within `_foldl_impl`
; │││││││││┌ @ array.jl:938 within `iterate`
; ││││││││││┌ @ int.jl:520 within `<` @ int.jl:513
%.not1926.not = icmp eq i64 %8, 1
; ││││││││││└
br i1 %.not1926.not, label %L55, label %iter.check
iter.check: ; preds = %L20
%15 = add nsw i64 %8, -1
%min.iters.check = icmp ult i64 %15, 8
br i1 %min.iters.check, label %vec.epilog.scalar.ph, label %vector.main.loop.iter.check
vector.main.loop.iter.check: ; preds = %iter.check
%min.iters.check29 = icmp ult i64 %15, 32
br i1 %min.iters.check29, label %vec.epilog.ph, label %vector.ph
vector.ph: ; preds = %vector.main.loop.iter.check
%n.vec = and i64 %15, -32
%minmax.ident.splatinsert = insertelement <16 x half> poison, half %14, i64 0
%minmax.ident.splat = shufflevector <16 x half> %minmax.ident.splatinsert, <16 x half> poison, <16 x i32> zeroinitializer
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %22, %vector.body ]
%vec.phi30 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %23, %vector.body ]
%offset.idx = or i64 %index, 1
; ││││││││││┌ @ essentials.jl:13 within `getindex`
%16 = getelementptr inbounds half, half* %11, i64 %offset.idx
%17 = bitcast half* %16 to <16 x half>*
%wide.load = load <16 x half>, <16 x half>* %17, align 2
%18 = getelementptr inbounds half, half* %16, i64 16
%19 = bitcast half* %18 to <16 x half>*
%wide.load31 = load <16 x half>, <16 x half>* %19, align 2
; │││││││││└└
; │││││││││ @ reduce.jl:62 within `_foldl_impl`
; │││││││││┌ @ reduce.jl:86 within `BottomRF`
; ││││││││││┌ @ fastmath.jl:251 within `max_fast`
; │││││││││││┌ @ essentials.jl:621 within `ifelse`
%20 = fcmp fast olt <16 x half> %vec.phi, %wide.load
%21 = fcmp fast olt <16 x half> %vec.phi30, %wide.load31
%22 = select <16 x i1> %20, <16 x half> %wide.load, <16 x half> %vec.phi
%23 = select <16 x i1> %21, <16 x half> %wide.load31, <16 x half> %vec.phi30
%index.next = add nuw i64 %index, 32
%24 = icmp eq i64 %index.next, %n.vec
br i1 %24, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
; │││││││││└└└
; │││││││││ @ reduce.jl:60 within `_foldl_impl`
; │││││││││┌ @ array.jl:938 within `iterate`
%25 = call fast <16 x half> @llvm.maxnum.v16f16(<16 x half> %22, <16 x half> %23)
%26 = call fast half @llvm.vector.reduce.fmax.v16f16(<16 x half> %25)
%cmp.n = icmp eq i64 %15, %n.vec
br i1 %cmp.n, label %L55, label %vec.epilog.iter.check
vec.epilog.iter.check: ; preds = %middle.block
%ind.end36 = or i64 %n.vec, 2
%ind.end34 = or i64 %n.vec, 1
%n.vec.remaining = and i64 %15, 24
%min.epilog.iters.check = icmp eq i64 %n.vec.remaining, 0
br i1 %min.epilog.iters.check, label %vec.epilog.scalar.ph, label %vec.epilog.ph
vec.epilog.ph: ; preds = %vec.epilog.iter.check, %vector.main.loop.iter.check
%bc.merge.rdx = phi half [ %14, %vector.main.loop.iter.check ], [ %26, %vec.epilog.iter.check ]
%vec.epilog.resume.val = phi i64 [ 0, %vector.main.loop.iter.check ], [ %n.vec, %vec.epilog.iter.check ]
%n.vec33 = and i64 %15, -8
%ind.end = or i64 %n.vec33, 1
%ind.end35 = or i64 %n.vec33, 2
%minmax.ident.splatinsert41 = insertelement <8 x half> poison, half %bc.merge.rdx, i64 0
%minmax.ident.splat42 = shufflevector <8 x half> %minmax.ident.splatinsert41, <8 x half> poison, <8 x i32> zeroinitializer
br label %vec.epilog.vector.body
vec.epilog.vector.body: ; preds = %vec.epilog.vector.body, %vec.epilog.ph
%index39 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next45, %vec.epilog.vector.body ]
%vec.phi40 = phi <8 x half> [ %minmax.ident.splat42, %vec.epilog.ph ], [ %30, %vec.epilog.vector.body ]
%offset.idx43 = or i64 %index39, 1
; ││││││││││┌ @ essentials.jl:13 within `getindex`
%27 = getelementptr inbounds half, half* %11, i64 %offset.idx43
%28 = bitcast half* %27 to <8 x half>*
%wide.load44 = load <8 x half>, <8 x half>* %28, align 2
; │││││││││└└
; │││││││││ @ reduce.jl:62 within `_foldl_impl`
; │││││││││┌ @ reduce.jl:86 within `BottomRF`
; ││││││││││┌ @ fastmath.jl:251 within `max_fast`
; │││││││││││┌ @ essentials.jl:621 within `ifelse`
%29 = fcmp fast olt <8 x half> %vec.phi40, %wide.load44
%30 = select <8 x i1> %29, <8 x half> %wide.load44, <8 x half> %vec.phi40
%index.next45 = add nuw i64 %index39, 8
%31 = icmp eq i64 %index.next45, %n.vec33
br i1 %31, label %vec.epilog.middle.block, label %vec.epilog.vector.body
vec.epilog.middle.block: ; preds = %vec.epilog.vector.body
; │││││││││└└└
; │││││││││ @ reduce.jl:60 within `_foldl_impl`
; │││││││││┌ @ array.jl:938 within `iterate`
%32 = call fast half @llvm.vector.reduce.fmax.v8f16(<8 x half> %30)
%cmp.n38 = icmp eq i64 %15, %n.vec33
br i1 %cmp.n38, label %L55, label %vec.epilog.scalar.ph
vec.epilog.scalar.ph: ; preds = %vec.epilog.middle.block, %vec.epilog.iter.check, %iter.check
%bc.resume.val = phi i64 [ %ind.end, %vec.epilog.middle.block ], [ %ind.end34, %vec.epilog.iter.check ], [ 1, %iter.check ]
%bc.resume.val37 = phi i64 [ %ind.end35, %vec.epilog.middle.block ], [ %ind.end36, %vec.epilog.iter.check ], [ 2, %iter.check ]
%bc.merge.rdx46 = phi half [ %32, %vec.epilog.middle.block ], [ %26, %vec.epilog.iter.check ], [ %14, %iter.check ]
br label %L42
L42: ; preds = %L42, %vec.epilog.scalar.ph
%33 = phi i64 [ %value_phi628, %L42 ], [ %bc.resume.val, %vec.epilog.scalar.ph ]
%value_phi628 = phi i64 [ %36, %L42 ], [ %bc.resume.val37, %vec.epilog.scalar.ph ]
%value_phi527 = phi half [ %37, %L42 ], [ %bc.merge.rdx46, %vec.epilog.scalar.ph ]
; ││││││││││┌ @ essentials.jl:13 within `getindex`
%34 = getelementptr inbounds half, half* %11, i64 %33
%35 = load half, half* %34, align 2
; ││││││││││└
; ││││││││││┌ @ int.jl:87 within `+`
%36 = add nuw nsw i64 %value_phi628, 1
; │││││││││└└
; │││││││││ @ reduce.jl:62 within `_foldl_impl`
; │││││││││┌ @ reduce.jl:86 within `BottomRF`
; ││││││││││┌ @ fastmath.jl:251 within `max_fast`
; │││││││││││┌ @ essentials.jl:621 within `ifelse`
%.inv20 = fcmp fast olt half %value_phi527, %35
%37 = select fast i1 %.inv20, half %35, half %value_phi527
; │││││││││└└└
; │││││││││ @ reduce.jl:60 within `_foldl_impl`
; │││││││││┌ @ array.jl:938 within `iterate`
; ││││││││││┌ @ int.jl:520 within `<` @ int.jl:513
%exitcond.not = icmp eq i64 %value_phi628, %8
; ││││││││││└
br i1 %exitcond.not, label %L55, label %L42
L55: ; preds = %L42, %vec.epilog.middle.block, %middle.block, %L20, %L19
%value_phi4 = phi half [ %9, %L19 ], [ %14, %L20 ], [ %26, %middle.block ], [ %32, %vec.epilog.middle.block ], [ %37, %L42 ]
; └└└└└└└└└└
ret half %value_phi4
}
So the problem is that those

I love that LLVM creates an intrinsic that it doesn't know how to lower.

This is llvm/llvm-project#59258, which got fixed in https://reviews.llvm.org/D139078. I saw that @maleadt was adding some patches so could we get this on as well?

I just finished rebuilding all of LLVM 😅

I'm so sorry

- llvm/llvm-project@af39acd closing #50448
- https://reviews.llvm.org/D139078 closing #49907
(cherry picked from commit 092231c)
On master I get
Complete output
It works fine with Julia 1.9.0.
Float32 and Float64 don't seem to be affected, and without @fastmath it also works for Float16.
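Since Float32 does not seem to be affected, one possible stopgap on an affected build might be to widen to Float32 before the fast-math reduction and narrow the result back afterwards. This is only a sketch under that assumption: the helper name `fastmax_f16` is made up for illustration and is not part of Base, it allocates a temporary Float32 copy, and it has not been checked on every affected platform.

```julia
# Hypothetical helper (not part of Base): do the fast-math reduction in
# Float32 so that no Float16 vector max has to be selected by the backend.
function fastmax_f16(x::AbstractArray{Float16}; init::Float16)
    # Widening Float16 -> Float32 is exact, and max only ever returns one of
    # its inputs, so converting the result back to Float16 loses nothing.
    wide = Float32.(x)                                  # temporary Float32 copy
    r = @fastmath reduce(max, wide; init = Float32(init))
    return Float16(r)
end

# Same inputs as the failing call in this issue.
fastmax_f16(Float16[1, 2, 3]; init = Float16(0))
```

The `init` keyword is required here on purpose: a default such as `-Inf16` would sit badly with the no-infinities assumption that `@fastmath` makes.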