Float16 performance regression on Mac M1 #45542

Closed
ctkelley opened this issue Jun 1, 2022 · 2 comments
ctkelley commented Jun 1, 2022

Float16 performance has gotten much worse with 1.8.0-rc1 on M1 Macs.

In 1.7.2

julia> A=rand(Float16,1024,1024);

julia> @btime lu!($A);
  234.959 ms (1 allocation: 8.12 KiB)

In 1.8.0-rc1

julia> A=rand(Float16,1024,1024);

julia> @btime lu!($A);
  8.593 s (1 allocation: 8.12 KiB)

Any chance of fixing this?
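
For reference, a minimal self-contained version of the reproducer (a sketch; it assumes BenchmarkTools.jl is installed, and lu! comes from LinearAlgebra):

using LinearAlgebra      # lu!
using BenchmarkTools     # @btime

A = rand(Float16, 1024, 1024)
# give each timing run a fresh copy so it always factors an unfactored matrix
@btime lu!(B) setup=(B = copy($A)) evals=1;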

vchuravy added the performance (Must go faster), regression (Regression in behavior compared to a previous version), and float16 labels on Jun 1, 2022
maleadt (Member) commented Jun 2, 2022

1.7:

julia> @code_llvm Float16(1)*Float16(1)
;  @ float.jl:403 within `*`
define half @"julia_*_117"(half %0, half %1) #0 {
top:
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fmul float %2, %3
  %5 = fptrunc float %4 to half
  ret half %5
}

julia> @code_native Float16(1)*Float16(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:403 within `*`
	fcvt	s0, h0
	fcvt	s1, h1
	fmul	s0, s0, s1
	fcvt	h0, s0
	ret
; └

1.8:

julia> @code_llvm Float16(1)*Float16(1)
;  @ float.jl:385 within `*`
define half @"julia_*_211"(half %0, half %1) #0 {
top:
  %2 = bitcast half %0 to i16
  %3 = call float @julia__gnu_h2f_ieee(i16 %2)
  %4 = bitcast half %1 to i16
  %5 = call float @julia__gnu_h2f_ieee(i16 %4)
  %6 = fmul float %3, %5
  %7 = call i16 @julia__gnu_f2h_ieee(float %6)
  %8 = bitcast i16 %7 to half
  ret half %8
}

julia> @code_native Float16(1)*Float16(1)
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	"_julia_*_235"                  ; -- Begin function julia_*_235
	.p2align	2
"_julia_*_235":                         ; @"julia_*_235"
; ┌ @ float.jl:385 within `*`
	.cfi_startproc
; %bb.0:                                ; %top
	stp	d9, d8, [sp, #-32]!             ; 16-byte Folded Spill
	stp	x29, x30, [sp, #16]             ; 16-byte Folded Spill
	.cfi_def_cfa_offset 32
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	.cfi_offset b8, -24
	.cfi_offset b9, -32
	mov	v8.16b, v1.16b
                                        ; kill: def $h0 killed $h0 def $s0
	fmov	w0, s0
	bl	_julia__gnu_h2f_ieee
	mov	v9.16b, v0.16b
	fmov	w0, s8
	bl	_julia__gnu_h2f_ieee
	fmul	s0, s9, s0
	bl	_julia__gnu_f2h_ieee
	fmov	s0, w0
                                        ; kill: def $h0 killed $h0 killed $s0
	ldp	x29, x30, [sp, #16]             ; 16-byte Folded Reload
	ldp	d9, d8, [sp], #32               ; 16-byte Folded Reload
	ret
	.cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols

Both are native aarch64 builds. Bisected to (the back-port of) #45249:

eb82f1846fdc2d195ab1031fe192b27fd171f148 is the first bad commit

    codegen: explicitly handle Float16 intrinsics (#45249)

    Fixes #44829, until llvm fixes the support for these intrinsics itself

    Also need to handle vectors, since the vectorizer may have introduced them.

    Also change our runtime emulation versions to f32 for consistency.

    (cherry picked from commit f2c627ef8af37c3cf94c19a5403bc6cd796d5031)
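
The IR above is the whole story: on 1.8 every scalar Float16 multiply goes through the runtime conversion calls julia__gnu_h2f_ieee / julia__gnu_f2h_ieee instead of the inline fcvt instructions 1.7 emitted, so anything doing Float16 arithmetic element by element (like the generic lu!) pays a function call per operation. The same effect can be seen without lu! (a sketch, not part of the original bisect; assumes BenchmarkTools.jl):

using BenchmarkTools

x = rand(Float16, 1024, 1024);
y = rand(Float16, 1024, 1024);

# elementwise multiply exercises the same scalar Float16 * path shown above
@btime $x .* $y;

# the codegen difference is also visible directly, without timing anything
@code_llvm debuginfo=:none Float16(1) * Float16(1)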

vchuravy (Member) commented:
This is now fixed on 1.8-rc and master.
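
A quick way to confirm on a fixed build (a sketch, not from the thread): the generated IR should again be plain fpext/fmul/fptrunc with no @julia__gnu_h2f_ieee calls, and the lu! timing should be back in the 1.7.2 ballpark.

julia> @code_llvm Float16(1) * Float16(1)    # expect fpext/fmul/fptrunc, no runtime calls

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(Float16, 1024, 1024);

julia> @btime lu!($A);                       # expect hundreds of ms, not seconds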
