Float16 performance regression on Mac M1 #45542

Closed
ctkelley opened this issue Jun 1, 2022 · 2 comments
ctkelley commented Jun 1, 2022

Float16 performance has gotten much worse with 1.8.0-rc1 on M1 Macs.

In 1.7.2

julia> A=rand(Float16,1024,1024);

julia> @btime lu!($A);
  234.959 ms (1 allocation: 8.12 KiB)

In 1.8.0-rc1

julia> A=rand(Float16,1024,1024);

julia> @btime lu!($A);
  8.593 s (1 allocation: 8.12 KiB)

Any chance of fixing this?
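
For reference, a minimal self-contained version of the reproducer (a sketch; it assumes BenchmarkTools.jl is installed, and lu! comes from LinearAlgebra):

using LinearAlgebra      # lu!
using BenchmarkTools     # @btime

A = rand(Float16, 1024, 1024)
# give each timing run a fresh copy so it always factors an unfactored matrix
@btime lu!(B) setup=(B = copy($A)) evals=1;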

vchuravy added the performance (Must go faster), regression (Regression in behavior compared to a previous version), and float16 labels on Jun 1, 2022
maleadt (Member) commented Jun 2, 2022

1.7:

julia> @code_llvm Float16(1)*Float16(1)
;  @ float.jl:403 within `*`
define half @"julia_*_117"(half %0, half %1) #0 {
top:
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fmul float %2, %3
  %5 = fptrunc float %4 to half
  ret half %5
}

julia> @code_native Float16(1)*Float16(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ float.jl:403 within `*`
	fcvt	s0, h0
	fcvt	s1, h1
	fmul	s0, s0, s1
	fcvt	h0, s0
	ret
; └

1.8:

julia> @code_llvm Float16(1)*Float16(1)
;  @ float.jl:385 within `*`
define half @"julia_*_211"(half %0, half %1) #0 {
top:
  %2 = bitcast half %0 to i16
  %3 = call float @julia__gnu_h2f_ieee(i16 %2)
  %4 = bitcast half %1 to i16
  %5 = call float @julia__gnu_h2f_ieee(i16 %4)
  %6 = fmul float %3, %5
  %7 = call i16 @julia__gnu_f2h_ieee(float %6)
  %8 = bitcast i16 %7 to half
  ret half %8
}

julia> @code_native Float16(1)*Float16(1)
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	"_julia_*_235"                  ; -- Begin function julia_*_235
	.p2align	2
"_julia_*_235":                         ; @"julia_*_235"
; ┌ @ float.jl:385 within `*`
	.cfi_startproc
; %bb.0:                                ; %top
	stp	d9, d8, [sp, #-32]!             ; 16-byte Folded Spill
	stp	x29, x30, [sp, #16]             ; 16-byte Folded Spill
	.cfi_def_cfa_offset 32
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	.cfi_offset b8, -24
	.cfi_offset b9, -32
	mov	v8.16b, v1.16b
                                        ; kill: def $h0 killed $h0 def $s0
	fmov	w0, s0
	bl	_julia__gnu_h2f_ieee
	mov	v9.16b, v0.16b
	fmov	w0, s8
	bl	_julia__gnu_h2f_ieee
	fmul	s0, s9, s0
	bl	_julia__gnu_f2h_ieee
	fmov	s0, w0
                                        ; kill: def $h0 killed $h0 killed $s0
	ldp	x29, x30, [sp, #16]             ; 16-byte Folded Reload
	ldp	d9, d8, [sp], #32               ; 16-byte Folded Reload
	ret
	.cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols

Both are native aarch64 builds. Bisected to (the back-port of) #45249:

eb82f1846fdc2d195ab1031fe192b27fd171f148 is the first bad commit

    codegen: explicitly handle Float16 intrinsics (#45249)

    Fixes #44829, until llvm fixes the support for these intrinsics itself

    Also need to handle vectors, since the vectorizer may have introduced them.

    Also change our runtime emulation versions to f32 for consistency.

    (cherry picked from commit f2c627ef8af37c3cf94c19a5403bc6cd796d5031)
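
The IR above is the whole story: on 1.8 every scalar Float16 multiply goes through the runtime conversion calls julia__gnu_h2f_ieee / julia__gnu_f2h_ieee instead of the inline fcvt instructions 1.7 emitted, so anything doing Float16 arithmetic element by element (like the generic lu!) pays a function call per operation. The same effect can be seen without lu! (a sketch, not part of the original bisect; assumes BenchmarkTools.jl):

using BenchmarkTools

x = rand(Float16, 1024, 1024);
y = rand(Float16, 1024, 1024);

# elementwise multiply exercises the same scalar Float16 * path shown above
@btime $x .* $y;

# the codegen difference is also visible directly, without timing anything
@code_llvm debuginfo=:none Float16(1) * Float16(1)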

vchuravy (Member) commented:
This is now fixed on 1.8-rc and master.
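
A quick way to confirm on a fixed build (a sketch, not from the thread): the generated IR should again be plain fpext/fmul/fptrunc with no @julia__gnu_h2f_ieee calls, and the lu! timing should be back in the 1.7.2 ballpark.

julia> @code_llvm Float16(1) * Float16(1)    # expect fpext/fmul/fptrunc, no runtime calls

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(Float16, 1024, 1024);

julia> @btime lu!($A);                       # expect hundreds of ms, not seconds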
