[ITensors] Fix broken broadcast operation on GPU #1532

Merged: 19 commits merged on Oct 13, 2024

Conversation

@kmp5VT (Collaborator) commented Sep 19, 2024

Description

Aleix has found an issue which arises from the use of VectorInterface with GPU-based ITensors. The problem is that a broadcast expression
a .= a .* l .+ b .* k
is being called, and the anonymous function generated for it cannot be compiled on GPU. One way I could see to fix this is to call permute_dims!!; alternatively, I could split the operation into two broadcasts. Right now I have a patch in VectorInterface. Below is pseudocode that can be added to ITensors and tested with JLArrays once we have that integrated.

Simple Example

using ITensors: Index, random_itensor
using Metal: mtl
using NDTensors: tensor

elt = Float32
A = random_itensor(elt, Index(5))
A .= A .* elt(1.0) ## On CPU both broadcasts work
A .= A .* elt(1.0) .+ A .* elt(2.0)

mA = mtl(A) ## Move the ITensor to GPU (Metal)
mA .= mA .* elt(1.0) ## This works fine
mA .= mA .* elt(1.0) .+ mA .* elt(2.0) ## This fails with scalar indexing
mt = tensor(mA);
mt = mt .* elt(1.0) ## The same scaling applied to the underlying Tensor

Checklist:

  • Update the broadcast function
  • Add tests to the ITensorVectorInterfaceExt library, which can be tested later with JLArrays

@kmp5VT (Collaborator, Author) commented Sep 19, 2024

Hi @emstoudenmire and @abouco, I was able to make a minimal example of the issue. The code that is causing the issue is here:

map!((r, t) -> β * r + α * t, T, T, A)

@kmp5VT kmp5VT marked this pull request as draft September 19, 2024 23:31
@mtfishman (Member) commented Sep 22, 2024

@kmp5VT I remember we previously faced issues with the GPU compiler struggling with certain ITensor broadcast operations that were written in terms of anonymous functions/closures, for example see the changes to broadcast.jl in this PR: #1194.

It seems like you were able to do pretty modest rewrites of some of the ITensor broadcast functions to circumvent those issues, but I don't remember the details; perhaps something similar can be done here.

One of my suggestions at the time of #1194 was that you can always rewrite a closure as a callable struct (see the Julia docs on closures: https://docs.julialang.org/en/v1/devdocs/functions/#Closures), which may be easier for the GPU compiler to compile, if there isn't a simpler way to refactor the code to help the GPU compiler.
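
A minimal sketch of the callable-struct pattern described above; the Axpby name and layout are illustrative, not the code that was ultimately merged in this PR:

struct Axpby{A,B}
  α::A
  β::B
end
(f::Axpby)(r, t) = f.β * r + f.α * t

# Replaces the closure `(r, t) -> β * r + α * t`, with α and β stored as
# concrete (isbits) fields instead of captured variables:
# map!(Axpby(α, β), T, T, A)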

@kmp5VT (Collaborator, Author) commented Oct 7, 2024

@mtfishman Thank you for the suggestion, I know we had made a PR to fix broadcasts before and it was helpful to have that as a reference. I tried the pattern we used in #1194, but the GPU compiler still complains that the values α and β are not bits types. If I split the line into two broadcast calls, that seems to work with minimal overhead. Though if you think it would be better, I can modify the code and make a callable struct.
Thanks again!

@mtfishman (Member) commented Oct 7, 2024

I don't think we should split it into two lines because that is less performant (I expect what you wrote is roughly 2 times slower than the original version), so you will have to find another approach, whether that is making a struct that implements the closure or something else.

@mtfishman (Member) commented:

Worst case, we can have two code branches, one for CPU that leaves the code as it is and one for GPU that splits the broadcast call into two calls, but hopefully there is a way to rewrite it so that it involves just one broadcast call and works on CPU and GPU.
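
As a rough illustration of that worst-case option (the function name and the isa Array check are hypothetical, not code from this PR):

function fused_or_split!(T, A, α, β)
  if T isa Array
    T .= β .* T .+ α .* A # CPU: keep the fused single-pass broadcast
  else
    T .= β .* T # GPU fallback: split into two broadcast calls
    T .+= α .* A
  end
  return T
end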

@mtfishman (Member) commented Oct 7, 2024

Here is a demonstration that your new version will be slower (by roughly a factor of 1.5 in this case):

function f1!(T, A, α, β)
  T .= β .* T .+ α .* A # fused into a single broadcast kernel, one pass over T
  return T
end

function f2!(T, A, α, β)
  T .= β .* T # split version: two separate broadcasts, two passes over T
  T .+= α .* A
  return T
end

T = randn(10^8)
A = randn(10^8)
α = 2.0
β = 3.0

using BenchmarkTools: @btime
T1 = copy(T)
@btime f1!($T1, $A, $α, $β)
T2 = copy(T)
@btime f2!($T2, $A, $α, $β)

which outputs:

  27.130 ms (0 allocations: 0 bytes)
  42.336 ms (0 allocations: 0 bytes)

It doesn't use more memory since it reuses the existing allocated memory, but it does require an extra pass over T since it isn't fusing all of the operations into a single loop.

@kmp5VT (Collaborator, Author) commented Oct 8, 2024

@mtfishman This makes sense, as every element in T will have to be modified twice, which would be especially bad if β = 1. I ran @btime but didn't think to look at the actual time difference between the two calls on CPU. I will work on code that works on CPU and GPU without splitting the line up. Thanks!

src/broadcast.jl (review comments, resolved)
@mtfishman (Member) commented:

@kmp5VT it looks like the test failures are caused by Julia 1.11, which was just released and our tests automatically started testing against that.

Let's fix those issues in a separate PR from this one, and this and other open PRs can be based on those fixes once it is merged.

The ITensorMPS failure looks like it is caused by a change to the random number generator, in which case it can be fixed by switching to using StableRNGs.jl in the tests.

The NDTensors failure looks like some kind of slicing issue in NDTensors.BlockSparseArrays. That's not being used by end users anyway so it could be ignored for now; maybe we could just disable those tests for now (mark them as broken) so they don't produce so much noise in the CI log and confusing failures in PRs. Though potentially that failure in BlockSparseArrays caused by Julia 1.11 could cause issues for @ogauthe; as far as I know he is the only one really using that library right now (he's using it for his non-abelian fusion tensor work).

Julia 1.10 became the LTS (https://discourse.julialang.org/t/julia-v1-11-0-has-been-released-and-v1-10-is-now-lts/121064) so we can also change that to the oldest Julia version we support and just test against Julia 1.10 and 1.11, but that should probably be done in a PR of its own since that is something we will want to announce to users (I don't think it needs a breaking release of ITensor packages to change that but I have to think about that...).

@kmp5VT kmp5VT marked this pull request as ready for review October 9, 2024 01:43
@kmp5VT (Collaborator, Author) commented Oct 9, 2024

> @kmp5VT it looks like the test failures are caused by Julia 1.11, which was just released and our tests automatically started testing against that.

Oh I see, I was running on my Mac with Metal and see that the tests are passing. I am making an issue as a reminder to update our tests.

> The ITensorMPS failure looks like it is caused by a change to the random number generator, in which case it can be fixed by switching to using StableRNGs.jl in the tests.

Okay, I can try to do this as well in another PR. I can also open an issue as a reminder.

> The NDTensors failure looks like some kind of slicing issue in NDTensors.BlockSparseArrays. That's not being used by end users anyway so it could be ignored for now; maybe we could just disable those tests for now (mark them as broken) so they don't produce so much noise in the CI log and confusing failures in PRs. Though potentially that failure in BlockSparseArrays caused by Julia 1.11 could cause issues for @ogauthe; as far as I know he is the only one really using that library right now (he's using it for his non-abelian fusion tensor work).

This is very useful information, thanks!

@mtfishman (Member) commented:

@kmp5VT all of the test failures in this PR are in Julia 1.11; I see the Julia 1.6 tests pass. So I would be fine merging this once it is ready, say before #1539 is merged.

Is the original issue you were trying to address fixed? If so, could you add a test?

@kmp5VT (Collaborator, Author) commented Oct 9, 2024

@mtfishman yes, the original issue is fixed now. It looks like there is a test here in the ITensor test suite, but the issue was related to using GPUs with ITensor. Currently we are not running the ITensors test suite on GPUs, so I am not really sure where to put a test. Do you have a suggestion?

@mtfishman (Member) commented:

Thanks for the reminder that we're not running the ITensors.jl tests on GPU right now.

map!(axpby(α, β), T, T, A) gets lowered to some call in NDTensors.jl, probably a call to permutedims!. Can you add a test of that permutedims! call in the NDTensors.jl tests?
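A rough sketch of what such a test could look like at the plain-array level, using JLArrays as a CPU-based GPU stand-in; the Axpby helper mirrors the earlier callable-struct sketch and is an assumption, not the test that was actually added:

using Test
using JLArrays: JLArray

# Hypothetical callable struct standing in for the problematic closure.
struct Axpby{S}
  α::S
  β::S
end
(f::Axpby)(r, t) = f.β * r + f.α * t

T = JLArray(randn(Float32, 10))
A = JLArray(randn(Float32, 10))
expected = 3f0 .* Array(T) .+ 2f0 .* Array(A)
map!(Axpby(2f0, 3f0), T, T, A) # the previously failing fused update, closure-free
@test Array(T) ≈ expected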

kmp5VT and others added 4 commits October 12, 2024 13:16
Co-authored-by: Matt Fishman <mtfishman@users.noreply.github.com>
Co-authored-by: Matt Fishman <mtfishman@users.noreply.github.com>
src/broadcast.jl (review comment, resolved)
@mtfishman mtfishman changed the title [ITensors][Bug] Broadcast function in ITensors fails to compile on GPU [ITensors] Broadcast function in ITensors fails to compile on GPU Oct 13, 2024
@mtfishman mtfishman changed the title [ITensors] Broadcast function in ITensors fails to compile on GPU [ITensors] Fix broadcast operation on GPU Oct 13, 2024
@mtfishman mtfishman changed the title [ITensors] Fix broadcast operation on GPU [ITensors] Fix broken broadcast operation on GPU Oct 13, 2024
@mtfishman (Member) commented:

Note that the original issue may be related to more general issues that closures can have in Julia:

i.e. closures can have some type stability issues in certain cases, and certain GPU backends may have trouble compiling functions with type instabilities. Just pointing that out to potentially give some explanation and broader context for the original issue. Using a let block or FastClosures.jl could be alternative solutions for future issues like this, since they may be more general/automated than the current solution of making a custom callable object.
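
For example, a minimal sketch of the let-block pattern (illustrative only; the function name apply_axpby! is hypothetical):

function apply_axpby!(T, A, α, β)
  # Rebinding α and β in a `let` block gives the closure fresh local bindings
  # to capture, which can avoid boxing and the resulting type instability;
  # FastClosures.jl's @closure macro automates this kind of rewrite.
  f = let α = α, β = β
    (r, t) -> β * r + α * t
  end
  map!(f, T, T, A)
  return T
end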

@mtfishman mtfishman merged commit 98a7724 into ITensor:main Oct 13, 2024
15 checks passed
corbett5 pushed a commit to corbett5/ITensors.jl that referenced this pull request Nov 4, 2024