Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quite different performance of pairwise on CPU vs GPU #143

Open
xukai92 opened this issue Aug 15, 2019 · 6 comments
Open

Quite different performance of pairwise on CPU vs GPU #143

xukai92 opened this issue Aug 15, 2019 · 6 comments

Comments

@xukai92
Copy link

xukai92 commented Aug 15, 2019

I found the pairwise with SqEuclidean is faster than my own implementation on CPU but slower on GPU. Any idea why and possible optimization on Distances.jl side?

MWE:

using Test, BenchmarkTools, Distances, CuArrays

function pairwise_dot_kai(x)
    d, n = size(x)
    xixj = x' * x
    xsq = sum(x .^ 2; dims=1)
    return repeat(xsq, n, 1) + repeat(xsq', 1, n) - 2xixj
end

pairwise_dot(x) = pairwise(SqEuclidean(), x; dims=2)

xbench = randn(Float32, 784, 200);

@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial: 
  memory estimate:  1.52 MiB
  allocs estimate:  17
  --------------
  minimum time:     854.227 μs (0.00% GC)
  median time:      1.183 ms (0.00% GC)
  mean time:        1.361 ms (12.37% GC)
  maximum time:     125.259 ms (98.46% GC)
  --------------
  samples:          3662
  evals/sample:     1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial: 
  memory estimate:  166.59 KiB
  allocs estimate:  204
  --------------
  minimum time:     359.751 μs (0.00% GC)
  median time:      406.615 μs (0.00% GC)
  mean time:        458.925 μs (6.46% GC)
  maximum time:     104.066 ms (99.27% GC)
  --------------
  samples:          10000
  evals/sample:     1
xbench = xbench |> cu;
@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial: 
  memory estimate:  1.20 MiB
  allocs estimate:  19424
  --------------
  minimum time:     19.042 ms (0.00% GC)
  median time:      20.028 ms (0.00% GC)
  mean time:        21.811 ms (3.62% GC)
  maximum time:     52.425 ms (38.23% GC)
  --------------
  samples:          230
  evals/sample:     1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial: 
  memory estimate:  10.99 MiB
  allocs estimate:  240635
  --------------
  minimum time:     453.229 ms (0.00% GC)
  median time:      470.074 ms (0.00% GC)
  mean time:        474.353 ms (2.67% GC)
  maximum time:     499.969 ms (6.04% GC)
  --------------
  samples:          11
  evals/sample:     1
@xukai92
Copy link
Author

xukai92 commented Aug 15, 2019

PS: the GPU support of pairwise is based the branch of this PR #142

@nalimilan
Copy link
Member

pairwise uses an explicit loop, while your implementation calls higher-level implementations like * and broadcast. I guess these are optimized for CUDA. Not sure what to do about this, since 1) pairwise works for many distances, some of which might not be rewritten using high-level operations and 2) Distances.jl shouldn't depend on CuArrays. Maybe we can add special methods once optional dependencies are supported in Julia.

@johnnychen94
Copy link
Contributor

Maybe we can add special methods once optional dependencies are supported in Julia.

I think Requires.jl provides this feature?

@KristofferC
Copy link
Member

Requires.jl makes loading this package 20x slower (#123 (comment)).

@johnnychen94
Copy link
Contributor

Requires.jl makes loading this package 20x slower

Technically, it's Tables.jl. In my laptop, Requires.jl slows loading this package by 4-6x

# with Requires.jl
julia> @time using Distances
  0.276511 seconds (403.27 k allocations: 21.736 MiB, 6.36% gc time)

# without Requires.jl
julia> @time using Distances
  0.063862 seconds (63.23 k allocations: 3.801 MiB)

But Requires.jl is still a heavy dependency compared to Distances.jl. I guess what we can possibly do now is to create a larger package CudaDistances.jl, accelerate codes for CuArray, and reexport Distances.jl in that package.

@nalimilan
Copy link
Member

One solution to this would be to choose the best algorithm based on traits like those provided by ArrayInterface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants