Quite different performance of pairwise on CPU vs GPU #143

xukai92 · 2019-08-15T22:49:45Z

I found that pairwise with SqEuclidean is faster than my own implementation on the CPU (about 0.36 ms vs. 0.85 ms minimum time) but much slower on the GPU (about 453 ms vs. 19 ms). Any idea why, and is there a possible optimization on the Distances.jl side?

MWE:

using Test, BenchmarkTools, Distances, CuArrays

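# Squared Euclidean distances between the columns of x, using
# ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * dot(xi, xj)
function pairwise_dot_kai(x)
    d, n = size(x)
    xixj = x' * x                  # n-by-n matrix of inner products
    xsq = sum(x .^ 2; dims=1)      # 1-by-n row of squared column norms
    return repeat(xsq, n, 1) + repeat(xsq', 1, n) - 2xixj
end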

pairwise_dot(x) = pairwise(SqEuclidean(), x; dims=2)

xbench = randn(Float32, 784, 200);

@benchmark pairwise_dot_kai(xbench)

BenchmarkTools.Trial: 
  memory estimate:  1.52 MiB
  allocs estimate:  17
  --------------
  minimum time:     854.227 μs (0.00% GC)
  median time:      1.183 ms (0.00% GC)
  mean time:        1.361 ms (12.37% GC)
  maximum time:     125.259 ms (98.46% GC)
  --------------
  samples:          3662
  evals/sample:     1

@benchmark pairwise_dot(xbench)

BenchmarkTools.Trial: 
  memory estimate:  166.59 KiB
  allocs estimate:  204
  --------------
  minimum time:     359.751 μs (0.00% GC)
  median time:      406.615 μs (0.00% GC)
  mean time:        458.925 μs (6.46% GC)
  maximum time:     104.066 ms (99.27% GC)
  --------------
  samples:          10000
  evals/sample:     1

xbench = xbench |> cu;
@benchmark pairwise_dot_kai(xbench)

BenchmarkTools.Trial: 
  memory estimate:  1.20 MiB
  allocs estimate:  19424
  --------------
  minimum time:     19.042 ms (0.00% GC)
  median time:      20.028 ms (0.00% GC)
  mean time:        21.811 ms (3.62% GC)
  maximum time:     52.425 ms (38.23% GC)
  --------------
  samples:          230
  evals/sample:     1

@benchmark pairwise_dot(xbench)

BenchmarkTools.Trial: 
  memory estimate:  10.99 MiB
  allocs estimate:  240635
  --------------
  minimum time:     453.229 ms (0.00% GC)
  median time:      470.074 ms (0.00% GC)
  mean time:        474.353 ms (2.67% GC)
  maximum time:     499.969 ms (6.04% GC)
  --------------
  samples:          11
  evals/sample:     1

xukai92 · 2019-08-15T22:50:28Z

PS: the GPU support for pairwise is based on the branch from PR #142.

nalimilan · 2019-08-20T11:53:01Z

pairwise uses an explicit loop, while your implementation calls higher-level operations like * and broadcast, which I guess are optimized for CUDA. I'm not sure what to do about this, since 1) pairwise works for many distances, some of which might not be expressible with high-level operations, and 2) Distances.jl shouldn't depend on CuArrays. Maybe we can add special methods once optional dependencies are supported in Julia.
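
To make the contrast concrete, here is a minimal sketch of the two styles for squared Euclidean distances between columns. The function names `sqeuclidean_loop` and `sqeuclidean_gemm` are hypothetical and this is not the actual Distances.jl code; it only illustrates why one form maps well onto GPU array libraries and the other does not.

```julia
# Loop formulation: one scalar reduction per (i, j) pair.
# Generalizes to any distance, but relies on scalar indexing,
# which is typically very slow on GPU arrays.
function sqeuclidean_loop(x::AbstractMatrix)
    d, n = size(x)
    r = zeros(eltype(x), n, n)
    for j in 1:n, i in 1:n
        s = zero(eltype(x))
        for k in 1:d
            s += abs2(x[k, i] - x[k, j])
        end
        r[i, j] = s
    end
    return r
end

# Matrix formulation, using
# ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * dot(xi, xj).
# Only needs a matrix product, a reduction, and broadcasting.
function sqeuclidean_gemm(x::AbstractMatrix)
    xsq = sum(abs2, x; dims=1)            # 1-by-n row of squared column norms
    return xsq .+ xsq' .- 2 .* (x' * x)   # broadcasting instead of repeat
end

x = randn(Float64, 5, 4)
@assert isapprox(sqeuclidean_loop(x), sqeuclidean_gemm(x); atol=1e-10)
```

Note that the matrix formulation can return tiny negative values where the true distance is zero, because of floating-point cancellation, and it only applies to distances that decompose this way.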

johnnychen94 · 2019-08-20T12:05:49Z

> Maybe we can add special methods once optional dependencies are supported in Julia.

I think Requires.jl provides this feature?

KristofferC · 2019-08-20T12:17:47Z

Requires.jl makes loading this package 20x slower (#123 (comment)).

johnnychen94 · 2019-08-20T12:38:44Z

> Requires.jl makes loading this package 20x slower

Technically, it's Tables.jl. On my laptop, Requires.jl slows loading this package by 4-6x:

# with Requires.jl
julia> @time using Distances
  0.276511 seconds (403.27 k allocations: 21.736 MiB, 6.36% gc time)

# without Requires.jl
julia> @time using Distances
  0.063862 seconds (63.23 k allocations: 3.801 MiB)

But Requires.jl is still a heavy dependency relative to Distances.jl. I guess what we can do for now is create a larger package, CudaDistances.jl, that adds accelerated code paths for CuArray and reexports Distances.jl.

nalimilan · 2020-09-11T07:39:28Z

One solution to this would be to choose the best algorithm based on traits like those provided by ArrayInterface.
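
A rough sketch of what trait-based selection could look like, using only Base Julia. The trait types `LoopAlgorithm` and `MatmulAlgorithm`, the function `pairwise_algorithm`, and the rule "plain `Array` uses the loop, everything else uses the matrix form" are all assumptions made for illustration; they are not part of Distances.jl or ArrayInterface, and a real implementation would presumably query a package-provided trait instead.

```julia
# Hypothetical algorithm traits (not from Distances.jl or ArrayInterface).
struct LoopAlgorithm end
struct MatmulAlgorithm end

# Assumption: plain `Array`s have cheap scalar indexing; other array
# types (e.g. GPU arrays) are routed to the high-level formulation.
pairwise_algorithm(::Array) = LoopAlgorithm()
pairwise_algorithm(::AbstractArray) = MatmulAlgorithm()

# Entry point: pick the algorithm from the array type, then dispatch on it.
sqeuclidean_pairwise(x::AbstractMatrix) =
    sqeuclidean_pairwise(pairwise_algorithm(x), x)

# Explicit loop over column pairs, using scalar indexing.
function sqeuclidean_pairwise(::LoopAlgorithm, x::AbstractMatrix)
    n = size(x, 2)
    r = zeros(eltype(x), n, n)
    for j in 1:n, i in 1:n
        r[i, j] = sum(abs2(x[k, i] - x[k, j]) for k in axes(x, 1))
    end
    return r
end

# Matrix product plus broadcasting, with no scalar indexing.
function sqeuclidean_pairwise(::MatmulAlgorithm, x::AbstractMatrix)
    xsq = sum(abs2, x; dims=1)
    return xsq .+ xsq' .- 2 .* (x' * x)
end

x = randn(4, 3)
@assert sqeuclidean_pairwise(x) ≈ sqeuclidean_pairwise(MatmulAlgorithm(), x)
```

With this layout, Distances.jl itself would not need to depend on a GPU package: the choice is made from a property of the array type, and distances without a high-level formulation can keep falling back to the loop.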

johnnychen94 mentioned this issue Jun 12, 2021

add Vectorization implementation for GPU #223

Closed

rossviljoen mentioned this issue Aug 2, 2021

GPU Support JuliaGaussianProcesses/ApproximateGPs.jl#15

Open

3 tasks

devmotion mentioned this issue Oct 19, 2021

Issue with Distances.jl JuliaGaussianProcesses/KernelFunctions.jl#380

Open

This was referenced Oct 19, 2021

[WIP] Use Tullio for pairwise distances JuliaGaussianProcesses/KernelFunctions.jl#385

Closed

[WIP] Use Tullio for pairwise distances JuliaGaussianProcesses/KernelFunctions.jl#386

Draft
