Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Vectorization implementation for GPU #223

Closed
wants to merge 3 commits into from

Conversation

johnnychen94
Copy link
Contributor

@johnnychen94 johnnychen94 commented Jun 12, 2021

One might even come up with a threaded reduce version for this, but that definitely belongs to future work.

@time using Distances
# PR: 0.284294 seconds (232.28 k allocations: 14.743 MiB, 30.94% compilation time)
# master: 0.051195 seconds (102.74 k allocations: 6.502 MiB, 85.42% compilation time)
using CUDA

X_cu = CUDA.randn(2^20);
Y_cu = CUDA.randn(2^20);

X = randn(2^20);
Y = randn(2^20);

@btime sqeuclidean($X, $Y)
# PR: 444.933 μs (0 allocations: 0 bytes)
# master: 418.690 μs (0 allocations: 0 bytes)
# almost a benchmark noisy

@btime sqeuclidean($X_cu, $Y_cu)
# GTX 3090
# PR: 51.827 μs (207 allocations: 5.64 KiB)
# master: take decades because of the scalar indexing

I did not expose this trait dispatching to the API level because I'm not sure if the user really wants this. Users might only want to control the computational device(e.g., GPU, CPU, CPU threads, CPU distributed) and may not want to control the under the hood implementation details.


from @KristofferC in #143 (comment)

Requires.jl makes loading this package 20x slower (#123 (comment)).

I still feel the runtime benefit worths it; not to mention that GPU support is basically unavailable before. Given that a lot of packages depend on ArrayInterface, adding this dependency won't add much of the loading time in any real project. (From juliahub results: ArrayInterface is depended by 692 packages while Distances is defended by 474 packages.)

julia> @time using ArrayInterface
  0.228708 seconds (224.97 k allocations: 13.884 MiB, 26.09% compilation time)

julia> @time using Distances
  0.014110 seconds (6.62 k allocations: 604.219 KiB)

Edit:

From @nalimilan in #143 (comment)

One solution to this would be to choose the best algorithm based on traits like those provided by ArrayInterface.

ArrayInterface requires Julia 1.2. I'll let you guys decide whether 1) we want to instead depend on Requires, or 2) directly bump the Julia lower bound here and drop Julia 1.0 and 1.1 support, or even 3) make a new package DistancesCUDA :(

closes #137

@codecov-commenter
Copy link

codecov-commenter commented Jun 12, 2021

Codecov Report

Merging #223 (d7d31cf) into master (b52f0a1) will decrease coverage by 1.23%.
The diff coverage is 57.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #223      +/-   ##
==========================================
- Coverage   98.26%   97.02%   -1.24%     
==========================================
  Files           8        8              
  Lines         690      705      +15     
==========================================
+ Hits          678      684       +6     
- Misses         12       21       +9     
Impacted Files Coverage Δ
src/metrics.jl 95.83% <52.63%> (-2.45%) ⬇️
src/generic.jl 98.64% <100.00%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b52f0a1...d7d31cf. Read the comment docs.

Project.toml Outdated Show resolved Hide resolved
Comment on lines +226 to +227
for M in (metrics..., weightedmetrics...)
@eval @inline (dist::$M)(a, b) = _evaluate(dist, a, b)
Copy link
Contributor Author

@johnnychen94 johnnychen94 Jun 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop is moved from L327-L329.

Comment on lines +385 to +392
Base.@propagate_inbounds function eval_start(d::Chebyshev, a, b)
T = result_type(d, a, b)
if any(isnan, a) || any(isnan, b)
return convert(T, NaN)
else
zero(T) # lower bound of chebyshev distance
end
end
Copy link
Contributor Author

@johnnychen94 johnnychen94 Jun 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rewritten to support CuArray; scalar indexing first is slow for CuArray.

@KristofferC
Copy link
Member

KristofferC commented Jun 12, 2021

A quite significant effort has been made to make adding this package as a dependency a no brainer by making it have few dependencies and thus a fast load time. Unfortunately (as you mentioned), looking at the dependency here, a 10x load time increase is observed:

julia> @time using ArrayInterface
  0.184645 seconds (143.03 k allocations: 9.379 MiB, 2.25% compilation time)
julia> @time using Distances
  0.017180 seconds (18.09 k allocations: 1.566 MiB, 30.99% compilation time)

Some possible suggestions:

  • Is it possible to profile ArrayInterface and make its load time in the same ballpark as Distances?
  • Is it possible to implement GPU support using standard Julia constructs like broadcasting?
  • Perhaps this code should go into one of the GPU libraries to not make everyone absorb this increased load time.

@johnnychen94
Copy link
Contributor Author

johnnychen94 commented Jun 12, 2021

Is it possible to profile ArrayInterface and make its load time in the same ballpark as Distances?

No, as far as I've understood. ArrayInterface heavily uses Requires.jl

Is it possible to implement GPU support using standard Julia constructs like broadcasting?

Yes, if you are referring to the vectorization version I just added here. But it's still slower in CPU due to the additional memory allocation for intermediate results.

Perhaps this code should go into one of the GPU libraries to not make everyone absorb this increased load time.

The only place that requires ArrayInterface is infer_evaluate_strategy, we could only provide a fallback version here to unconditionally redirect to ScalarMapReduce, i.e., infer_evaluate_strategy(d::PreMetric, a, b) = ScalarMapReduce(). Then extends this method in other packages (e.g., DistancesCUDA.jl)

But I really don't think there's a need to be so conservative. For example, if we introduce LoopVectorization here and bring CPU runtime say >1.2x faster, does the overhead in loading time really matters? (Note that LoopVectorization depends on ArrayInterface)

@KristofferC
Copy link
Member

Looking a bit more at ArrayInterface it feels like it is way out of scope to add it to something so stable and lean as Distances.jl. For example, it does a lot of type piracy. I think it makes sense to add this to one of the packages that are in the "ArrayInterface" ecosystem (with possible modifications to Distance.jl to allow that to happen).

@nalimilan
Copy link
Member

It's too bad that ArrayInterface uses Requires.jl. I wish it would be a very lightweight package that others depend on, rather than the contrary. Or move the basic definitions to ArrayInterfaceBase so that packages can depend on that.

@johnnychen94
Copy link
Contributor Author

Or move the basic definitions to ArrayInterfaceBase so that packages can depend on that.

I prefer to have something like DistancesEx.jl in JuliaStats that allows heavier dependency for speed and glue purposes. Possible efforts to bring LoopVectorization support can be made there as well.

Or maybe we can just work in the same Distances repo /DistancesEx as a sub-dir package like SnoopCompile does.

@devmotion
Copy link
Member

If the main goal is GPU support, maybe one could also just create a GPU-specific DistancesCUDA package, similar to https://github.com/FluxML/NNlibCUDA.jl, with implementations of _evaluate(::UnionMetrics, a::CUDA.CuArray, b::CUDA.CuArray) etc.

@johnnychen94
Copy link
Contributor Author

Close it as ArrayInterface is too heavy a dependency for Distances. Without using ArrayInterface, it would be better to work out a DistancesCUDA.jl package with tweaked CUDA kernels.

@johnnychen94 johnnychen94 deleted the jc/vec branch June 14, 2021 08:30
@nikopj nikopj mentioned this pull request Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

how to calculate Euclidean distance using GPU ?
5 participants