Cuda heat example w quaditer #913

Draft
wants to merge 156 commits into
base: master
Changes from 10 commits
a979fb2
Initial ideas
KnutAM Jan 11, 2024
298158c
Working implementation
KnutAM Jan 11, 2024
1794db3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 24, 2024
51ab4f2
Add static values version and improve interface
KnutAM Feb 24, 2024
22a7377
Add dev example and test
KnutAM Feb 24, 2024
18377f3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 28, 2024
27a3a96
Add StaticCellValues without stored cell coordinates
KnutAM Feb 28, 2024
95b5729
initial ideas
Abdelrahman912 May 7, 2024
d4e881d
minor changes
Abdelrahman912 May 14, 2024
f55b878
Merge branch 'Ferrite-FEM:master' into cuda-heat-example-w-quaditer
Abdelrahman912 May 14, 2024
c1ef6ad
add some abstractions
Abdelrahman912 May 23, 2024
394ac6a
add minor comment
Abdelrahman912 May 23, 2024
1f0df67
add z dierction for numerical integration
Abdelrahman912 May 30, 2024
3152042
add Float32
Abdelrahman912 Jun 4, 2024
aac5994
minor fix
Abdelrahman912 Jun 4, 2024
142f89a
init coloring implementation
Abdelrahman912 Jun 18, 2024
eaff534
init working on the assembler
Abdelrahman912 Jun 19, 2024
ffdc341
init gpu_assembler
Abdelrahman912 Jun 20, 2024
59595e8
implement naive gpu_assembler
Abdelrahman912 Jun 20, 2024
0e3cb21
minor fix
Abdelrahman912 Jun 20, 2024
687141d
use CuSparseMatrixCSC in assembler
Abdelrahman912 Jun 26, 2024
11d5a01
minor fix
Abdelrahman912 Jun 26, 2024
d5c951c
minor fix
Abdelrahman912 Jun 26, 2024
f4272a6
hoist dh, cellvalues, assembler outside the cuda loop
Abdelrahman912 Jun 26, 2024
d5cf949
add run_gpu macro
Abdelrahman912 Jun 26, 2024
2e52de1
init using int32 instead of int64 to reduce number of registers
Abdelrahman912 Jul 3, 2024
2cd0168
finish use int32
Abdelrahman912 Jul 3, 2024
54922ab
stupid way to circumvent rubbish values
Abdelrahman912 Jul 4, 2024
9406ff9
add discorse ref
Abdelrahman912 Jul 4, 2024
8fedba5
add ncu benchmark
Abdelrahman912 Jul 4, 2024
8bd417a
fix error in benchmark and add ref.
Abdelrahman912 Jul 4, 2024
abf11b6
set the code for debugging
Abdelrahman912 Jul 8, 2024
4f85cf5
init test
Abdelrahman912 Jul 8, 2024
4935b70
fix adapt issue
Abdelrahman912 Jul 8, 2024
188cceb
remove unnecessary cushow
Abdelrahman912 Jul 8, 2024
9c904e4
add heat equation main test set
Abdelrahman912 Jul 8, 2024
06432db
remove unncessary comments
Abdelrahman912 Jul 8, 2024
a67caaa
add nsys benchmark
Abdelrahman912 Jul 8, 2024
ecee17f
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Jul 8, 2024
60edda9
fix some issues regarding the merge
Abdelrahman912 Jul 8, 2024
063ff7a
minor fix
Abdelrahman912 Jul 8, 2024
9206be3
remove nsight files
Abdelrahman912 Jul 8, 2024
1eeb568
minor fix
Abdelrahman912 Jul 8, 2024
5e339a0
add comments
Abdelrahman912 Jul 8, 2024
204f3be
minor fix
Abdelrahman912 Jul 8, 2024
0f2e6b7
add comments
Abdelrahman912 Jul 8, 2024
7100e0a
fix for CI
Abdelrahman912 Jul 8, 2024
f129449
fix for CI
Abdelrahman912 Jul 8, 2024
618adb5
CI fix
Abdelrahman912 Jul 8, 2024
78f120c
ci
Abdelrahman912 Jul 8, 2024
4971cba
minor fix
Abdelrahman912 Jul 8, 2024
ea8451c
fix ci
Abdelrahman912 Jul 8, 2024
986c5db
remove file
Abdelrahman912 Jul 8, 2024
f93fdfb
add CUDA to docs project
Abdelrahman912 Jul 8, 2024
f442ae2
add v2 for gpu_heat_equation
Abdelrahman912 Jul 15, 2024
81274d5
add adapt to docs
Abdelrahman912 Jul 15, 2024
fbc05ed
minor fix
Abdelrahman912 Jul 22, 2024
506328c
init assemble per dof
Abdelrahman912 Jul 22, 2024
b505189
assemble global v3
Abdelrahman912 Jul 22, 2024
b0a94aa
minor fix
Abdelrahman912 Jul 22, 2024
aa3d1ae
add comment + start in v4
Abdelrahman912 Jul 31, 2024
c8cf6fe
add map dof to elements
Abdelrahman912 Jul 31, 2024
8a4523d
add 3d array for local matrices
Abdelrahman912 Aug 1, 2024
9617a4f
init code for v4
Abdelrahman912 Aug 1, 2024
427a6b0
fix bug w assemble global in v4
Abdelrahman912 Aug 5, 2024
bbed047
precommit fix
Abdelrahman912 Aug 5, 2024
85c055c
add preserve ref
Abdelrahman912 Aug 5, 2024
2b77613
fix precommit
Abdelrahman912 Aug 5, 2024
f9c70ab
fix logic error in v4
Abdelrahman912 Sep 7, 2024
0519016
init shared array usage
Abdelrahman912 Sep 9, 2024
5752676
optimize threads for dynamic shared memory threshold
Abdelrahman912 Sep 10, 2024
0fe023c
fix bug in dynamic shared mem
Abdelrahman912 Sep 11, 2024
a352612
minor fix
Abdelrahman912 Sep 11, 2024
2a6120a
init kernel abstractions
Abdelrahman912 Sep 16, 2024
67face7
add local matrix kernel
Abdelrahman912 Sep 16, 2024
aca8a6f
add global matrix kernel with CUDA dependency
Abdelrahman912 Sep 16, 2024
9e4d592
minor change
Abdelrahman912 Sep 16, 2024
6114495
init working KS implementation (still CUDA dependent )
Abdelrahman912 Sep 17, 2024
2a8abeb
remove cuda dependency
Abdelrahman912 Sep 18, 2024
630017c
add refrence to
Abdelrahman912 Sep 18, 2024
fc26670
use Atomix.jl
Abdelrahman912 Sep 20, 2024
ae7bc93
init v4 ks
Abdelrahman912 Sep 20, 2024
0e28f14
init cell cache prototype
Abdelrahman912 Sep 23, 2024
0eb376d
working gpu cell cache
Abdelrahman912 Sep 23, 2024
8f7a182
fix types
Abdelrahman912 Sep 23, 2024
9b1567d
init gpu cell iterator
Abdelrahman912 Sep 23, 2024
a08ab97
add iterator
Abdelrahman912 Sep 25, 2024
b34c43b
add stride kernel
Abdelrahman912 Sep 26, 2024
b289b69
minor fix
Abdelrahman912 Sep 26, 2024
b2c0347
fix blocks, threads for kernel launch
Abdelrahman912 Sep 27, 2024
b87d78b
minor fix for thread, blocks
Abdelrahman912 Sep 27, 2024
e10e2f6
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 3, 2024
42a28e1
add gpu as extension
Abdelrahman912 Oct 4, 2024
e59b8b8
add some documentaion and remove unnecessary implementations.
Abdelrahman912 Oct 7, 2024
e7157e4
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 10, 2024
e4b194d
init unit test
Abdelrahman912 Oct 10, 2024
a613107
init test for iterators
Abdelrahman912 Oct 11, 2024
113a7a2
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 11, 2024
d1e831e
add tests in GPU/
Abdelrahman912 Oct 11, 2024
7f8fa3c
add test local ke and fe
Abdelrahman912 Oct 11, 2024
c38419c
minor fix
Abdelrahman912 Oct 11, 2024
190e43e
fix ci - 1
Abdelrahman912 Oct 11, 2024
763c6b5
fix ci-2
Abdelrahman912 Oct 11, 2024
1b6060d
minor edit
Abdelrahman912 Oct 11, 2024
d767668
fix ci
Abdelrahman912 Oct 12, 2024
726ea9e
ci
Abdelrahman912 Oct 12, 2024
8590aa4
fix ci
Abdelrahman912 Oct 12, 2024
39e1f0c
minor edit
Abdelrahman912 Oct 12, 2024
f0cd305
add validation for cuda, minor fix, seperate unit tests into multiple…
Abdelrahman912 Oct 14, 2024
9d4e8b9
fix precommit shit
Abdelrahman912 Oct 14, 2024
12f64bb
try documentation test fix
Abdelrahman912 Oct 14, 2024
361333b
documentation test fix
Abdelrahman912 Oct 14, 2024
e31c6e3
make ci happy
Abdelrahman912 Oct 14, 2024
626dec2
change kernel launch, init adapt test
Abdelrahman912 Oct 15, 2024
fbc1b4b
minor fix
Abdelrahman912 Oct 15, 2024
ea83925
add test_adapt, some comments
Abdelrahman912 Oct 15, 2024
a356d8d
fix precommit
Abdelrahman912 Oct 15, 2024
ee1f77c
init cpu multi threading
Abdelrahman912 Nov 4, 2024
fb7e1fc
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 4, 2024
b38ab72
hot fix for buggy assembly logic
Abdelrahman912 Nov 5, 2024
adb166a
minor fix
Abdelrahman912 Nov 6, 2024
6300a4a
test sth
Abdelrahman912 Nov 6, 2024
b7301c2
precommit fix
Abdelrahman912 Nov 6, 2024
18f47b8
fix explicit imports
Abdelrahman912 Nov 6, 2024
f6e9cc6
add fillzero
Abdelrahman912 Nov 6, 2024
8a796de
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 6, 2024
75e89ed
minor fix for gpu assembly
Abdelrahman912 Nov 6, 2024
a77c347
minor minor fix
Abdelrahman912 Nov 6, 2024
7338788
make cache mutable
Abdelrahman912 Nov 12, 2024
cbab665
put the coloring stuff in the init
Abdelrahman912 Nov 12, 2024
1c81281
minor fix
Abdelrahman912 Nov 12, 2024
d42bcab
code for benchmarking (to be removed)
Abdelrahman912 Nov 13, 2024
1ab1650
rm cpu multithreading benchmark code
Abdelrahman912 Nov 13, 2024
bc8ec95
init fix for higher order approximations in gpu
Abdelrahman912 Nov 18, 2024
c7f4b0f
add working imp for global gpu mem
Abdelrahman912 Nov 18, 2024
d4d5967
add some comments
Abdelrahman912 Nov 18, 2024
3b2196b
trying to make the ci happy
Abdelrahman912 Nov 19, 2024
825d257
minor fix
Abdelrahman912 Nov 19, 2024
6109bd1
comment gpu related stuff in eg to pass ci
Abdelrahman912 Nov 19, 2024
9caa60b
some review fixes
Abdelrahman912 Nov 25, 2024
868d559
some review fixes
Abdelrahman912 Nov 25, 2024
a4637b6
add allocate_matrix for CuSparseMatrix
Abdelrahman912 Nov 26, 2024
1619986
init first ideas for cuda mem allocator
Abdelrahman912 Nov 28, 2024
69eb55a
add cuda mem interface
Abdelrahman912 Nov 28, 2024
ad09d08
minor fix
Abdelrahman912 Dec 2, 2024
801868b
first fix for global mem alloc
Abdelrahman912 Dec 2, 2024
81f932b
init fix for shared mem Alloc
Abdelrahman912 Dec 3, 2024
dd7868c
fix for keywords args bug
Abdelrahman912 Dec 3, 2024
441c9fb
init pre launch adaptation
Abdelrahman912 Dec 4, 2024
52f1479
minor fix
Abdelrahman912 Dec 4, 2024
05fb154
refactor mem allocate in cuda kernel launcher
Abdelrahman912 Dec 4, 2024
4381a76
minor changes
Abdelrahman912 Dec 4, 2024
63bbffe
fix tests
Abdelrahman912 Dec 5, 2024
57d01bf
minor fix
Abdelrahman912 Dec 5, 2024
1c806eb
add subdof
Abdelrahman912 Dec 31, 2024
0b5f381
fix ci
Abdelrahman912 Dec 31, 2024
3 changes: 3 additions & 0 deletions Project.toml
@@ -3,6 +3,9 @@ uuid = "c061ca5d-56c9-439f-9c0e-210fe06d3992"
version = "0.3.14"

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
EnumX = "4e289a0a-7415-4d19-859d-a7e5c4648b56"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
NearestNeighbors = "b8a86587-4115-5ab1-83bc-aa920d37bbce"
254 changes: 254 additions & 0 deletions docs/src/literate-tutorials/gpu_qp_heat_equation.jl
@@ -0,0 +1,254 @@
using Ferrite, CUDA
using StaticArrays
using Adapt

left = Tensor{1,2,Float64}((0.0, 0.0)) # bottom-left corner of the grid
right = Tensor{1,2,Float64}((400.0, 400.0)) # top-right corner of the grid
grid = generate_grid(Quadrilateral, (100, 100), left, right);



ip = Lagrange{RefQuadrilateral, 1}() # interpolation: bilinear Lagrange shape functions

# define the numerical integration rule
# (i.e. integrating over quad shape with two quadrature points per direction)
qr = QuadratureRule{RefQuadrilateral}(2)
cellvalues = CellValues(qr, ip);
static_cellvalues = Ferrite.StaticCellValues(cellvalues, Val(true));
size(static_cellvalues.fv.Nξ,1)
# Notes about cell values on the GPU:
# 1. fun_values & geo_mapping in CellValues are not isbits objects. Therefore, they cannot be passed to a GPU kernel.
# 2. fv & gm in StaticCellValues are isbits objects. Therefore, they can be passed to a GPU kernel.
# 3. StaticCellValues can be made an isbits type by removing the x property from it.
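The isbits distinction in the notes above can be checked in plain Julia. The following is a minimal sketch, not Ferrite code: `HeapValues` and `InlineValues` are hypothetical stand-ins for a struct that stores its data in a heap-allocated `Vector` and one that stores it inline in a tuple.

```julia
# Hypothetical stand-ins for illustration; these are not Ferrite types.
struct HeapValues
    N::Vector{Float64}      # heap-allocated array: the struct is not isbits
end

struct InlineValues
    N::NTuple{4, Float64}   # fixed-size tuple stored inline: the struct is isbits
end

heap_ok   = isbitstype(HeapValues)
inline_ok = isbitstype(InlineValues)

println("HeapValues is a bits type:   ", heap_ok)
println("InlineValues is a bits type: ", inline_ok)
```

Only the second kind of struct can be handed to a kernel by value, which is presumably why the tutorial works with `StaticCellValues` rather than `CellValues`.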


dh = DofHandler(grid)
add!(dh, :u, ip)
close!(dh);



# Standard assembly of the element.
function assemble_element_std!(Ke::Matrix, fe::Vector, cellvalues::CellValues)
    n_basefuncs = getnbasefunctions(cellvalues)

    # Loop over quadrature points
    for q_point in 1:getnquadpoints(cellvalues)
        # Get the quadrature weight
        dΩ = getdetJdV(cellvalues, q_point)
        # Loop over test shape functions
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            ∇δu = shape_gradient(cellvalues, q_point, i)
            # Add contribution to fe
            fe[i] += δu * dΩ
            # Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(cellvalues, q_point, j)
                # Add contribution to Ke
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end
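To make the double loop above concrete, here is a dependency-free sketch of the same element stiffness computation for a single bilinear quadrilateral with a 2x2 Gauss rule. It uses plain arrays instead of Ferrite's `CellValues`; the names `grads` and `element_stiffness` and the counter-clockwise node ordering are assumptions of this sketch.

```julia
using LinearAlgebra

# Reference-space gradients of the bilinear shape functions on [-1, 1]^2
# (counter-clockwise node order, an assumption of this sketch).
grads(ξ, η) = [[-(1 - η), -(1 - ξ)] / 4,
               [ (1 - η), -(1 + ξ)] / 4,
               [ (1 + η),  (1 + ξ)] / 4,
               [-(1 + η),  (1 - ξ)] / 4]

function element_stiffness(x)
    Ke = zeros(4, 4)
    g = 1 / sqrt(3)                            # 2-point Gauss abscissa, both weights are 1
    for ξ in (-g, g), η in (-g, g)
        dN = grads(ξ, η)
        J = sum(dN[i] * x[i]' for i in 1:4)    # J = Σ ∇N_i ⊗ x_i
        dΩ = det(J)                            # times the quadrature weight 1 * 1
        ∇N = [J \ dN[i] for i in 1:4]          # gradients in physical coordinates
        for i in 1:4, j in 1:4
            Ke[i, j] += dot(∇N[i], ∇N[j]) * dΩ
        end
    end
    return Ke
end

# One 4 x 4 cell, the cell size of the 100 x 100 grid on [0, 400]^2.
x = [[0.0, 0.0], [4.0, 0.0], [4.0, 4.0], [0.0, 4.0]]
Ke = element_stiffness(x)
```

The rows of `Ke` sum to zero because a constant field has zero gradient, which is a cheap sanity check for any implementation of this loop.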


# Element assembly by using static cell (PR #883)
function assemble_element_qpiter!(Ke::Matrix, fe::Vector, cellvalues)
    n_basefuncs = getnbasefunctions(cellvalues)
    ## Loop over quadrature points
    for qv in Ferrite.QuadratureValuesIterator(cellvalues)
        ## Get the quadrature weight
        dΩ = getdetJdV(qv)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu = shape_value(qv, i)
            ∇δu = shape_gradient(qv, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(qv, j)
                ## Add contribution to Ke
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end


function create_buffers(cellvalues, dh)
    f = zeros(ndofs(dh))
    K = create_sparsity_pattern(dh)
    assembler = start_assemble(K, f)
    ## Local quantities
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    return (; f, K, assembler, Ke, fe)
end


# Standard global assembly
function assemble_global!(cellvalues, dh::DofHandler, qp_iter::Val{QPiter}) where {QPiter}
    (; f, K, assembler, Ke, fe) = create_buffers(cellvalues, dh)
    # Loop over all cells
    for cell in CellIterator(dh)
        fill!(Ke, 0)
        fill!(fe, 0)
        if QPiter
            reinit!(cellvalues, getcoordinates(cell))
            assemble_element_qpiter!(Ke, fe, cellvalues)
        else
            # Reinitialize cellvalues for this cell
            reinit!(cellvalues, cell)
            # Compute element contribution
            assemble_element_std!(Ke, fe, cellvalues)
        end
        # Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return K, f
end



# Helper function that collects the node coordinates of every cell into one flat vector
# (cell by cell, so nodes shared between cells appear more than once).
function get_all_coordinates(grid::Ferrite.AbstractGrid{dim}) where {dim}
    coords = Vector{Vec{dim,Float32}}()
    n_cells = length(grid.cells)
    for i = 1:n_cells
        append!(coords, getcoordinates(grid, i))
    end
    return coords
end
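The flat layout produced by this helper is what lets the kernel find a cell's coordinates with simple index arithmetic. A minimal sketch with a hypothetical two-cell mesh (the `cells`, `flat` and `cell_coords` names are illustrative only and are not part of the PR):

```julia
# Hypothetical 2-cell mesh: each cell lists its own 4 node coordinates,
# so the two nodes on the shared edge are duplicated in the flat vector.
cells = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
         [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]]

flat = reduce(vcat, cells)   # cell-by-cell concatenation
n_per_cell = 4

# Coordinates of cell `e` occupy the index range (e-1)*n_per_cell+1 : e*n_per_cell.
cell_coords(e) = @view flat[(e - 1) * n_per_cell + 1 : e * n_per_cell]

println(collect(cell_coords(2)))
```

The duplication costs memory but avoids an indirection through a node index table inside the kernel.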

struct GPUGrid{sdim,V<:Vec{sdim,Float32},COORDS<:AbstractArray{V,1}} <: Ferrite.AbstractGrid{sdim}
    all_coords::COORDS
    n_cells::Int32
end

function GPUGrid(grid::Grid{sdim}) where sdim
    all_coords = cu(get_all_coordinates(grid))
    n_cells = Int32(length(grid.cells))
    GPUGrid(all_coords, n_cells)
end


struct GPUDofHandler{CDOFS<:AbstractArray{<:Number,1},GRID<:GPUGrid} <: Ferrite.AbstractDofHandler
    cell_dofs::CDOFS
    grid::GRID
end


function GPUDofHandler(dh::DofHandler)
    GPUDofHandler(cu(Int32.(dh.cell_dofs)), GPUGrid(dh.grid))
end

function Adapt.adapt_structure(to, grid::GPUGrid)
    all_coords = Adapt.adapt_structure(to, grid.all_coords)
    n_cells = Adapt.adapt_structure(to, grid.n_cells)
    GPUGrid(all_coords, n_cells)
end

function Adapt.adapt_structure(to, dh::GPUDofHandler)
    cell_dofs = Adapt.adapt_structure(to, dh.cell_dofs)
    grid = Adapt.adapt_structure(to, dh.grid)
    GPUDofHandler(cell_dofs, grid)
end

function Adapt.adapt_structure(to, cv::Ferrite.StaticCellValues)
    fv = Adapt.adapt_structure(to, cv.fv)
    gm = Adapt.adapt_structure(to, cv.gm)
    x = Adapt.adapt_structure(to, cu(cv.x))
    weights = Adapt.adapt_structure(to, cv.weights)
    Ferrite.StaticCellValues(fv, gm, x, weights)
end


gm = static_cellvalues.gm
x = get_all_coordinates(grid)
J = gm.dNdξ[1,1] ⊗ x[1] + gm.dNdξ[2,1] ⊗ x[2] + gm.dNdξ[3,1] ⊗ x[3] + gm.dNdξ[4,1] ⊗ x[4]
det(J)
inv_J = inv(J)
inv_J ⋅ static_cellvalues.fv.dNdξ[1,1]

function getjacobian(gm::Ferrite.StaticInterpolationValues, x, qr)
    n_basefuncs = size(gm.Nξ, 1)
    J = gm.dNdξ[1,qr] ⊗ x[1]
    for i = 2:n_basefuncs
        J += gm.dNdξ[i,qr] ⊗ x[i]
    end
    return J
end
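The sum of outer products in `getjacobian` can be checked by hand for one cell. Below is a minimal sketch with plain arrays instead of Tensors.jl types; the helper names `dNdξ_at` and `jacobian` and the counter-clockwise node order are assumptions, and the operand order (gradient first, coordinate second) follows the function above.

```julia
using LinearAlgebra

# Derivatives of the four bilinear shape functions on the reference square
# [-1, 1]^2 at the point (ξ, η); counter-clockwise node order is assumed.
dNdξ_at(ξ, η) = [(-(1 - η) / 4, -(1 - ξ) / 4),
                 ( (1 - η) / 4, -(1 + ξ) / 4),
                 ( (1 + η) / 4,  (1 + ξ) / 4),
                 (-(1 + η) / 4,  (1 - ξ) / 4)]

# J = Σ_i ∇N_i ⊗ x_i, written out with ordinary vectors and matrices.
function jacobian(x, ξ, η)
    J = zeros(2, 2)
    for (xi, dN) in zip(x, dNdξ_at(ξ, η))
        J .+= collect(dN) * collect(xi)'   # outer product of gradient and coordinate
    end
    return J
end

# One 4 x 4 cell, matching the cell size of the 100 x 100 grid on [0, 400]^2.
x = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
J = jacobian(x, 0.0, 0.0)
println(J, "  det = ", det(J))
```

For an axis-aligned rectangle the Jacobian is diagonal with half the side lengths on the diagonal, so its determinant is a quarter of the cell area at every quadrature point.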

function assemble_element_gpu!(Kgpu, cv::Ferrite.StaticCellValues, dh::GPUDofHandler)
    tx = threadIdx().x
    bx = blockIdx().x
    bd = blockDim().x
    e = tx + (bx-1)*bd # e is the index of the element handled by this thread.
    n_cells = dh.grid.n_cells
    e ≤ n_cells || return nothing
    n_qr = length(cv.weights)
    n_basefuncs = size(cv.fv.Nξ,1)
    dofs = dh.cell_dofs
    x = dh.grid.all_coords
    for qr = 1:n_qr # loop over quadrature points # TODO: propagate_inbounds
        si = (e-1)*n_basefuncs
        #J = gm.dNdξ[1,qr] ⊗ x[si+1] + gm.dNdξ[2,qr] ⊗ x[si+2] + gm.dNdξ[3,qr] ⊗ x[si+3] + gm.dNdξ[4,qr] ⊗ x[si+4]
        cell_x = @view x[si+1:si+size(cv.gm.Nξ,1)]
        J = getjacobian(cv.gm, cell_x, qr)
        inv_J = inv(J)
        #@cushow det(J)
        @inbounds dΩ = det(J) * cv.weights[qr]
        for i = 1:n_basefuncs
            @inbounds ∇δu = inv_J ⋅ cv.fv.dNdξ[i,qr]
            for j = 1:n_basefuncs
                @inbounds ∇u = inv_J ⋅ cv.fv.dNdξ[j,qr]
                @inbounds ig = dofs[(e-1)*n_basefuncs+i]
                @inbounds jg = dofs[(e-1)*n_basefuncs+j]
                CUDA.@atomic Kgpu[ig, jg] += (∇δu ⋅ ∇u) * dΩ # atomic because many threads might write to the same memory address at the same time.
            end
        end
    end
    return nothing
end
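The need for the atomic update can be illustrated on the CPU. This is only an analogue using `Base.Threads`, not `CUDA.@atomic`: several "elements" add their contribution to one shared matrix entry, and the atomic add makes each read-modify-write indivisible. The contribution values are made up for the illustration.

```julia
using Base.Threads

# One global matrix entry that several elements contribute to
# (for example the diagonal entry of a node shared by neighbouring cells).
shared_entry = Atomic{Float64}(0.0)
contributions = fill(0.5, 1000)   # hypothetical per-element contributions

@threads for e in eachindex(contributions)
    # Read, add and write back as a single indivisible step.
    atomic_add!(shared_entry, contributions[e])
end

println(shared_entry[])
```

With a plain `+=` on a shared variable, two threads could read the same old value and one update would be lost; the same hazard applies to `Kgpu[ig, jg]` whenever neighbouring elements share a degree of freedom.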

Kgpu = CUDA.zeros(ndofs(dh), ndofs(dh)) # dense global matrix on the GPU
gpu_dh = GPUDofHandler(dh)



function assemble_global_gpu!(Kgpu)
    kernel = @cuda launch=false assemble_element_gpu!(Kgpu, static_cellvalues, gpu_dh)
    config = launch_configuration(kernel.fun)
    threads = min(length(grid.cells), config.threads)
    blocks = cld(length(grid.cells), threads)
    kernel(Kgpu, static_cellvalues, gpu_dh; threads=threads, blocks=blocks)
end
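The thread and block arithmetic in this launch, together with the index guard in the kernel, can be emulated without a GPU. In the sketch below `max_threads` stands in for `config.threads`; its value of 256 is an assumption, since the real number depends on the device and the compiled kernel.

```julia
# Plain-Julia emulation of the launch arithmetic; no GPU is involved.
n_cells = 100 * 100
max_threads = 256                    # assumed stand-in for config.threads
threads = min(n_cells, max_threads)
blocks = cld(n_cells, threads)       # round up so every cell gets a thread

# Emulate the kernel's index computation and its bounds guard.
visited = zeros(Int, n_cells)
for bx in 1:blocks, tx in 1:threads
    e = tx + (bx - 1) * threads      # same formula as in the kernel
    e ≤ n_cells || continue          # surplus threads in the last block do nothing
    visited[e] += 1
end

println("threads = ", threads, ", blocks = ", blocks)
```

Rounding the block count up means the last block usually contains idle threads, which is why the kernel has to return early when `e` exceeds the number of cells.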

stassy(cv,dh) = assemble_global!(cv,dh,Val(false))

qpassy(cv,dh) = assemble_global!(cv,dh,Val(true))

using BenchmarkTools
using LinearAlgebra


assemble_global_gpu!(Kgpu)

Kgpu
norm(Kgpu)

Kstd , Fstd = stassy(cellvalues,dh);

norm(Kstd)

cvs = Ferrite.StaticCellValues(cellvalues, Val(true))

Kqp , Fqp = qpassy(cvs,dh);


I think we are missing the analogue benchmark using QuadraturePointIterator

norm(Kqp)