
PCG produces different output on GPU #724

Closed
zjwegert opened this issue Apr 12, 2023 · 5 comments
Comments

@zjwegert

Describe the bug

It appears that a Jacobi preconditioned CG method produces different output on a GPU when re-run.

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra, Krylov, Random

function generate_matrix(n)
  A=sprand(n,n,0.1); 
  A=SparseMatrixCSC(0.5*(A+A') + 0.005*n*I(n));
  b=rand(n);
  A,b
end

function krylov_inv(A::SparseMatrixCSC{M,V}, B::Vector{M}) where {M,V}
    # Jacobi preconditioner
    d = [A[i,i] ≠ 0 ? 1 / abs(A[i,i]) : 1 for i ∈ axes(A,1)]
    P⁻¹ = spdiagm(d)

    x, _ = Krylov.cg(A, B; M=P⁻¹, rtol=1e-13)
    return Array(x)
end

function krylov_inv_gpu(A::SparseMatrixCSC{M,V}, B::Vector{M}) where {M,V}
    A_gpu = CuSparseMatrixCSC(A)
    B_gpu = CuArray(B)

    # Jacobi preconditioner
    d = [A[i,i] ≠ 0 ? 1 / abs(A[i,i]) : 1 for i ∈ axes(A,1)]
    P⁻¹ = CuSparseMatrixCSC(spdiagm(d))

    x, _ = Krylov.cg(A_gpu, B_gpu; M=P⁻¹, rtol=1e-13)
    return Array(x)
end

# Random.seed!(3);
A,b=generate_matrix(5000);
isposdef(A) # ---> True
issymmetric(A) # ---> True
x = krylov_inv(A,b);
x2 = krylov_inv(A,b);
norm(x-x2,Inf) # ---> 0.0
gpu_x = krylov_inv_gpu(A,b);
gpu_x2 = krylov_inv_gpu(A,b);
norm(gpu_x-gpu_x2,Inf) # ---> Non-zero

Expected behavior

Expect norm(gpu_x-gpu_x2,Inf) = 0.

Version info

Details on Julia:

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) W-2265 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 24 virtual cores

Package versions:

  [052768ef] CUDA v4.1.3
  [ba0b0d4f] Krylov v0.9.0

Details on CUDA:

CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 525.105.17, originally for CUDA 12.0

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+525.105.17

Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA RTX A4500 (sm_86, 19.237 GiB / 19.990 GiB available)

Additional info:
I've tested this on another system and the issue persists.

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores

and

CUDA runtime 11.8, artifact installation
CUDA driver 11.4
NVIDIA driver 470.42.1

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.42.1

Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: Quadro GP100 (sm_60, 15.313 GiB / 15.897 GiB available)
  1: Quadro P620 (sm_61, 1.470 GiB / 1.945 GiB available)
@amontoison
Member

amontoison commented Apr 12, 2023

Hi @zjwegert, can you re-run the tests with CUDA toolkit 11.8?
CUDA.set_runtime_version!(v"11.8").
The CUSPARSE library shipped with the v"12.x" CUDA toolkits is unstable.

Is it also possible to see the logs with the option history=1? Matrix-vector products are multithreaded on GPU and can give different results if A is ill-conditioned, but PCG should require almost the same number of iterations to find the solution.
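For reference, a minimal sketch of what enabling the residual history looks like with Krylov.jl's `history` keyword (the matrix and right-hand side below are placeholders for illustration, not the issue's data):

```julia
using SparseArrays, LinearAlgebra, Krylov

# Placeholder SPD system, assumed only for illustration.
n = 100
A = sprandn(n, n, 0.1)
A = A'A + n * I              # symmetric positive definite
b = rand(n)

# `history=true` stores the residual norms in the returned statistics,
# which is where the `residuals: [...]` entries in the logs come from.
x, stats = Krylov.cg(A, b; rtol=1e-13, history=true)
stats.solved                 # whether the stopping criterion was met
stats.residuals              # residual norm at each iteration
```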

@zjwegert
Author

Hi @amontoison,
I've re-run the problems (on the latter system) with the suggested version. Please see output below:

julia> x = krylov_inv(A,b);
┌ Info: Simple stats
│  niter: 25
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  2.4e-08  2.9e-08  7.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> x2 = krylov_inv(A,b);
┌ Info: Simple stats
│  niter: 25
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  2.4e-08  2.9e-08  7.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> norm(x-x2,Inf) # ---> 0.0
0.0

julia> gpu_x = krylov_inv_gpu(A,b);

┌ Info: Simple stats
│  niter: 24
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  4.9e-08  2.0e-08  1.0e-08 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> gpu_x2 = krylov_inv_gpu(A,b);
┌ Info: Simple stats
│  niter: 24
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  4.9e-08  2.0e-08  8.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> norm(gpu_x-gpu_x2,Inf) # ---> Non-zero
1.9215254387638936e-12

The condition number for the matrix in this computation is

julia> cond(Array(A),2)
37.067459615697715

@amontoison
Member

amontoison commented Apr 13, 2023

The observed phenomenon is related to the parallelism of GPUs.
If we run the code on the CPU, sparse matrix-vector products are not multithreaded by default in Julia, and mul!(y, A, x) will always return the same vector. If you use MKLSparse.jl or the function threaded_mul!, you will probably get a different solution when you re-run cg.

An example to illustrate it: generate a matrix of size 5 × n whose rows are all equal, with coefficients (1, 1/2, ..., 1/n). If v = (1, ..., 1), the components of y = A * v are not all equal (a floating-point absorption phenomenon).

using LinearAlgebra, SparseArrays, CUDA, CUDA.CUSPARSE

n = 10000
T = Float64
A_cpu = zeros(T, 5, n)
for i = 1:5
  for j = 1:n
    A_cpu[i,j] = 1/j
  end
end
A_cpu = sparse(A_cpu)
v_cpu = ones(T, n)
y_cpu = A_cpu * v_cpu

A_gpu = CuSparseMatrixCSC(A_cpu)
v_gpu = CuVector(v_cpu)
y_gpu = A_gpu * v_gpu

julia> y_cpu
5-element Vector{Float64}:
 9.787606036044348
 9.787606036044348
 9.787606036044348
 9.787606036044348
 9.787606036044348

julia> y_gpu
5-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 9.787606036044377
 9.787606036044378
 9.787606036044371
 9.787606036044373
 9.787606036044366
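The same effect can be reproduced without a GPU at all: floating-point addition is not associative, so summing the same numbers in a different order (which is effectively what a parallel reduction does) can perturb the last bits of the result. A minimal CPU-only sketch:

```julia
# Sum the coefficients 1, 1/2, ..., 1/n in two different orders.
# The results agree to high precision but are generally not
# bit-identical, mirroring the CPU/GPU discrepancy above.
n = 10_000
coeffs = [1/j for j in 1:n]
s_fwd = foldl(+, coeffs)           # left-to-right accumulation
s_bwd = foldl(+, reverse(coeffs))  # reversed accumulation order
abs(s_fwd - s_bwd)                 # tiny, typically nonzero
```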

cc @PierreYvesBouchet

@amontoison
Member

@zjwegert May I close this issue?

@zjwegert
Author

zjwegert commented May 11, 2023 via email
