
PCG produces different output on GPU #724

Closed
zjwegert opened this issue Apr 12, 2023 · 5 comments
Comments

@zjwegert

Describe the bug

It appears that a Jacobi preconditioned CG method produces different output on a GPU when re-run.

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra, Krylov, Random

function generate_matrix(n)
  A=sprand(n,n,0.1); 
  A=SparseMatrixCSC(0.5*(A+A') + 0.005*n*I(n));
  b=rand(n);
  A,b
end

function krylov_inv(A::SparseMatrixCSC{M,V}, B::Vector{M}) where {M,V}
    # Jacobi preconditioner
    d = [A[i,i] ≠ 0 ? 1 / abs(A[i,i]) : 1 for i ∈ axes(A,1)]
    P⁻¹ = spdiagm(d)

    x, _ = Krylov.cg(A, B; M=P⁻¹, rtol=1e-13)
    return Array(x)
end

function krylov_inv_gpu(A::SparseMatrixCSC{M,V}, B::Vector{M}) where {M,V}
    A_gpu = CuSparseMatrixCSC(A)
    B_gpu = CuArray(B)

    # Jacobi preconditioner
    d = [A[i,i] ≠ 0 ? 1 / abs(A[i,i]) : 1 for i ∈ axes(A,1)]
    P⁻¹ = CuSparseMatrixCSC(spdiagm(d))

    x, _ = Krylov.cg(A_gpu, B_gpu; M=P⁻¹, rtol=1e-13)
    return Array(x)
end

# Random.seed!(3);
A,b=generate_matrix(5000);
isposdef(A) # ---> True
issymmetric(A) # ---> True
x = krylov_inv(A,b);
x2 = krylov_inv(A,b);
norm(x-x2,Inf) # ---> 0.0
gpu_x = krylov_inv_gpu(A,b);
gpu_x2 = krylov_inv_gpu(A,b);
norm(gpu_x-gpu_x2,Inf) # ---> Non-zero

Expected behavior

Expect norm(gpu_x-gpu_x2,Inf) = 0.

Version info

Details on Julia:

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × Intel(R) Xeon(R) W-2265 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 24 virtual cores

Package versions:

  [052768ef] CUDA v4.1.3
  [ba0b0d4f] Krylov v0.9.0

Details on CUDA:

CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 525.105.17, originally for CUDA 12.0

Libraries: 
- CUBLAS: 12.1.0
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.4
- CUSPARSE: 12.0.2
- CUPTI: 18.0.0
- NVML: 12.0.0+525.105.17

Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA RTX A4500 (sm_86, 19.237 GiB / 19.990 GiB available)

Additional info:
I've tested this on another system and the issue persists.

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Core(TM) i7-9800X CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores

and

CUDA runtime 11.8, artifact installation
CUDA driver 11.4
NVIDIA driver 470.42.1

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.42.1

Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: Quadro GP100 (sm_60, 15.313 GiB / 15.897 GiB available)
  1: Quadro P620 (sm_61, 1.470 GiB / 1.945 GiB available)
@amontoison
Member

amontoison commented Apr 12, 2023

Hi @zjwegert, can you re-run the tests with CUDA toolkit 11.8?
CUDA.set_runtime_version!(v"11.8").
The CUSPARSE library shipped with the v"12.x" CUDA toolkits is unstable.

Is it also possible to see the logs with the option history=1? Matrix-vector products are multithreaded on GPU and can give different results if A is ill-conditioned, but PCG should require almost the same number of iterations to find the solution.
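For reference, a minimal sketch of what enabling the residual history looks like with Krylov.jl's `history` keyword (the matrix and right-hand side below are placeholders for illustration, not the issue's data):

```julia
using SparseArrays, LinearAlgebra, Krylov

# Placeholder SPD system, assumed only for illustration.
n = 100
A = sprandn(n, n, 0.1)
A = A'A + n * I              # symmetric positive definite
b = rand(n)

# `history=true` stores the residual norms in the returned statistics,
# which is where the `residuals: [...]` entries in the logs come from.
x, stats = Krylov.cg(A, b; rtol=1e-13, history=true)
stats.solved                 # whether the stopping criterion was met
stats.residuals              # residual norm at each iteration
```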

@zjwegert
Author

Hi @amontoison,
I've re-run the problems (on the latter system) with the suggested version. Please see output below:

julia> x = krylov_inv(A,b);
┌ Info: Simple stats
│  niter: 25
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  2.4e-08  2.9e-08  7.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> x2 = krylov_inv(A,b);
┌ Info: Simple stats
│  niter: 25
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  2.4e-08  2.9e-08  7.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> norm(x-x2,Inf) # ---> 0.0
0.0

julia> gpu_x = krylov_inv_gpu(A,b);

┌ Info: Simple stats
│  niter: 24
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  4.9e-08  2.0e-08  1.0e-08 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> gpu_x2 = krylov_inv_gpu(A,b);
┌ Info: Simple stats
│  niter: 24
│  solved: true
│  inconsistent: false
│  residuals: [ 8.1e+00  4.1e+00  1.3e+00 ...  4.9e-08  2.0e-08  8.9e-09 ]
│  Aresiduals: []
│  κ₂(A): []
└  status: solution good enough given atol and rtol

julia> norm(gpu_x-gpu_x2,Inf) # ---> Non-zero
1.9215254387638936e-12

The condition number for the matrix in this computation is

julia> cond(Array(A),2)
37.067459615697715

@amontoison
Member

amontoison commented Apr 13, 2023

The observed phenomenon is related to the parallelism of GPUs.
If we run the code on the CPU, sparse matrix-vector products are not multithreaded by default in Julia, and mul!(y, A, x) will always return the same vector. If you use MKLSparse.jl or the function threaded_mul!, you will probably get a different solution when you re-run cg.

An example to illustrate it: generate a matrix of size 5 × n whose rows are all equal, with coefficients (1, 1/2, ..., 1/n). If v = (1, ..., 1), the components of y = A * v are not all equal (a floating-point absorption phenomenon).

using LinearAlgebra, SparseArrays, CUDA, CUDA.CUSPARSE

n = 10000
T = Float64
A_cpu = zeros(T, 5, n)
for i = 1:5
  for j = 1:n
    A_cpu[i,j] = 1/j
  end
end
A_cpu = sparse(A_cpu)
v_cpu = ones(T, n)
y_cpu = A_cpu * v_cpu

A_gpu = CuSparseMatrixCSC(A_cpu)
v_gpu = CuVector(v_cpu)
y_gpu = A_gpu * v_gpu

julia> y_cpu
5-element Vector{Float64}:
 9.787606036044348
 9.787606036044348
 9.787606036044348
 9.787606036044348
 9.787606036044348

julia> y_gpu
5-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 9.787606036044377
 9.787606036044378
 9.787606036044371
 9.787606036044373
 9.787606036044366
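The same effect can be reproduced without a GPU at all: floating-point addition is not associative, so summing the same numbers in a different order (which is effectively what a parallel reduction does) can perturb the last bits of the result. A minimal CPU-only sketch:

```julia
# Sum the coefficients 1, 1/2, ..., 1/n in two different orders.
# The results agree to high precision but are generally not
# bit-identical, mirroring the CPU/GPU discrepancy above.
n = 10_000
coeffs = [1/j for j in 1:n]
s_fwd = foldl(+, coeffs)           # left-to-right accumulation
s_bwd = foldl(+, reverse(coeffs))  # reversed accumulation order
abs(s_fwd - s_bwd)                 # tiny, typically nonzero
```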

cc @PierreYvesBouchet

@amontoison
Member

@zjwegert May I close this issue?

@zjwegert
Author

zjwegert commented May 11, 2023 via email
