
Segmentation Fault with Multi GPU #2821

@tchaton

Description

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

I am trying to run the script below on multiple GPUs and hit a segmentation fault. I have been using a Lightning Studio with 4 x L4 GPUs.

import cudaq
from cudaq import spin
import numpy as np
import time
import faulthandler
faulthandler.enable()
np.random.seed(1)
qubit_count = 5
sample_count = 10000
h = spin.z(0)
parameter_count = qubit_count
# prepare 10000 different input parameter sets.
parameters = np.random.default_rng(13).uniform(low=0,
                                               high=1,
                                               size=(sample_count,
                                                     parameter_count))
@cudaq.kernel
def kernel(params: list[float]):
    qubits = cudaq.qvector(5)
    for i in range(5):
        rx(params[i], qubits[i])
def build_kernel():
    k, params = cudaq.make_kernel(list[float])
    qubits = k.qalloc(5)
    k.x(qubits[0])
    k.x(qubits[1])
    for i in range(5):
        k.rx(params[i], qubits[i])
    return k
print('There are', parameters.shape[0], 'parameter sets to execute')
xi = np.split(
    parameters,
    4)  # Split the parameters into 4 arrays since 4 GPUs are available.
print('Split parameters into', len(xi), 'batches of', xi[0].shape[0], ',',
      xi[1].shape[0], ',', xi[2].shape[0], ',', xi[3].shape[0])
# Timing the execution on a single GPU vs 4 GPUs,
# one will see a nearly 4x performance improvement if 4 GPUs are available.
#print(build_kernel())
#print(kernel([0.1, 0.2, 0.3, 0.4, 0.5]))
if cudaq.num_available_gpus() == 0:
    cudaq.set_target("qpp-cpu", option="fp64")
elif cudaq.num_available_gpus() == 1:
    cudaq.set_target("nvidia", option="fp64")
else:
    print("Chosen MultiGPU")
    cudaq.set_target("nvidia", option="mqpu")
asyncresults = []
num_gpus = cudaq.num_available_gpus()
start_time = time.time()
for i in range(1):
    for j in range(xi[i].shape[0]):
        qpu_id = i * num_gpus // len(xi)
        ker = build_kernel()
        asyncresults.append(
            cudaq.observe_async(ker, h, xi[i][j, :], qpu_id=qpu_id))
result = [res.get() for res in asyncresults]
end_time = time.time()
print(end_time - start_time)
~ python main.py
There are 10000 parameter sets to execute
Split parameters into 4 batches of 2500 , 2500 , 2500 , 2500
Chosen MultiGPU
Fatal Python error: Segmentation fault

Thread 0x00007de4d29aa740 (most recent call first):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/cudaq/mlir/dialects/_func_ops_ext.py", line 77 in type
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/cudaq/mlir/dialects/_func_ops_ext.py", line 102 in add_entry_block
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/cudaq/kernel/kernel_builder.py", line 294 in __init__
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/cudaq/kernel/kernel_builder.py", line 1592 in make_kernel
  File "/teamspace/studios/this_studio/main.py", line 26 in build_kernel
  File "/teamspace/studios/this_studio/main.py", line 56 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, cuquantum.bindings._utils, cuquantum.bindings._internal.cudensitymat, cuquantum.bindings.cycudensitymat, cupy_backends.cuda._softlink, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy_backends.cuda.stream, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.pinned_memory, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy_backends.cuda.libs.cutensor, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._routines_binary, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.fft._cache, cupy.fft._callback, cupy.random._bit_generator, cuquantum.bindings._internal.cutensornet, cuquantum.bindings.cycutensornet, cuquantum.bindings.cutensornet, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cupyx.cutensor, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cuquantum.bindings.cudensitymat, cuquantum.bindings._internal.custatevec, cuquantum.bindings.cycustatevec, cuquantum.bindings.custatevec, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, 
scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 148)
/commands/python: line 37: 42648 Segmentation fault      (core dumped) "$@"
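For what it's worth, the faulthandler trace points at cudaq.make_kernel, which the loop above invokes once per parameter set (2,500 calls) after the mqpu target is set. Below is a stripped-down loop that exercises the same call path, in case it helps isolate the crash; this is a minimal sketch and I have not confirmed it segfaults on its own:

import cudaq

cudaq.set_target("nvidia", option="mqpu")

# Build a fresh kernel on every iteration, mirroring the inner loop of
# main.py; the faulthandler trace shows the crash inside make_kernel.
for n in range(2500):
    k, params = cudaq.make_kernel(list[float])
    qubits = k.qalloc(5)
    for i in range(5):
        k.rx(params[i], qubits[i])

If repeated builder construction turns out to be the trigger, hoisting build_kernel() out of the inner loop and reusing a single kernel across the observe_async calls may be a viable workaround.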

Steps to reproduce the bug

Create a Lightning Studio with 4 x L4 GPUs. We are happy to cover the cost there; reach out to thomas@lightning.ai to make this happen.

Installation steps:

export OMPI_MCA_opal_cuda_support=true OMPI_MCA_btl='^openib'
cuda_version=12.2.0 # set this variable to version 11.x (where x >= 8) or 12.x
conda install -y -n cudaq-env -c "nvidia/label/cuda-${cuda_version}" cuda
conda install -y -n cudaq-env -c conda-forge mpi4py openmpi">=5.0.3" cxx-compiler
pip install cudaq
export MPI_PATH=/home/zeus/miniconda3/envs/cloudspace
/home/zeus/miniconda3/envs/cloudspace/bin/x86_64-conda-linux-gnu-c++ -shared -fPIC -o mpi_plugin.so mpi_comm_impl.cpp $(mpicc --showme:compile) $(mpicc --showme:link)
export CONDA_PREFIX=/home/zeus/miniconda3/envs/cloudspace
cd $CONDA_PREFIX/lib/python3.10/site-packages/distributed_interfaces && source activate_custom_mpi.sh

export OMPI_MCA_btl_base_verbose=100
export OMPI_MCA_btl=self,vader,tcp
export OMPI_MCA_pml=ob1

MPI itself is working fine (see the sanity check below).
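For reference, a check along the following lines can confirm the custom MPI plugin is picked up (an illustrative sketch; it assumes the cudaq.mpi helper API available in this release):

# check_mpi.py -- run under MPI, e.g.: mpirun -np 4 python check_mpi.py
import cudaq

cudaq.mpi.initialize()
print("rank", cudaq.mpi.rank(), "of", cudaq.mpi.num_ranks())
cudaq.mpi.finalize()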

Expected behavior

No segmentation fault.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA-Q version: 0.10.0
  • Python version: 3.10.13
  • C++ compiler: 13.3.0
  • Operating system: Ubuntu 20.04.6 LTS (Focal Fossa)

⚡ ~ mpic++ --showme:command
x86_64-conda-linux-gnu-c++

Suggestions

No response
