LLVM aarch64 relocation overflow #1421
Comments
Thank you very much for this bug report, @Kenny-Heitritter. We just released version 0.7.0; could you please tell us whether the problem is more or less likely to occur on 0.7.0? There are no direct fixes for this issue in 0.7.0, but the timing likely changed, so it would be good to know whether we should focus our debugging efforts on a specific version or not. A few items to note:
--- test_0.6.0.py 2024-03-20 13:21:03.138949476 +0000
+++ test_0.7.0.py 2024-03-20 13:31:09.739183293 +0000
@@ -128,7 +128,7 @@
for i in range(nelec):
kernel.x(qubits[i])
-cudaq.kernels.uccsd(kernel, qubits, thetas, nelec, qubits_num)
+kernel.apply_call(cudaq.kernels.uccsd, qubits, thetas, nelec, qubits_num)
 parameter_count = cudaq.kernels.uccsd_num_parameters(nelec,qubits_num)
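For context, a minimal sketch of what the surrounding builder-API test might look like on 0.7.0, assembled from the diff context above; the make_kernel/qalloc setup and the concrete values of qubits_num and nelec are assumptions for illustration, not taken from the original script:

import cudaq

qubits_num = 8  # assumed value for illustration
nelec = 2       # assumed value for illustration

# Builder-mode kernel parameterized on a list of angles (assumed setup)
kernel, thetas = cudaq.make_kernel(list)
qubits = kernel.qalloc(qubits_num)
for i in range(nelec):
    kernel.x(qubits[i])
# 0.7.0 API from the diff above: attach the UCCSD ansatz via apply_call
kernel.apply_call(cudaq.kernels.uccsd, qubits, thetas, nelec, qubits_num)
parameter_count = cudaq.kernels.uccsd_num_parameters(nelec, qubits_num)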
Thanks @bmhowe23! Just tested the same test script, modulo the new UCCSD API shown above, and it does appear the issue is present to the same degree in 0.7.0. Please do let me know if there are any other tests I can run which would be helpful.
@Kenny-Heitritter I am still trying to reproduce the issue on servers that I have access to (unsuccessfully thus far), but if you would like to try
The old link expired, so here is a new one: https://github.com/NVIDIA/cuda-quantum/pkgs/container/cuda-quantum-dev/200241787?tag=pr-1444-base
@Kenny-Heitritter We've seen some positive results from this image and will likely include the change from it in our next release. Feel free to test it out if you'd like: https://github.com/NVIDIA/cuda-quantum/pkgs/container/cuda-quantum-dev/235623747?tag=pr-1444-base. (Thanks @jfriel-oqc!)
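In case it is useful, pulling that dev image locally would presumably look like the following; the ghcr.io path is an assumption inferred from the package URL above, and only the pr-1444-base tag is taken from it:

docker pull ghcr.io/nvidia/cuda-quantum-dev:pr-1444-base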
Hi @bmhowe23, I can confirm that the issue is resolved when upgrading to a 0.8.0 container. However, I still face this issue on a GH200 Grace Hopper when installing CUDA-Q through Conda. Basic installation steps for a single node:
name="cq_arm"
conda create -y -n $name -c conda-forge python=3.10 pip
conda install -y -n $name -c "nvidia/label/cuda-11.8.0" cuda
conda install -y -n $name -c conda-forge mpi4py openmpi cxx-compiler
conda run -n $name pip install cuda-quantum
conda activate $name
conda env config vars set -n $name LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
conda deactivate
conda activate $name
Env info:
Example program source:
#!/usr/bin/env python3
import sys
import cudaq

print(f"Running on target {cudaq.get_target().simulator}")
qubit_count = int(sys.argv[1]) if 1 < len(sys.argv) else 2

@cudaq.kernel
def kernel():
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(0, qubit_count-1):
        x.ctrl(qubits[i], qubits[i+1])
    mz(qubits)

result = cudaq.sample(kernel)
if (not cudaq.mpi.is_initialized()) or (cudaq.mpi.rank() == 0):
    print(result)  # Example: { 11:500 00:500 }

The crash probability is about 10%. Experimental data suggest no correlation between the number of simulated qubits and the chance of crashing.
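As a quick sanity check that the Conda environment resolves to the expected build before running the reproducer, something like the following could be used; cudaq.__version__ is assumed to be exposed by the pip wheel, while the get_target().simulator query is the same one used in the program above:

import cudaq

# Report which CUDA-Q build was picked up and which simulator backend the
# current target will use (assumes the wheel exposes cudaq.__version__).
print(cudaq.__version__)
print(cudaq.get_target().simulator)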
@bmhowe23 This problem seems to be making a comeback. As of testing with cudaq version 0.9.0, I am seeing this error come up with a relatively high frequency on our GH200. Environment
@Kenny-Heitritter - thanks for the information. Are you using the Docker image or the Python wheels?
@bmhowe23 I will chime in since I am affected by this problem as well. It only happens when using Python wheels. The crash occurs very frequently, more often than I had previously reported. I am using the same example program (ghz.py) that I reported before. I am testing with the following script:
#!/usr/bin/env bash
declare -A exit_codes=()
for round in {1..100}
do
    python ghz.py 2 --target nvidia > /dev/null 2>&1
    exit_code=$?
    if [[ -v exit_codes[$exit_code] ]]; then
        ((exit_codes[$exit_code]++))
    else
        exit_codes[$exit_code]=1
    fi
done
# Exit code 134 = 128 + SIGABRT, i.e. the run aborted rather than exiting cleanly.
echo -e "OK: ${exit_codes[0]}\nKO: ${exit_codes[134]}"

I ran my script using three approaches:
@bebora, @Kenny-Heitritter - thanks again for bringing this to our attention. The issue should be resolved w/ #2504. We will likely go through a full release that includes this PR in a few weeks. |
Required prerequisites
Describe the bug
When running VQEs that require larger amounts of memory from within the CUDA Quantum Docker container (v0.6.0) on an NVIDIA GH200, there is an increasing chance of getting the following error:
Steps to reproduce the bug
Expected behavior
The code should run without producing an error.
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
Suggestions
No response