Cirq 1.0 statevector return type requirements for simulators consume more RAM after a qsim run #6107

Open
rht opened this issue May 25, 2023 · 7 comments
Labels
kind/bug-report Something doesn't seem to work. status/needs-agreed-design We want to do this, but it needs an agreed upon design before implementation triage/accepted A consensus emerged that this bug report, feature request, or other action should be worked on


rht commented May 25, 2023

Description of the issue
First reported in quantumlib/qsim#612. Cirq ~=1.0 requires qsimcirq to return its simulation output as a cirq.StateVectorTrialResult. In the current implementation, this causes an OOM when running a 32-qubit circuit on an a2-highgpu-1g, which has 85 GB of RAM. This was not the case in qsimcirq 0.13.

In the specific case when the statevector is final (no further operations on the statevector are needed after the simulation), this construction is expensive as it requires, at one point, 3x-4x more RAM than is necessary. The allocations are:

  1. The C++ buffer of the statevector in the qsim layer. Is it scratch?
  2. The Python buffer of the statevector in the Cirq layer
  3. The simulation output viewed as an array of np.complex64. Edit: this shouldn't cause any extra RAM because it's just a view; the view has since been removed by @NoureldinYosri (https://github.com/quantumlib/qsim/blob/7b921299e53073e1f4e35c9b349dcf9655d76b63/qsimcirq/qsim_simulator.py#L561 in quantumlib/qsim@0009bc4).
  4. The copy of the simulation output

A quick modification on a live Cirq 1.1.0 install, where I removed the state_vector = state_vector.copy() line, made the OOM error go away. It seems that the extra RAM consumption could be reduced further, though.
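A rough back-of-the-envelope check of why the number of simultaneous copies matters at 32 qubits. This is only a sketch: statevector_bytes is a hypothetical helper, the 85 GB figure is treated as decimal gigabytes, and everything other than the state-vector buffers is ignored.

```python
import numpy as np

def statevector_bytes(num_qubits: int, dtype=np.complex64) -> int:
    """Bytes needed for one dense state vector of 2**num_qubits amplitudes."""
    return (2 ** num_qubits) * np.dtype(dtype).itemsize

RAM_GB = 85  # RAM figure quoted in the report (assumed to be decimal GB)
one_copy_gb = statevector_bytes(32) / 1e9

for copies in (1, 2, 3, 4):
    total_gb = copies * one_copy_gb
    print(f"{copies} buffer(s): {total_gb:.1f} GB, fits in {RAM_GB} GB: {total_gb < RAM_GB}")
```

Under these assumptions two full-size buffers fit in the machine's RAM while three do not, which is in line with the OOM disappearing once one copy is removed.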

How to reproduce the issue
Steps to reproduce and the output can be found in quantumlib/qsim#612 (comment).

Cirq version
~= 1.0

cc: @daxfohl @95-martin-orion @sergeisakov

@rht rht added the kind/bug-report Something doesn't seem to work. label May 25, 2023
@rht rht changed the title Cirq 1.0 statevector return type requirements for simulators consume more RAM in qsim Cirq 1.0 statevector return type requirements for simulators consume more RAM after a qsim run May 25, 2023
@tanujkhattar tanujkhattar added triage/discuss Needs decision / discussion, bring these up during Cirq Cynque status/needs-agreed-design We want to do this, but it needs an agreed upon design before implementation labels Jun 7, 2023
@verult verult added triage/accepted A consensus emerged that this bug report, feature request, or other action should be worked on and removed triage/discuss Needs decision / discussion, bring these up during Cirq Cynque labels Jun 21, 2023

rht commented Oct 15, 2023

Solving this piecewise:

  • allocation 4 could be removed if cirq.StateVectorSimulationState had an extra argument, inplace=True, for the simulation output, which skips the copy operation when enabled.
  • allocation 3 could be removed with qsim_state.astype(np.complex64, copy=False)

Ideally, the Python buffer should be gone, and we would only have 1 buffer in C++, but this is not blocking the quick solution for allocations 3 and 4. This might be sufficient for our use case.
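To make the inplace idea concrete, here is a minimal sketch of the copy-or-reuse choice. prepare_state and its inplace parameter are hypothetical names for illustration only; this is not Cirq's actual API or implementation.

```python
import numpy as np

def prepare_state(state: np.ndarray, inplace: bool = False) -> np.ndarray:
    """Hypothetical helper: pick the buffer that later operations will use.

    inplace=False makes a defensive copy (one extra full-size allocation);
    inplace=True reuses the caller's buffer and allocates nothing.
    """
    return state if inplace else state.copy()

state = np.zeros(2**10, dtype=np.complex64)
state[0] = 1

copied = prepare_state(state)
shared = prepare_state(state, inplace=True)

print(np.shares_memory(state, copied))  # the copy owns separate memory
print(np.shares_memory(state, shared))  # the in-place result aliases the input
```

The trade-off is that with inplace=True any later operation on the result also mutates the array the caller passed in, which is presumably why the copy is the default behaviour.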


rht commented Oct 15, 2023

On 30 qubits, for this circuit, removing the copy operation reduces both the peak memory and the elapsed time:

from memory_profiler import memory_usage
import time

import cirq
import qsimcirq

def f():
    num_qubits = 30
    qc_cirq = cirq.Circuit()
    qubits = cirq.LineQubit.range(num_qubits)
    for i in range(num_qubits):
        qc_cirq.append(cirq.H(qubits[i]))
    sim = qsimcirq.QSimSimulator()
    tic = time.time()
    # sim = cirq.Simulator()
    sim.simulate(qc_cirq)
    print("Elapsed", time.time() - tic)
print("Max memory", max(memory_usage(f)))

# Before
Max memory 17241.0859375 MiB, elapsed 11.7 s
# After
Max memory 9045.91015625 MiB, elapsed 7.9 s

qsim_state.astype(np.complex64, copy=False) doesn't work, because qsim_state is an ndarray of floats that is supposed to be reinterpreted as an ndarray of complexes, whereas astype converts each float into its own complex number. I suppose view is just a memory view, and doesn't take any significant space.
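A small NumPy illustration of the difference between the two calls. The assumption here (not confirmed in this thread) is that the float array holds interleaved real and imaginary parts as float32; the array contents are made up.

```python
import numpy as np

# Stand-in for qsim_state: interleaved (real, imag) pairs as float32 (assumed layout).
floats = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# view reinterprets the same bytes: each pair of floats becomes one complex number.
as_view = floats.view(np.complex64)

# astype converts values: each float becomes its own complex number, and a new
# array is allocated because the dtype differs, even with copy=False.
as_cast = floats.astype(np.complex64, copy=False)

print(as_view)  # 2 amplitudes
print(as_cast)  # 4 amplitudes, all with zero imaginary part
print(np.shares_memory(floats, as_view), np.shares_memory(floats, as_cast))
```

So astype would both produce the wrong amplitudes and allocate a buffer twice the size of the input, whereas view allocates no new data.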

9 GiB is about 1 GiB more than an array of 2^30 np.complex64's (8 GiB). This likely means that allocations 1 and 2 don't exist at the same time during the run.

Edit: benchmark was run on cuQuantum Appliance 23.06 (Cirq 1.1.0, qsimcirq 0.15.0)


rht commented Oct 15, 2023

The benchmark on cuQuantum Appliance 22.11 (Cirq 0.14.1, qsimcirq 0.12.1), before the large memory usage was introduced:

Max memory 8943.50390625 MiB, elapsed 7.8 s
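One way to read the three measurements is to express each peak as a multiple of a single 30-qubit np.complex64 buffer. This is only arithmetic on the numbers reported above; the labels are mine, and the figures are assumed to be in MiB, as memory_profiler reports them.

```python
import numpy as np

MIB = 2**20
# Size of one dense 30-qubit state vector of np.complex64 amplitudes, in MiB.
one_buffer_mib = (2**30) * np.dtype(np.complex64).itemsize / MIB

# Peak memory figures reported in this thread.
measurements_mib = {
    "Cirq 1.1.0 / qsimcirq 0.15.0, with copy": 17241.0859375,
    "Cirq 1.1.0 / qsimcirq 0.15.0, copy removed": 9045.91015625,
    "Cirq 0.14.1 / qsimcirq 0.12.1": 8943.50390625,
}

buffers = {label: peak / one_buffer_mib for label, peak in measurements_mib.items()}
for label, ratio in buffers.items():
    print(f"{label}: {ratio:.2f} state-vector buffers")
```

Read this way, the run with the copy peaks at a little over two buffers, while the other two peak at a little over one, which is consistent with the copy being the one extra full-size allocation.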


rht commented Feb 7, 2024

Point no. 3 in the original post is not an allocation. It's just a view, and it has already been removed by @NoureldinYosri in quantumlib/qsim@0009bc4.

@NoureldinYosri
Collaborator

The return type of the qsim simulator is StateVectorTrialResult, which indeed creates an extra buffer in order to speed up operations on the resultant statevector.

If you just want the state vector, you can use simulate_into_1d_array, which will return the state vector as a 1D np.array without any buffers. This should fix your memory issues. However, you will lose the operations that StateVectorTrialResult implements, or you will need to implement them yourself if they are not supported by the cirq routines.
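As an illustration of what "implement them yourself" could look like on a flat array, here is a sketch of one such operation: the probability of measuring a single qubit as 1. The function name is hypothetical, and big-endian ordering (qubit 0 as the most significant index bit) is an assumption. It reshapes to three axes rather than one axis per qubit, so the number of array dimensions does not grow with the qubit count.

```python
import numpy as np

def qubit_one_probability(state: np.ndarray, qubit: int, num_qubits: int) -> float:
    """Probability of measuring `qubit` as 1, given a flat state vector.

    Assumes big-endian ordering: qubit 0 is the most significant index bit.
    """
    # Three axes regardless of qubit count: (bits before, target bit, bits after).
    view = state.reshape(2**qubit, 2, 2 ** (num_qubits - qubit - 1))
    ones = view[:, 1, :]  # amplitudes where the target bit is 1
    return float(np.sum(ones.real**2 + ones.imag**2))

# Two-qubit basis state |01>: qubit 0 is 0, qubit 1 is 1.
state = np.zeros(4, dtype=np.complex64)
state[1] = 1

p0 = qubit_one_probability(state, 0, 2)
p1 = qubit_one_probability(state, 1, 2)
print(p0, p1)
```

The reshape itself is a view on a contiguous array, although the squared-magnitude step still creates temporaries of up to half the state-vector size.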


rht commented Feb 7, 2024

Yeah, I'm aware of simulate_into_1d_array for my use case. The separate question is whether it is a long-term general solution once NumPy removes its 32-dimension limitation.

@NoureldinYosri
Collaborator

I suppose the question is whether we want to create a version of StateVectorTrialResult that doesn't use buffers. This will reduce its memory footprint, but at the cost of performance. StateVectorTrialResult was written with performance in mind, so creating a version of it that uses less memory will hurt performance.

Feel free to create a feature request for the unbuffered version of StateVectorTrialResult and we can discuss it there.
