Atb operator CUDA Out Of Memory error #617
Interesting... Indeed the operators (regardless of whether the geometry is changed or not) should not take any memory, i.e. the memory footprint after calling them should be the same (or plus one image size, if the image variable was not created before the call). This could be a serious bug, but I have never experienced it before, so I wonder if there is a combination of OS/CUDA/etc. versions that is causing this to go badly. To try to verify this, can you make an even smaller example? If this is caused by the TIGRE operators, it should also happen if you don't use your own definitions. Also, could you do another test for me? Can you run this with the fan-beam code? I understand the result will be wrong, but the codebase for parallel beam and for fan beam is different, and I wonder if there is a bug in parallel beam, which is considerably less used by the community and therefore more prone to undetected bugs.
A third question: can you see which of Ax/Atb is the one causing this memory blow-up? You can test this by just running them in a loop; there is no need to have them in a mathematically correct loop (i.e. the gradient).
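For concreteness, the kind of single-operator isolation loop suggested here might look like the sketch below; the geometry, image size, and iteration count are placeholders rather than anything taken from the thread.

import numpy as np
import tigre

# Loop a single operator to see whether it alone leaks memory; swap the
# Atb call for an Ax call to test the forward operator instead.
geo = tigre.geometry(mode="fan", nVoxel=np.array((1, 400, 400)))
angles = np.linspace(0, 2 * np.pi, 1000)
x = np.zeros((1, 400, 400), dtype=np.float32)
proj = tigre.Ax(x, geo, angles)  # one forward projection to create input for Atb

for k in range(100000):
    _ = tigre.Atb(proj, geo, angles)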
Indeed my minimal code could have been more minimal :) Good catch, it seems that it's an issue with the Atb operator. The following code throws the same CUDA OOM error at iteration 707, after 29 minutes.
from functools import partial

import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm


def main():
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]
    domain = gt.shape
    hr_fov = (100, 100)
    spacing = (50, 50)
    PADDING = 1  # This is a percentage!
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)
    centers = [(100, 300), (350, 100), (150, 200), (350, 350), (200, 100), (250, 150), (50, 250), (300, 50), (100, 150), (200, 350), (150, 50), (300, 300), (350, 200), (150, 300), (50, 150), (200, 200), (50, 100), (250, 250), (50, 350), (300, 150), (100, 250), (350, 50), (150, 150), (350, 300), (200, 50), (250, 100), (200, 300), (50, 200), (250, 350), (100, 100), (300, 250), (100, 350), (350, 150), (150, 250), (200, 150), (50, 50), (250, 200), (50, 300), (300, 100), (100, 200), (300, 350), (150, 100), (350, 250), (250, 50), (150, 350), (200, 250), (250, 300), (100, 50), (300, 200)]
    print(centers)
    N = len(centers)
    x = np.zeros(domain, dtype=np.float32)
    geo = tigre.geometry(mode="fan", nVoxel=np.array(x.shape))
    ys = np.array([tigre.Ax(gt, geo, angles) for center in centers])

    # Optimization parameters
    learning_rate = 5e-6
    tolerance = 1e-6
    max_iterations = 1000
    lambda_weight = 1
    K = 10  # Simulates doing the optimization K times with different parameters
    for _ in range(K):
        pbar = tqdm(range(max_iterations))
        for k in pbar:
            bped = np.array([tigre.Atb(ys[i], geo, angles) for i in range(ys.shape[0])])


if __name__ == "__main__":
    main()
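As a side check of the point above that the GPU footprint should stay flat, free device memory can be queried between iterations; the sketch below uses pynvml for that, which is an assumption and not part of the original snippet.

import numpy as np
import pynvml
import tigre

def log_free_gpu_memory(tag):
    # Report free global memory on GPU 0 via NVML.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    free_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).free / 1024**2
    print(f"{tag}: {free_mib:.0f} MiB free on GPU 0")
    pynvml.nvmlShutdown()

def atb_loop_with_gpu_check(ys, geo, angles, iterations=1000):
    # Same back-projection loop as in the example above, plus a periodic
    # check that free GPU memory is not shrinking over time.
    for k in range(iterations):
        _ = np.array([tigre.Atb(ys[i], geo, angles) for i in range(ys.shape[0])])
        if k % 50 == 0:
            log_free_gpu_memory(f"iteration {k}")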
@dveni Thanks! It seems that it also happens with fan beam. Interesting. I will dig into this; hopefully I can find the issue.
@dveni Temporarily: you will find this line of code at TIGRE/Common/CUDA/voxel_backprojection2.cu, line 641 in 50a5c7b, and also in the files https://github.com/CERN/TIGRE/blob/master/Common/CUDA/voxel_backprojection.cu and https://github.com/CERN/TIGRE/blob/master/Common/CUDA/voxel_backprojection_parallel.cu. If you uncomment it and recompile, it should fix the issue. The line restarts the GPU, which is not ideal and should not be used if the GPU is being used in parallel by other processes or if it's being used for ML, but if you are just doing something like the code you showed, restarting it should clean up the memory even if it's badly freed.
Well, it's precisely being used for ML, with as many processes as possible to fill up the VRAM. The code was just a dummy example :) I'll have to wait then; let me know if I can help.
@dveni Sure, I'll let you know if I can find any issue. Just FYI, we could not see issues when trying the new pytorch bindings: https://github.com/CERN/TIGRE/blob/master/Python/demos/d25_Pytorch.py. I would anyway suggest you upgrade your TIGRE to v3.0; there have been many changes since. I don't remember a memory-leak fix being among them, but it's better to be up to date. Edit: albeit I just saw that the pytigre version number has not been updated, oops (lines 509 to 510 in 50a5c7b).
I'll check out the pytorch bindings, thanks, looks promising :)
@dveni I just did a test on my machine and indeed I can see an increase in the memory footprint of the example code you sent me, albeit a small one. After the 1000 iterations it did indeed use around 1 GB of RAM on the CPU side, which is quite significant, but I had around 14 GB of RAM left before it crashed, exactly on iteration 707. It is suspicious that we got exactly the same number.
Specifically, it fails at iteration 707 and subiteration 34 (of the inner loop).
@dveni I am completely stumped by this; I have already looked at it for so many hours. Interestingly, it's not a trivial "out of memory": I found exactly where the error happens, and on my machine there is still 11 GB free when it happens. Digging further, but you found a weird, weird one. I have even asked on Stack Overflow: https://stackoverflow.com/questions/79257953/alternative-reasons-for-out-of-memory-error-than-lack-of-free-global-memory
This may be caused by memory fragmentation, which means I don't have a clear way to fix it. If the issue is memory fragmentation, then this may not have a solution unless I restructure the entire TIGRE. Ah, this was all coded pre-deep-learning revolution, when assuming someone would want to call the operators 50K times was absurd! hehe. Ultimately, if I can't find another solution and the pytorch bindings I linked also cause the issue, the best I can suggest is that you use tomosipo. I use it in LION, the tomographic recon+AI library we are building, and it has never had any issue.
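For reference, the basic tomosipo operator interface looks roughly like the sketch below; the geometry parameters and shapes are assumptions chosen to mirror the example in this thread, not something taken from LION.

import numpy as np
import tomosipo as ts

# Volume and parallel-beam projection geometries (sizes are placeholders).
vg = ts.volume(shape=(1, 400, 400))
pg = ts.parallel(angles=1000, shape=(1, 400))
A = ts.operator(vg, pg)

x = np.zeros((1, 400, 400), dtype=np.float32)
y = A(x)     # forward projection, analogous to tigre.Ax
bp = A.T(y)  # back-projection, analogous to tigre.Atb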
Hi Ander! Thanks a lot for digging into this. I have tried the pytorch bindings but, unfortunately, I have the very same issue. This code snippet breaks at iteration 707 (again) after 14'50'':

import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm
from tigre.utilities.pytorch_bindings import create_pytorch_operator
import torch
import tigre.utilities.gpu as gpu

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())


def main():
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]
    domain = gt.shape
    hr_fov = (100, 100)
    spacing = (50, 50)
    PADDING = 1  # This is a percentage!
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)
    centers = [(100, 300), (350, 100), (150, 200), (350, 350), (200, 100), (250, 150), (50, 250), (300, 50), (100, 150), (200, 350), (150, 50), (300, 300), (350, 200), (150, 300), (50, 150), (200, 200), (50, 100), (250, 250), (50, 350), (300, 150), (100, 250), (350, 50), (150, 150), (350, 300), (200, 50), (250, 100), (200, 300), (50, 200), (250, 350), (100, 100), (300, 250), (100, 350), (350, 150), (150, 250), (200, 150), (50, 50), (250, 200), (50, 300), (300, 100), (100, 200), (300, 350), (150, 100), (350, 250), (250, 50), (150, 350), (200, 250), (250, 300), (100, 50), (300, 200)]
    print(centers)
    N = len(centers)
    x = np.zeros(domain, dtype=np.float32)
    tigre_devices = gpu.getGpuIds()
    local_geo = tigre.geometry(mode="fan", nVoxel=np.array(x.shape))
    ax, atb = create_pytorch_operator(local_geo, angles, tigre_devices)
    ys = np.array([ax(torch.from_numpy(gt)).detach().cpu().numpy() for center in centers])

    # Optimization parameters
    learning_rate = 5e-6
    tolerance = 1e-6
    max_iterations = 1000
    lambda_weight = 1
    K = 10  # Simulates doing the optimization K times with different parameters
    for _ in range(K):
        pbar = tqdm(range(max_iterations))
        for k in pbar:
            bped = np.array([atb(torch.from_numpy(ys[i])).detach().cpu().numpy() for i in range(ys.shape[0])])


if __name__ == "__main__":
    main()

Checking the memory traces in the mem profile, it looks like the pytorch binding calls the same function under the hood. Again, thanks for your help :)
Indeed it uses the same function. Interestingly, in that Stack Overflow post the person who answered is an NVIDIA employee, and they suggest that it's plausible you found a bug in the memory management of CUDA. I'll let you know if that leads to anything.
Background
I'm simulating a local tomography setup where the region of interest must be covered with $N$ scans. I implemented an iterative gradient-based reconstruction using the Ax and Atb operators to solve the following minimization problem:

$$\min_{\mathbf{x} \in \mathcal{X}} \mathcal{L}(\mathbf{x}), \qquad \mathcal{L}(\mathbf{x}) = \frac{1}{2N} \sum_{i=1}^{N} \lVert y_i - A_i(\mathbf{x}) \rVert_2^2$$

where $\mathcal{X}$ is the feasible set and $y_i$ is the observation obtained with the forward operator $A_i$ (i.e. the sample is scanned $N$ times, with $A_i$ corresponding to the local ROI $i$).

Starting from an initial estimate, we run $K$ iterations, updating the estimate at every step:

$$\mathbf{x}^{k+1} = \mathbf{x}^{k} - \eta \nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}^k)$$

where the gradient is

$$\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}) = \frac{-1}{N} \sum_{i=1}^{N} A^T_i\bigl(y_i - A_i(\mathbf{x})\bigr)$$

and $A^T_i$ is the adjoint (back-projection) operator.
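For concreteness, a single update of this scheme with the TIGRE operators might look like the sketch below; using one shared geometry for all N observations is a simplification of the per-ROI operators A_i.

import numpy as np
import tigre

def gradient_step(x, ys, geo, angles, eta):
    # x:  current estimate, shaped like geo.nVoxel
    # ys: stack of the N observations y_i (all sharing one geometry here,
    #     which simplifies away the per-ROI operators A_i)
    N = ys.shape[0]
    grad = np.zeros_like(x)
    for i in range(N):
        residual = ys[i] - tigre.Ax(x, geo, angles)   # y_i - A_i(x)
        grad -= tigre.Atb(residual, geo, angles) / N  # accumulate -(1/N) * A_i^T(y_i - A_i(x))
    return x - eta * grad                             # x^{k+1} = x^k - eta * grad(x^k)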
Expected Behavior
Memory usage is stable regardless of the number of iterations used.
Actual Behavior
Memory grows when using the operators iteratively (heap size grows linearly with the number of iterations while the resident size remains stable, according to a quick memray profiling). As the point of TIGRE is making it easier to use iterative algorithms, I must be messing up somewhere, but I have not managed to find where so far. I suspected that defining a new geometry every time the operators are used could be the issue (hence the ax_geos and atb_geos dictionaries), but that did not change the behavior. It looks like objects are accumulating in memory despite not being used anymore. I have tried using the IterativeReconAlg class as a reference, but I have not been able to find the issue in my code. Any pointers to where I could be keeping things in memory?
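For reference, a rough way to watch this growth per iteration without a full memray run is to log the process memory counters; the sketch below uses psutil (an assumption), and vms is only a loose proxy for the heap size memray reports.

import os
import numpy as np
import psutil
import tigre

def atb_loop_with_memory_log(ys, geo, angles, iterations=1000):
    # Log resident (rss) and virtual (vms) memory every 50 iterations;
    # the memray profile mentioned above shows the heap growing while rss stays flat.
    proc = psutil.Process(os.getpid())
    for k in range(iterations):
        _ = np.array([tigre.Atb(ys[i], geo, angles) for i in range(ys.shape[0])])
        if k % 50 == 0:
            mem = proc.memory_info()
            print(f"iter {k}: rss={mem.rss / 1024**2:.0f} MiB, vms={mem.vms / 1024**2:.0f} MiB")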
Code to reproduce the problem (If applicable)
A minimal code example naively implementing this setting breaks at iteration 707, after 1h18' on a V100, with a CUDA out-of-memory error.
Specifications