
Atb operator CUDA Out Of Memory error #617

Open
dveni opened this issue Dec 5, 2024 · 14 comments

@dveni

dveni commented Dec 5, 2024

Background

I'm simulating a local tomography setup where the region of interest must be covered with N scans. I implemented an iterative gradient-based reconstruction using the Ax and Atb operators to solve the following minimization problem:

$$\mathbf{x}^* = \arg\min_{\mathbf{x}} \quad \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - A_i(\mathbf{x})\rVert^2 \quad \text{s.t.} \quad \mathbf{x}^* \in \mathcal{X}$$

where $\mathcal{X}$ is the feasible set and $y_i$ is the observation obtained with the forward operator $A_i$ (i.e. the sample is scanned $N$ times, with $A_i$ corresponding to the local ROI $i$).

Starting from an initial estimate, we run $K$ iterations, updating the estimate at every step:
$$\mathbf{x}^{k+1} = \mathbf{x}^{k} - \eta \nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}^k)$$

where the gradient is
$$\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}) = -\frac{1}{N} \sum_{i=1}^{N} A^T_i\left(y_i - A_i(\mathbf{x})\right)$$

and where $A^T_i$ is the adjoint (back-projection) operator.
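
As a minimal sketch (not my actual implementation; `Ax`, `Atb`, and the nonnegativity projection here stand in for the per-ROI operators and feasible set defined further down), each update looks roughly like:

```python
import numpy as np

# Rough sketch of one projected gradient step; Ax(i, x) and Atb(i, r) stand in
# for the per-ROI forward and adjoint operators, eta is the step size.
def gradient_step(x, ys, Ax, Atb, N, eta):
    grad = -sum(Atb(i, ys[i] - Ax(i, x)) for i in range(N)) / N
    x_new = x - eta * grad
    return np.maximum(x_new, 0.0)  # projection onto the feasible set (nonnegativity assumed)
```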

Expected Behavior

Memory usage is stable regardless of the number of iterations used.

Actual Behavior

Memory grows when using the operator iteratively (the heap size grows linearly with the number of iterations while the resident size remains stable, according to a quick memray profile). Since the point of TIGRE is making iterative algorithms easier to use, I must be messing up somewhere, but I have not managed to find where so far. I suspected that defining a new geometry every time the operators are used could be the issue (hence the ax_geos and atb_geos dictionaries), but that did not change the behavior. It looks like objects are accumulating in memory despite no longer being used. I have tried using the IterativeReconAlg class as a reference, but I have not been able to find the issue in my code.

Any pointers to where I could be keeping things in memory?
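
For reference, the profiling mentioned above can be reproduced with something along these lines (a rough sketch using memray's `Tracker` API; the output file name is arbitrary):

```python
# Rough sketch of the memray profiling mentioned above; afterwards,
# `memray flamegraph oom_repro.bin` renders the heap report.
from memray import Tracker

with Tracker("oom_repro.bin"):
    main()  # the reproduction script below
```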

Code to reproduce the problem (If applicable)

A minimal code example naively implementing this setting; it breaks at iteration 707, after 1h18' on a V100, with the error:

../Common/CUDA/TIGRE_common.cpp (7): Texture object creation fail
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb out of memory 
from functools import partial
import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm


def pad_around_center(image: np.ndarray, center: tuple):
    """This function pads an image around a given center coordinate so that the image has equal dimensions in every direction from the center"""
    squeezed = False
    if image.ndim == 3:
        image = image[0]
        squeezed = True
    distance_to_left = center[1]
    distance_to_right = image.shape[1] - center[1]
    left_padding = int(max(0, distance_to_right - distance_to_left))
    right_padding = int(max(0, distance_to_left - distance_to_right))

    distance_to_top = center[0]
    distance_to_bottom = image.shape[0] - center[0]
    top_padding = int(max(0, distance_to_bottom - distance_to_top))
    bottom_padding = int(max(0, distance_to_top - distance_to_bottom))

    # print(f"Center: {center}")
    # print(f"Distances - Left: {distance_to_left}, Right: {distance_to_right}, Top: {distance_to_top}, Bottom: {distance_to_bottom}")
    # print(f"Padding - Left: {left_padding}, Right: {right_padding}, Top: {top_padding}, Bottom: {bottom_padding}")

    image_padded = np.pad(
        image,
        ((top_padding, bottom_padding), (left_padding, right_padding)),
        mode="constant",
        constant_values=0,
    )

    if squeezed:
        image_padded = image_padded[np.newaxis, ...]

    return image_padded


def crop_padding(image, pad_width):
    """This function crops the padding from an image"""
    return image[:, pad_width:-pad_width, pad_width:-pad_width]

def mse(image1, image2):
    """This function calculates the mean squared error between two images"""
    return np.mean((image1 - image2) ** 2)


def psnr(image1, image2):
    """This function calculates the peak signal-to-noise ratio between two images"""
    mse_value = mse(image1, image2)
    max_value = np.max(image1)
    return 10 * np.log10(max_value**2 / mse_value)

np.random.seed(42)  # Use any fixed number for reproducibility


def _Ax(
    i: int,
    fov: tuple,
    theta: np.ndarray,
    centers_of_rotation: list,
    x: np.ndarray, ## x is the entire domain, not just a roi
    geos: dict
):
    
    # Center the domain around the rotation center
    rot_center = centers_of_rotation[i]

    if x.ndim == 2:
        x = x[np.newaxis, ...]
    
    centered_x = pad_around_center(
        x, rot_center
    )  # Padded to simulate local tomography with the sample moving
    
    # Calculate projection windows considering both coordinate systems
    window_half_width = fov[1] // 2
    
    # Max because TIGRE's parallel geometry defines the number of horizontal pixels in the detector as the max of the last two dimensions of the volume
    detector_center = np.max((centered_x.shape[-2] // 2, centered_x.shape[-1] // 2))
    
    # ROI in detector coordinates
    det_start = int(detector_center - window_half_width)
    det_end = int(detector_center + window_half_width)
    
    # Define geometry and calculate projection
    if centered_x.shape not in geos:
        geo = tigre.geometry(mode="parallel", nVoxel=np.array(centered_x.shape))
        geos[centered_x.shape] = geo
    else:
        geo = geos[centered_x.shape]
    
    y = tigre.Ax(centered_x, geo, theta)
    
    # Crop the projection to the detector
    det_y = y[:, :, det_start:det_end]


    return det_y


def _Atb(
    theta: np.ndarray,
    y: np.ndarray,
    geos: dict,
    padding_factor: float = 1,
):

    # y is a sinogram where each row is a projection, hence the first dimension is the number of projections and the second and third dims are the number of detector pixels (h,w)
    fov = y.shape[-1]
    # Pad sinogram
    pad_width = int((fov * padding_factor) // 2)
    padded_y = np.pad(y, ((0, 0), (0, 0), (pad_width, pad_width)), mode="edge")
    # Backproject
    local_domain = (padded_y.shape[-2], padded_y.shape[-1], padded_y.shape[-1])
    if local_domain not in geos:
        local_geo = tigre.geometry(mode="parallel", nVoxel=np.array(local_domain))
        geos[local_domain] = local_geo
    else:
        local_geo = geos[local_domain]
    x_tilda = crop_padding(tigre.Atb(padded_y, local_geo, theta), pad_width)

    return x_tilda


def stitch(scans: dict[tuple, np.ndarray], domain: tuple):
    Nscans = len(scans)
    output = np.zeros((Nscans, 1, domain[0], domain[1]), dtype=np.float32)
    for i, (rot_center, roi) in enumerate(scans.items()):
        half_roi = int(roi.shape[-2] // 2), int(roi.shape[-1] // 2)
        s = (
            slice(rot_center[0] - half_roi[0], rot_center[0] + half_roi[0]),
            slice(rot_center[1] - half_roi[1], rot_center[1] + half_roi[1]),
        )
        output[i, 0][s] = roi
    nelements = np.sum(output.astype(bool), axis=0)
    nelements = np.where(nelements == 0, 1, nelements)
    output = np.sum(output, axis=0) / nelements # TODO: This is a naive way of averaging the local scans
    return output


def main():
    
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]  # TODO: Forced 3D
    domain = gt.shape
    hr_fov = (100, 100)
    spacing = (50, 50)
    PADDING = 1  # This is a percentage!
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)

    centers = [(100, 300), (350, 100), (150, 200), (350, 350), (200, 100), (250, 150), (50, 250), (300, 50), (100, 150), (200, 350), (150, 50), (300, 300), (350, 200), (150, 300), (50, 150), (200, 200), (50, 100), (250, 250), (50, 350), (300, 150), (100, 250), (350, 50), (150, 150), (350, 300), (200, 50), (250, 100), (200, 300), (50, 200), (250, 350), (100, 100), (300, 250), (100, 350), (350, 150), (150, 250), (200, 150), (50, 50), (250, 200), (50, 300), (300, 100), (100, 200), (300, 350), (150, 100), (350, 250), (250, 50), (150, 350), (200, 250), (250, 300), (100, 50), (300, 200)]
    print(centers)
    N = len(centers)

    ax_geos = {}
    atb_geos = {}

    Ax = partial(
        _Ax,
        fov=hr_fov,
        centers_of_rotation=centers,
        theta=angles,
        geos=ax_geos
    )
    
    Atb = partial(
        _Atb,
        theta=angles,
        padding_factor=PADDING,
        geos=atb_geos
    )


    ys = np.array([Ax(i=i, x=gt) for i in range(len(centers))]).astype(np.float32)
    x = np.zeros(domain, dtype=np.float32)

    # Optimization parameters
    learning_rate=5e-6
    tolerance=1e-6
    max_iterations=1000
    lambda_weight=1

    K = 10 # Simulates doing the optimization K times with different parameters

    for _ in range(K):
        pbar = tqdm(range(max_iterations))
        for k in pbar:

            # print(f"{len(ax_geos)=}")
            # print(f"{len(atb_geos)=}")

            step_ys = np.array([Ax(i=i, x=x) for i in range(N)]).astype(np.float32)

            gradients = {
                centers[i]: (Atb(y=(ys[i] - step_ys[i])))
                for i in range(N)
            }
            gradient = stitch(gradients, domain[1:]).astype(np.float32)

            grad_fidelity = lambda_weight * gradient

            grad = -grad_fidelity

            x_new = x - learning_rate * grad

            x_new = np.maximum(x_new, 0) 

            dNorm = np.linalg.norm(x_new - x)
            if dNorm < tolerance:
                break

            # Update progress bar
            progress_desc = f"dNorm: {dNorm:.4f}"
            if gt is not None:
                gt_loss = mse(x_new, gt)
                progress_desc += f" | GT loss: {gt_loss:.4f}"
            pbar.set_description(progress_desc)
            
            # Update
            x = x_new


if __name__ == "__main__":
    main()

Specifications

  • python version: 3.11
  • OS: x86_64 GNU/Linux
  • CUDA version: 12.4
  • pytigre version: 2.4.0
@AnderBiguri
Member

Interesting...

Indeed the operators (regardless of geometry being changed or not) should not take any memory, i.e. the footprint of the memory after calling them should be the same (or +image size, if the image variable was not created before the call).

This could be a serious bug, but I have never experienced it before, so I wonder if there is a combination of OS/CUDA/etc. versions that is causing this to go badly. To try to verify this, can you make an even smaller example? If this is caused by the TIGRE operators, it should also happen if you don't use your definitions of _Ax and _Atb, for example. Can you make an example without all that, to see whether it has to do with the way you handle the Python memory?

Also, could you do another test for me? Can you run this with the fan-beam code? I understand the result will be wrong, but the codebases for parallel-beam and fan-beam are different, and I wonder if there is a bug in parallel beam, which is considerably less used by the community and therefore more prone to undetected bugs.

@AnderBiguri
Member

A third question: can you see which of Ax/Atb is the one causing this memory blow-up? You can test this by just running them in a loop; no need to have them in a mathematically correct loop (i.e. the gradient).
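
Something along these lines would be enough (an untested sketch; it just calls each operator repeatedly with a fixed geometry so you can watch the process memory):

```python
# Untested sketch: run each operator on its own in a tight loop and watch memory.
import numpy as np
import tigre
from skimage.data import shepp_logan_phantom

gt = shepp_logan_phantom().astype(np.float32)[None, ...]
geo = tigre.geometry(mode="parallel", nVoxel=np.array(gt.shape))
angles = np.linspace(0, 2 * np.pi, 100)
y = tigre.Ax(gt, geo, angles)

for k in range(100_000):
    _ = tigre.Ax(gt, geo, angles)   # comment out one of these two lines...
    _ = tigre.Atb(y, geo, angles)   # ...to see which operator causes the blow-up
```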

@dveni
Author

dveni commented Dec 5, 2024

Indeed my minimal code could have been more minimal :)

Good catch, it seems it's an issue with the Atb operator. The following code throws the same CUDA OOM error at iteration 707, after 29 minutes.

../Common/CUDA/TIGRE_common.cpp (7): Main loop fail
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb out of memory
from functools import partial
import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm

def main():
    
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]
    domain = gt.shape
    hr_fov = (100, 100)
    spacing = (50, 50)
    PADDING = 1  # This is a percentage!
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)

    centers = [(100, 300), (350, 100), (150, 200), (350, 350), (200, 100), (250, 150), (50, 250), (300, 50), (100, 150), (200, 350), (150, 50), (300, 300), (350, 200), (150, 300), (50, 150), (200, 200), (50, 100), (250, 250), (50, 350), (300, 150), (100, 250), (350, 50), (150, 150), (350, 300), (200, 50), (250, 100), (200, 300), (50, 200), (250, 350), (100, 100), (300, 250), (100, 350), (350, 150), (150, 250), (200, 150), (50, 50), (250, 200), (50, 300), (300, 100), (100, 200), (300, 350), (150, 100), (350, 250), (250, 50), (150, 350), (200, 250), (250, 300), (100, 50), (300, 200)]
    print(centers)
    N = len(centers)
    x = np.zeros(domain, dtype=np.float32)

    geo = tigre.geometry(mode="fan", nVoxel=np.array(x.shape))
    ys = np.array([tigre.Ax(gt, geo, angles) for center in centers])
    
    # Optimization parameters
    learning_rate=5e-6
    tolerance=1e-6
    max_iterations=1000
    lambda_weight=1

    K = 10 # Simulates doing the optimization K times with different parameters

    for _ in range(K):
        pbar = tqdm(range(max_iterations))
        for k in pbar:
            bped = np.array([tigre.Atb(ys[i], geo, angles) for i in range(ys.shape[0])])


if __name__ == "__main__":
    main()

@AnderBiguri
Member

@dveni Thanks! It seems that it also happens with fan beam. Interesting. I will dig into this, hopefully I can find the issue.

@AnderBiguri
Member

@dveni as a temporary workaround: you will find this line of code:

// cudaDeviceReset(); // For the Nvidia Visual Profiler

in both https://github.com/CERN/TIGRE/blob/master/Common/CUDA/voxel_backprojection.cu and https://github.com/CERN/TIGRE/blob/master/Common/CUDA/voxel_backprojection_parallel.cu. If you uncomment it and recompile, it should fix the issue.

That line resets the GPU, which is not ideal and should not be used if the GPU is shared with other processes or is being used for ML, but if you are just running something like the code you showed, resetting it should clean up the memory even if it is badly freed.

@dveni
Author

dveni commented Dec 5, 2024

Well, it is precisely being used for ML, with as many processes as possible to fill up the VRAM. The code was just a dummy example :)

I'll have to wait then, let me know if I can help

@AnderBiguri
Member

AnderBiguri commented Dec 5, 2024

@dveni sure, I'll let you know if I find anything. Just FYI, we could not see issues when trying the new PyTorch bindings: https://github.com/CERN/TIGRE/blob/master/Python/demos/d25_Pytorch.py. I would anyway suggest you upgrade your TIGRE to v3.0; there have been many changes since. I don't remember a memory-leak fix being among them, but better to be up to date.

Edit: although I just saw that the pytigre version number has not been updated. Oops.

TIGRE/Python/setup.py

Lines 509 to 510 in 50a5c7b

name="pytigre",
version="2.4.0",

@dveni
Author

dveni commented Dec 5, 2024

I'll check out the PyTorch bindings, thanks, they look promising :)
Haha, just saw the version bump. I installed TIGRE from the GitHub repo and the latest commit is from 11.11, so I guess it should be fine.

@AnderBiguri
Member

@dveni I just did a test on my machine and indeed I can see an increase in the memory footprint of the example code you sent me, but it's a very minimal one. After the 1000 iterations it had used around 1 GB of RAM on the CPU side, which is quite significant, yet I still had around 14 GB of RAM left when it crashed, exactly on iteration 707. It is suspicious that we got exactly the same number.

@AnderBiguri
Member

Specifically at iteration 707 and sub-iteration 34 (of the inner loop over ys.shape[0]). I can't find any reason why it is this particular number.

@AnderBiguri
Member

AnderBiguri commented Dec 6, 2024

@dveni I am completely stumped by this; I have already looked at it for many hours. Interestingly, it's not a trivial "out of memory". I found exactly where the error happens, and on my machine there are still 11 GB free when it occurs. Digging further, but you found a weird, weird one. I have even asked on Stack Overflow: https://stackoverflow.com/questions/79257953/alternative-reasons-for-out-of-memory-error-than-lack-of-free-global-memory

@AnderBiguri
Member

This may be caused by memory fragmentation, which means I don't have a clear way to fix it.
Try the PyTorch bindings that I linked above and see if that works.

If the issue is memory fragmentation, then this may not have a solution unless I restructure the entire TIGRE. Ah, this was all coded pre-deep-learning revolution, so assuming someone would want to call the operators 50K times seemed absurd! hehe.

Ultimately, if I can't find another solution and the PyTorch bindings I linked also cause the issue, the best I can suggest is that you use tomosipo. I use it in LION, the tomographic recon + AI library we are building, and it has never had any issue.
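
For what it's worth, a quick way to tell a genuine leak from fragmentation is to log the free device memory every few iterations (a rough sketch using the nvidia-ml-py / pynvml package, assuming GPU 0): if the free memory stays high while Atb still fails, fragmentation is the likely culprit.

```python
# Rough sketch: log free GPU memory to distinguish a leak (free memory steadily
# drops) from fragmentation (free memory stays high but the allocation still fails).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU 0

def log_free_mem(iteration):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"iter {iteration}: {info.free / 1e6:.0f} MB free of {info.total / 1e6:.0f} MB")
```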

@dveni
Author

dveni commented Dec 9, 2024

Hi Ander! Thanks a lot for digging into this. I have tried the pytorch bindings but, unfortunately, I have the very same issue. This code snippet breaks at iteration 707 (again) after 14'50'':

import numpy as np
import tigre
from skimage.data import shepp_logan_phantom
from tqdm import tqdm
from tigre.utilities.pytorch_bindings import create_pytorch_operator
import torch
import tigre.utilities.gpu as gpu

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

def main():
    
    gt = shepp_logan_phantom().astype(np.float32)[None, ...]
    domain = gt.shape
    hr_fov = (100, 100)
    spacing = (50, 50)
    PADDING = 1  # This is a percentage!
    NANGLES = 1000
    angles = np.linspace(0, 2 * np.pi, NANGLES)

    centers = [(100, 300), (350, 100), (150, 200), (350, 350), (200, 100), (250, 150), (50, 250), (300, 50), (100, 150), (200, 350), (150, 50), (300, 300), (350, 200), (150, 300), (50, 150), (200, 200), (50, 100), (250, 250), (50, 350), (300, 150), (100, 250), (350, 50), (150, 150), (350, 300), (200, 50), (250, 100), (200, 300), (50, 200), (250, 350), (100, 100), (300, 250), (100, 350), (350, 150), (150, 250), (200, 150), (50, 50), (250, 200), (50, 300), (300, 100), (100, 200), (300, 350), (150, 100), (350, 250), (250, 50), (150, 350), (200, 250), (250, 300), (100, 50), (300, 200)]
    print(centers)
    N = len(centers)


    x = np.zeros(domain, dtype=np.float32)

    tigre_devices   = gpu.getGpuIds()
    local_geo = tigre.geometry(mode="fan", nVoxel=np.array(x.shape))
    ax, atb = create_pytorch_operator(local_geo, angles, tigre_devices)
    ys = np.array([ax(torch.from_numpy(gt)).detach().cpu().numpy() for center in centers])

        
    
    # Optimization parameters
    learning_rate=5e-6
    tolerance=1e-6
    max_iterations=1000
    lambda_weight=1

    K = 10 # Simulates doing the optimization K times with different parameters

    for _ in range(K):
        pbar = tqdm(range(max_iterations))
        for k in pbar:
            bped = np.array([atb(torch.from_numpy(ys[i])).detach().cpu().numpy() for i in range(ys.shape[0])])

       
if __name__ == "__main__":
    main()

Checking the memory traces in the profile, it looks like the PyTorch binding calls tigre.Atb too. As the issue seems to lie in the memory allocation at the CUDA level, I guess I'll have to move on to tomosipo as suggested.

Again, thanks for your help :)

@AnderBiguri
Member

Indeed, it uses the same function. Interestingly, the person who answered that Stack Overflow post is an NVIDIA employee, and they suggest it is plausible that you found a bug in CUDA's memory management. I'll let you know if that leads to anything.
