Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add version of Image::getArray (and Mesh::getField) that only wraps data in numpy.ndarray rather than copying it #903

Closed
cchriste opened this issue Jan 16, 2021 · 4 comments · Fixed by #1427
Assignees

Comments

@cchriste
Copy link
Contributor

It's inefficient to copy an image's data in order to modify it with numpy and then reconstruct the image from the copied data (which probably copies it again). Instead, we can use this version with a capsule, except our capsule will not free the data when the numpy array goes out of scope since it's actually owned by the itk image.

This will make operations like the ones @jadie1 is performing much more efficient.
The seeming danger is the image being free'd and the numpy array trying to be used afterwards. So add a note to the function's documentation (to be defined as a lambda in ShapeworksPython.cpp).

#include <iostream>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

PYBIND11_PLUGIN(numpywrap) {
    py::module m("numpywrap");
    m.def("f", []() {
        // Allocate and initialize some data; make this big so
        // we can see the impact on the process memory use:
        constexpr size_t size = 100*1000*1000;
        double *foo = new double[size];
        for (size_t i = 0; i < size; i++) {
            foo[i] = (double) i;
        }

        // Create a Python object that will free the allocated
        // memory when destroyed:
        py::capsule free_when_done(foo, [](void *f) {
            double *foo = reinterpret_cast<double *>(f);
            std::cerr << "Element [0] = " << foo[0] << "\n";
            std::cerr << "freeing memory @ " << f << "\n";
            delete[] foo;
        });

        return py::array_t<double>(
            {100, 1000, 1000}, // shape
            {1000*1000*8, 1000*8, 8}, // C-style contiguous strides for double
            foo, // the data pointer
            free_when_done); // numpy array references this parent
    });
    return m.ptr();
}
>>> import numpywrap
>>> z = numpywrap.f()
>>> # the python process is now taking up a bit more than 800MB memory
>>> z[1,1,1]
1001001.0
>>> z[0,0,100]
100.0
>>> z[99,999,999]
99999999.0
>>> z[0,0,0] = 3.141592
>>> del z
Element [0] = 3.14159
freeing memory @ 0x7fd769f12010
>>> # python process memory size has dropped back down

From https://stackoverflow.com/questions/44659924/returning-numpy-arrays-via-pybind11

Note: This topic seems to confirm that itkImage::GetBufferPointer() does not copy the data: http://itk-insight-users.2283740.n2.nabble.com/ITK-users-ITK-Image-data-pointer-and-GPU-algorithms-td7589655.html
(Our current Image.toArray function does copy the data because that's the default behavior of the py::array constructor.)

Lots of info here useful to ensure data is deallocated when no longer needed: pybind/pybind11#1042
For example, it may be possible to use a simpler version, or to ensure data is not deleted until it goes out of scope in both C++ and Python. But since it's the itkImage's data, I'm not sure we'd have that control, so the simplest method might be to just not delete it, recognizing the potential risk on the Python side, and gaining the (significant) advantage of not copying back and forth.

This is probably not difficult, but I want to make sure we have a use case, so @jadie1 when you have time please add one to this issue. I'll implement it and throw it back over the fence for you to test. Thanks!

@cchriste
Copy link
Contributor Author

I just want to reiterate: this is probably not difficult and there could be very significant performance benefits.
But it will be more constructive to go in with a concrete use case.

@jadie1
Copy link
Contributor

jadie1 commented Feb 11, 2021

One use case is in the ShapeCohortGenPackage. In the generate_images() function in CohortGenUtils.py we take a segmentation, add a foreground and background with noise and blur the boundary to create a synthetic image.

As a very simple test:

  • Take any image and get a numpy array from it
  • Add a constant value to the numpy array
  • Write the numpy array back to an image (this is the part that doesn't work currently)

The output image should look the same as the input image but brighter.

@cchriste
Copy link
Contributor Author

cchriste commented Feb 12, 2021

  • Write the numpy array back to an image (this is the part that doesn't work currently)
    This issue is meant to remove this necessity by simply passing a reference to the original image's memory so that modification thereof will modify the image itself.

Keep in mind that if the numpy image is reallocated, say by changing the image size, it will no longer point to this memory (though the memory can be passed without passing ownership thereof, so it wouldn't introduce a crash potential).
https://pybind11.readthedocs.io/en/stable/advanced/functions.html#return-value-policies

@cchriste
Copy link
Contributor Author

cchriste commented May 27, 2021

HOWTO go from C++ to Python withut copying:
// Pass pointer to c++ data to Python without copying.
// Also see the bindings for Physical/IndexRegion.min/max in ShapeworksPython.cpp
void addInput(std::shared_ptr<const MyFancyMatrix> packet)
{
  // When a valid object is passed as 'base', it tells pybind not to take ownership of the data,
  // because 'base' will own it. In fact 'packet' will own it, but - psss! - , we don't tell it to
  // pybind... Alos note that ANY valid object is good for this purpose, so I chose "str"...
  py::str dummyDataOwner;
  py::array input{getNumpyDTypeFromElementType(packet->elementType()), packet->shape(),
                  packet->strides(), packet->data(), dummyDataOwner};

  assert(!input.owndata()); // if fires, s.g. is not OK with ownership (probably matrix data was copied)
  // ...
}
(from https://github.com/pybind/pybind11/issues/323#issuecomment-575717041)

HOWTO to go from Python to C++ without copying:
// Also see our Image init function that copies from a py::array in ShapeworksPython.cpp.
// That function may be able to steal or share the data from numpy rather than copying it.
/**
 * \brief Returns span<T> from py:array_T<T>. Efficient as zero-copy.
 * \tparam T Type.
 * \param passthrough Numpy array.
 * \return Span<T> that with a clean and safe reference to contents of Numpy array.
 */
template<class T=float32_t>
inline std::span<T> toSpan(const py::array_t<T>& passthrough)
{
	py::buffer_info passthroughBuf = passthrough.request();
	if (passthroughBuf.ndim != 1) {
		throw std::runtime_error("Error. Number of dimensions must be one");
	}
	size_t length = passthroughBuf.shape[0];
	T* passthroughPtr = static_cast<T*>(passthroughBuf.ptr);
	std::span<T> passthroughSpan(passthroughPtr, length);
	return passthroughSpan;
}
(from https://github.com/pybind/pybind11/issues/1042#issuecomment-663154709)

@cchriste cchriste changed the title Add version of Image::getArray that only wraps data in numpy.ndarray rather than copying it Add version of Image::getArray (and Mesh::getField) that only wraps data in numpy.ndarray rather than copying it Aug 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants