-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly reserve the grid data dtype by converting ctypes array to numpy array with np.ctypeslib.as_array #3446
Conversation
9ad29c9
to
fea06b5
Compare
fea06b5
to
fb97daa
Compare
@@ -41,4 +41,4 @@ def test_sphinterpolate_no_outgrid(mars): | |||
npt.assert_allclose(temp_grid.max(), 14628.144) | |||
npt.assert_allclose(temp_grid.min(), -6908.1987) | |||
npt.assert_allclose(temp_grid.median(), 118.96849) | |||
npt.assert_allclose(temp_grid.mean(), 272.60578) | |||
npt.assert_allclose(temp_grid.mean(), 272.60593) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The grid mean value is updated from 272.60578
to 272.60593
.
To verify which value is correct, I first generate the output grid using the following GMT CLI:
gmt sphinterpolate @mars370d.txt -Rg -I1 -Gout.nc
then get the mean value by:
In [1]: import xarray as xr
In [2]: da = xr.load_dataarray("out.nc")
In [3]: da.mean()
Out[3]:
<xarray.DataArray 'z' ()> Size: 4B
array(272.60593, dtype=float32)
With option 1, the ctypes array was first converted to a Python list and then converted to a numpy array by np.reshape
. The dtype
of the result numpy array is float64
. However, the GMT netCDF grids are stored in float32
by default. So, there are data type conversions with option 1, which can cause the tiny difference in the data mean value.
One thing to note is that, in GMT, the mean value is weighted by spherical area for geographic grids, so it's different from the one reported by xarray.
gmt grdinfo out.nc -L2
out.nc: Title:
out.nc: Command: gmt sphinterpolate /Users/seisman/.gmt/cache/mars370d.txt -Rg -I1 -Gout.nc
out.nc: Remark:
out.nc: Gridline node registration used [Geographic grid]
out.nc: Grid file format: nf = GMT netCDF format (32-bit float), CF-1.7
out.nc: x_min: 0 x_max: 360 x_inc: 1 name: longitude n_columns: 361
out.nc: y_min: -90 y_max: 90 y_inc: 1 name: latitude n_rows: 181
out.nc: v_min: -6908.19873047 v_max: 14628.1435547 name: z
out.nc: scale_factor: 1 add_offset: 0
out.nc: mean: 177.476587475 stdev: 3671.3669124 rms: 3675.62602319
out.nc: format: netCDF-4 chunk_size: 181,181 shuffle: on deflation_level: 3
out.nc: Default CPT:
CodSpeed Performance ReportMerging #3446 will improve performances by 82.61%Comparing Summary
Benchmarks breakdown
|
We can also declare it as a bug, since the data dtype was not preserved in the old version. |
|
||
# The data array without paddings | ||
data = np.ctypeslib.as_array(self.data, shape=(header.my, header.mx)).copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to be cautious that we need to make a copy of the ctypes/numpy array, since the memory of the array is allocated by GMT and will be freed when exiting the session.
Trying to understand this sentence a bit more. So we can't assign ownership of the array data from GMT to numpy/xarray, and need to make a copy to ensure that exiting the GMT session doesn't crash the program? It doesn't look like we needed to explicitly make a copy before, so why is it needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can't assign ownership of the array data from GMT to numpy/xarray, and need to make a copy to ensure that exiting the GMT session doesn't crash the program?
Yes, the arrays allocated by GMT will be freed when exiting the GMT session. This can be verified by the example below. In the GMT session, we can correctly access the grid.x
array, but outside the session, grid.x
reports a "NULL pointer access" error.
In [6]: from pygmt.clib import Session
In [7]: with Session() as lib:
...: with lib.virtualfile_out(kind="grid") as voutgrd:
...: lib.call_module("read", ["@static_earth_relief.nc", voutgrd, "-Tg"])
...: grid = lib.read_virtualfile(voutgrd, kind="grid").contents
...: header = grid.header.contents
...: print(grid.x[:header.n_columns])
...: print(grid.x[0:5])
[-54.5, -53.5, -52.5, -51.5, -50.5, -49.5, -48.5, -47.5]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 7
5 header = grid.header.contents
6 print(grid.x[:header.n_columns])
----> 7 print(grid.x[0:5])
ValueError: NULL pointer access
It doesn't look like we needed to explicitly make a copy before, so why is it needed now?
As explained in the top post, in the previous version, we first access the ctypes array via the list slice syntax (i.e., ctypes_array[:120]
) and then convert the list slice to a numpy array by np.reshape
. The list slice doesn't make a new copy of the array memory, but the list->ndarray conversion changes the data dtype from float32
(the default dtype for GMT grid) to float64
(the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:
array1 = np.reshape(ctypes_array[:120], (10, 12))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like we needed to explicitly make a copy before, so why is it needed now?
As explained in the top post, in the previous version, we first access the ctypes array via the list slice syntax (i.e.,
ctypes_array[:120]
) and then convert the list slice to a numpy array bynp.reshape
. The list slice doesn't make a new copy of the array memory, but the list->ndarray conversion changes the data dtype fromfloat32
(the default dtype for GMT grid) tofloat64
(the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:array1 = np.reshape(ctypes_array[:120], (10, 12))
Here is a test script:
In [22]: import ctypes
In [23]: import numpy as np
In [24]: ctypes_array = (ctypes.c_uint8 * 12)(*range(12))
In [25]: array1 = np.reshape(ctypes_array[:12], (3, 4))
In [26]: ctypes_array[:12]
Out[26]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
In [27]: array1
Out[27]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [28]: ctypes_array[:12] = list(range(100, 112))
In [29]: ctypes_array
Out[29]: <__main__.c_ubyte_Array_12 at 0x7f88d81b1fd0>
In [30]: ctypes_array[:12]
Out[30]: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
In [31]: array1
Out[31]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the list->ndarray conversion changes the data dtype from
float32
(the default dtype for GMT grid) tofloat64
(the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:
Ok, this part makes sense, so we were implicitly doing a copy of the array before, whereas now it's more explicit via a .copy()
call.
the arrays allocated by GMT will be freed when exiting the GMT session.
Thanks for showing the "NULL pointer access" exception. We probably don't want the pointer to remain active after the session is closed since that would be a memory leak, but is there a way for GMT to pass ownership of the array data to NumPy before exiting the session, so that the memory is then managed by Python (and its garbage collector)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way for GMT to pass ownership of the array data to NumPy before exiting the session, so that the memory is then managed by Python (and its garbage collector)?
I thought the API function GMT_Set_AllocMode
may help, but I just tried it locally and it doesn't work as I expected. Maybe another upstream bug or something else that needs to debug the C codes.
Edit: The API documentation says:
When external programs supply their own allocated memory (e.g., grid arrays, matrices) that are then hooked into the GMT containers data pointer, we need to flag the allocation mode as being external to GMT.
...
This change prevents GMT from trying to free memory it did not allocate.
The API function only tells GMT to not free the memory allocated by Python and can't do the reverse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this post https://stackoverflow.com/questions/28760273/python-ctypes-and-garbage-collection and some answers from ChatGPT, it seems even if it's possible to tell GMT to not free the memory, the Python side is still responsible to free the memory explicitly. Otherwise, there may be memory leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so we either need to:
- Allocate the memory space in Python and have GMT C use it with
GMT_Set_AllocMode
so that memory is still managed by Python - Allocate memory space in GMT C, pass ownership to Python, and ensure that Python handles free-ing the memory when no longer needed
Maybe something to think about longer term if we want to avoid copies 🙂
Description of proposed changes
There are two different ways to convert a ctypes array to a numpy array with a specific shape:
np.reshape(ctypes_array[:120], (10, 12))
np.ctypeslib.as_array(ctypes_array, (10, 12))
Currently we use option 1 but option 2 is better. In this PR, we use option 2.
As shown in the benchmarks below. Option 2 has at least two pros:
Also need to be cautious that we need to make a copy of the ctypes/numpy array, since the memory of the array is allocated by GMT and will be freed when exiting the session.