
Remove Cython dependency #74

Closed
carloshorn opened this issue Aug 6, 2020 · 7 comments · Fixed by #83

Comments

@carloshorn
Collaborator

carloshorn commented Aug 6, 2020

It would be nice to replace correct_tsm_issue.mean_filter with a pure Python implementation.

A possible solution is:

import numpy as np
import bottleneck as bn

def rolling_window(a, shape):  # rolling window for 2D array
    s = (a.shape[0] - shape[0] + 1,) + (a.shape[1] - shape[1] + 1,) + shape
    strides = a.strides + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=s, strides=strides)

def std_filter(data, box_size):
    shape = (box_size, box_size)
    border = box_size // 2
    # need to surround the data with NaNs to calculate values at the boundary
    padded_data = np.pad(
        data, (border, border), 
        mode='constant', 
        constant_values=np.nan
    )
    windows = rolling_window(padded_data, shape)
    # flatten each window so nanstd can reduce over a single axis
    n3, n2, n1, n0 = windows.shape
    windows = windows.reshape((n3, n2, n1*n0))
    result = bn.nanstd(windows, axis=2)
    return result

Using bottleneck instead of the equivalent numpy function reduces the processing time enough to be competitive with Cython or numba.
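For cross-checking, the same windowing trick also works with plain numpy's nanstd (the slow variant referred to above); a minimal sketch, where the small example array is an assumption for illustration only:

```python
import numpy as np

def rolling_window(a, shape):
    # build a (rows, cols, box, box) strided view without copying data
    s = (a.shape[0] - shape[0] + 1, a.shape[1] - shape[1] + 1) + shape
    strides = a.strides + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=s, strides=strides)

def std_filter_numpy(data, box_size):
    # reference version using np.nanstd instead of bottleneck
    border = box_size // 2
    padded = np.pad(data, border, mode='constant', constant_values=np.nan)
    windows = rolling_window(padded, (box_size, box_size))
    n3, n2, n1, n0 = windows.shape
    return np.nanstd(windows.reshape(n3, n2, n1 * n0), axis=2)

data = np.arange(16, dtype=float).reshape(4, 4)
result = std_filter_numpy(data, 3)
# each output pixel is the NaN-aware std of its 3x3 neighbourhood
```

Swapping np.nanstd for bn.nanstd gives the fast version; the results should be numerically identical.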

Another possibility would be an implementation with numba:

import numpy as np
import numba as nb

@nb.njit
def std_filter_2(data, box_size):
    N = box_size // 2
    nrows, ncols = data.shape
    std = np.empty_like(data)
    for j in range(nrows):
        for i in range(ncols):
            total = 0.0
            count = 0
            jj_min = max(0, j-N)
            jj_max = min(nrows, j+N+1)
            ii_min = max(0, i-N)
            ii_max = min(ncols, i+N+1)
            for jj in range(jj_min, jj_max):
                for ii in range(ii_min, ii_max):
                    value = data[jj, ii]
                    if np.isnan(value):
                        continue
                    total += value
                    count += 1
            if count == 0:
                std[j, i] = np.nan
                continue
            mean = total / count
            diff2 = 0.0
            for jj in range(jj_min, jj_max):
                for ii in range(ii_min, ii_max):
                    value = data[jj, ii]
                    if np.isnan(value):
                        continue
                    diff2 += (value - mean)**2
            std[j, i] = np.sqrt(diff2 / count)
    return std

Edit: Slightly cleaner numba implementation.
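The nested-loop algorithm above can also be sanity-checked without numba by dropping the @nb.njit decorator and vectorising the inner loops; a sketch of such a plain (much slower) reference version, with the toy input being an assumption:

```python
import numpy as np

def std_filter_loop(data, box_size):
    # same algorithm as the numba version, minus the @nb.njit decorator
    N = box_size // 2
    nrows, ncols = data.shape
    std = np.empty_like(data)
    for j in range(nrows):
        for i in range(ncols):
            # clip the box at the array borders
            jj_min, jj_max = max(0, j - N), min(nrows, j + N + 1)
            ii_min, ii_max = max(0, i - N), min(ncols, i + N + 1)
            box = data[jj_min:jj_max, ii_min:ii_max]
            valid = box[~np.isnan(box)]
            # population std over the valid neighbours, NaN if none
            std[j, i] = valid.std() if valid.size else np.nan
    return std

data = np.arange(9, dtype=float).reshape(3, 3)
data[0, 0] = np.nan  # exercise the NaN handling
result = std_filter_loop(data, 3)
```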

@mraspaud
Member

mraspaud commented Aug 6, 2020

I find the numpy/bottleneck implementation cleaner.
Could you provide some numbers to get an idea of the performance?

@carloshorn
Collaborator Author

[Screenshot: timing comparison of the three implementations]

Using pure numpy (1300 ms) is definitely too slow, because processing an entire file with pygac-run takes about 10 s. However, numpy + bottleneck (272 ms) is competitive with numba (115 ms), which should be similar in speed to a Cython implementation.
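Numbers like these can be reproduced with the standard library's timeit; a minimal harness sketch (the array shape and NaN fraction below are assumptions for the demo, not the setup used for the figures above):

```python
import timeit

import numpy as np

# hypothetical input; real GAC scenes are larger, this just keeps the demo fast
rng = np.random.default_rng(42)
data = rng.random((200, 200))
data[rng.random(data.shape) < 0.05] = np.nan  # ~5% missing values

def bench(func, *args, number=3, repeat=3):
    """Return the best time per call in milliseconds."""
    best = min(timeit.repeat(lambda: func(*args), number=number, repeat=repeat))
    return best / number * 1e3

# e.g. bench(std_filter, data, 3) vs. bench(std_filter_2, data, 3)
# for each candidate implementation defined earlier in the thread
```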

@mraspaud
Member

mraspaud commented Aug 6, 2020

That's really nice!

@sfinkens
Member

👍 for the numpy + bottleneck variant!

@carloshorn
Collaborator Author

A clear plus for numpy + bottleneck is that both packages have a major version greater than one. People could argue that a stable release should be a minimum requirement for production code.
Maybe in the future the numba stencil feature could be an elegant solution, but for today I would also vote for numpy + bottleneck.

@djhoese
Member

djhoese commented Aug 10, 2020

both packages have a major version greater than one.

This is not a good reason for choosing packages in the Python ecosystem, especially the scientific Python ecosystem. I've argued in the past on other forums that Cython needs to go to version 1.0, but they had their own timeline for things. Regardless, it looks like Cython is working on a 3.0 release (skipping 1.0 and 2.0): https://pypi.org/project/Cython/#history

@carloshorn
Collaborator Author

@djhoese, that's true; a good example of your point is the pandas library. It was the first choice for many things long before the release of 1.0 in January this year.
