APIs for discovering array structure as pertains to linear algebra operations #783
-
Thanks for your thoughts @ilayn! I think there are two separate topics here: matrix structure, and efficient structure discovery.
There is an
-
Efficient structure discovery is interesting. The first question I have is how well-defined that topic is. The utilities are easy to implement in principle perhaps, e.g.:

```python
def issymmetric(x):
    return all(x == matrix_transpose(x))
```

which I believe is all there is to it for exact symmetry. So to make sure: is that always what you want here, or would one need some inexact approximation to symmetry, where if all elements are equal to within a certain tolerance, one chooses a symmetric solver routine? That's what is being done now in SciPy: https://github.com/scipy/scipy/blob/840a9ed0f41f84ce0bbf8c104bfbbbe93781cbdd/scipy/linalg/_cythonized_array_utils.pyx#L239-L323

It seems a little tricky to standardize that, because I imagine it's more an internal routine than something that NumPy, PyTorch et al. would want to expose directly to their end users. If so, perhaps what is needed is more a library-agnostic implementation of this function? For your purposes, it's something SciPy should be able to easily vendor, and it should be an implementation that's guaranteed to work well with PyTorch, CuPy, JAX, et al. Correct?
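For the inexact variant, a minimal library-agnostic sketch could look like the following (my sketch, assuming NumPy; the function name is illustrative, and the `rtol`/`atol` handling only loosely mirrors `scipy.linalg.issymmetric`):

```python
import numpy as np

def issymmetric_tol(x, rtol=None, atol=None):
    # Square 2-D input is a precondition for symmetry.
    if x.ndim != 2 or x.shape[0] != x.shape[1]:
        return False
    if rtol is None and atol is None:
        # Exact check: every element equals its transpose counterpart.
        return bool(np.all(x == x.T))
    # Inexact check: elements match their mirror within the given tolerances.
    return bool(np.allclose(x, x.T, rtol=rtol or 0.0, atol=atol or 0.0))
```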
-
Indeed, if we had these per-array-vendor implementations for such operations, then SciPy probably wouldn't need all those home-baked Cythonized array utilities to shave off some microseconds. We could then rely on performant check routines for the arrays out of the box, in a unified fashion.

Yes, but `np.isfinite` returns an array, not a boolean saying "there is something in there that is not finite". So for large arrays this is one of the issues.
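A small illustrative session (my addition, to make the point concrete):

```python
>>> import numpy as np
>>> x = np.array([1.0, np.inf, 3.0])
>>> np.isfinite(x)               # elementwise result, same size as the input
array([ True, False,  True])
>>> bool(np.isfinite(x).all())   # scalar answer, but the whole array was scanned
False
```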
-
Ah yes, this is one of the examples I have. So imagine you have a 5000x5000 array and the [3, 2] element is not equal to the [2, 3] element. This test scans all 5000**2 entries, even though a mismatch near the start already determines that it will return false in the end. This is what I mean by quick-exit hatches. These checks occasionally cost as much time as the actual algorithm itself, so we started injecting custom compiled code into SciPy. If we want to use less compiled code, the array API should provide similarly performant routines.
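A pure-NumPy approximation of such a quick-exit hatch, shown here for finiteness checking (my sketch, not SciPy's actual code; the function name and block size are arbitrary), scans in blocks so that a failure near the start avoids touching the rest of the array:

```python
import numpy as np

def all_finite(a, blocksize=65536):
    # Walk the flattened array block by block; bail out on the first
    # block that contains a non-finite entry instead of scanning it all.
    flat = np.ravel(a)
    for start in range(0, flat.size, blocksize):
        if not np.isfinite(flat[start:start + blocksize]).all():
            return False
    return True
```

This is still block-granular rather than element-granular, which is why a compiled loop (or a fused routine from the array library itself) can do strictly better.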
-
Ah, okay, got it. Basically, we're missing a fused `all(isfinite(x))` that can exit early.
-
Yes, exactly; that is what we are doing currently (see the SciPy link above).
-
I had a closer look at this:

```python
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
24.7 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.isfinite(x)
17.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
17.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

So `isfinite` itself dominates, and the `all` reduction adds comparatively little. Use of `allclose` is far more expensive:

```python
>>> %timeit np.allclose(x, 0.0)
2.92 ms ± 7.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

My feeling is that a gain of ~2x for fused functions is a nice-to-have, but not low-hanging fruit. Things like `allclose` look like the bigger offenders.
-
Ah, I should clarify: `np.allclose` is used only if the user wants an inexact comparison, that is, if entries in the lower and upper triangular parts are allowed to be off by a little (hence the check for whether `atol` or `rtol` is provided). Otherwise the actual test goes through the array with a double loop. Nice comparison anyway; here are my results for the code you provided above, just to give a sense of my machine's timings:

```python
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
17.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.isfinite(x)
13.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
13.7 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

I am going to use the symmetry test, since we have already implemented that in `scipy.linalg`:

```python
>>> A = np.random.rand(100, 100) + 50*np.eye(100)
>>> A = A + A.T
>>> %timeit np.all(A == A.T)
11 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit la.issymmetric(A)
7.07 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

OK, some benefit, but nothing to make noise about. But now I'll change the [1, 2] element:

```python
>>> A[1, 2] = 0
>>> %timeit la.issymmetric(A)
3.36 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.all(A == A.T)
11 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

So here, even though our SciPy implementation is not even close to cache-friendly, it still gets a 3-4x speedup, and that is mostly because we Cythonize the code without really exploiting NumPy's C-level access speed, which would easily take this to 10x levels. The trick is that it stops the test the moment a mismatch is found and does not carry on all the way to the end; in other words, one counterexample is enough. Similar things can be done with `isfinite`, nonzero-structure discovery, etc., with similar quick returns. When linalg functions start checking their arguments, these small costs pile up and cause a lot of performance loss, even if we manage to circumvent f2py and link directly to OpenBLAS, etc.
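For reference, the early-exit logic described above boils down to something like this (a plain-Python sketch of the idea only; the real `scipy.linalg.issymmetric` is Cythonized and typed):

```python
import numpy as np

def issymmetric_loop(a):
    # Compare each upper-triangle entry with its mirror and stop at the
    # first mismatch instead of materializing a full comparison array.
    n = a.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if a[i, j] != a[j, i]:
                return False  # one counterexample is enough
    return True
```

Interpreted Python makes this loop slow, which is exactly why SciPy compiles it; the point is the algorithmic shape (early exit), not the language.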
-
I would like to move the recent discussion from SciPy (scipy/scipy#19068) over here for a more specialized treatment.
In the current incarnation of the linalg functions, the most important input for dispatching to different low-level routines is the array structure. Eigenvalues, SVD, linear solve, Cholesky, LU, QR, and so on all benefit from additional structure to gain performance and reduce the workload.
To that end, one typically implements a function that is quite permissive, fast, and, more importantly, equipped with quick-exit hatches, to do an early O(n) investigation that discovers non-finite entries and triangular, Hessenberg, or banded structure, and reports back to the dispatcher.
These types of algorithms typically go by the name of polyalgorithms and are becoming quite standard. The most well-known is arguably MATLAB's backslash polyalgorithm, which automatically selects the right low-level implementation when the user writes `A\b`.

There is a stalled (mostly due to missing infrastructure and lack of maintainer time) attempt at this in SciPy (see scipy/scipy#12824 for more details). However, in this issue I'd like to point out that these discovery functions are missing from the array API, which is going to be relevant for any library that implements linalg functionality.
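As a concrete illustration, a toy dispatcher in that spirit might look like the following (my sketch, not SciPy's actual design; the structure checks and solver choices are deliberately simplistic):

```python
import numpy as np
from scipy import linalg as la

def solve_poly(A, b):
    # Cheap structure checks first, specialized solver second,
    # general LU as the fallback.
    if np.all(A == np.triu(A)):
        return la.solve_triangular(A, b, lower=False)   # upper triangular
    if np.all(A == np.tril(A)):
        return la.solve_triangular(A, b, lower=True)    # lower triangular
    if la.issymmetric(A):
        return la.solve(A, b, assume_a="sym")           # symmetric solver
    return la.solve(A, b)                               # general solver
```

Note that the `np.all(A == np.triu(A))` checks are exactly the kind of full-array, no-early-exit comparisons this discussion wants to avoid; quick-exit discovery routines would slot in there.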
The naming can be changed, but the missing functionality is mainly checks along these lines:

- discovering triangular, Hessenberg, and banded structure
- detecting symmetric/Hermitian arrays
- detecting all-zero arrays
- detecting non-finite (inf/NaN) entries
Most SciPy functions are crippled in terms of performance due to the lack of the last item, since `check_finite` is a massive performance eater, especially for the sizes this array API is considering. Similarly, checking for all zeros and other details could also be pushed to native low-level code for increased performance. The main discomfort we have as array consumers is that these functions typically don't quick-exit and consider all array entries, mostly using `np.all(A == 0)`-type shortcuts. This is pretty wasteful as the arrays get larger, and it has unconditional O(n**2) complexity even though probably already the first entry is nonzero.

Hence, whether as low-level underscored helpers or exposed in the public API, these are the functionalities we are hoping to get at in SciPy in the near future. Moreover, as we reduce our Fortran codebase, most of these will be carried over to compiled code for even more reduced overhead.
I'd like to get some feedback on what the array vendors think about this. In fact, if this came out of the box from the array vendors, that would be much better. Since I did not see any work towards these issues, I wanted to bring the discussion here.