Elementwise numerical integration #155
It is a compelling solution to achieve some speed-ups; however, I do not see how I can skip the Python for loop using it.
Ah ok, I think I am starting to see what you mean. So, the problem is a bit outside of the classic domain where you would look for speedups. From the integration method's point of view, at the moment, I could see easy speedups if:
- the integration domain were the same for all elements (the grid could be reused), or
- the integrand were the same (the function evaluations could be reused).
If neither condition is true, these are in the end different integrations without overlap, so the optimization would require vectorizing by basically having one big tensor over which we compute the different integration grids, function evaluations, and integrals. Probably one can save some memory by rescaling the integration grid onto the different domains, but the function evaluations still need to be computed & stored and, finally, the integration rule applied. Practically, the easiest way would probably be to have a version of https://github.com/esa/torchquad/blob/main/torchquad/integration/newton_cotes.py , say 'newton_cotes_multiple_integrands' or such, that handles scaling the grid to the different domains and calls 'apply_composite_rule' on the relevant parts of the big tensor for all integrands. I am not 100% sure how much speedup this will give, but it should at least be some in case the overhead of creating the grid dominates the for loop. Currently, however, this is not implemented. Would you, @sebastienwood, be interested in implementing this by any chance?
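To make this concrete, a rough sketch of such a routine for the 1-D trapezoid case could look as follows; the function name, signature, and shapes are my assumptions, not existing torchquad API:

```python
import torch

def trapezoid_multiple_integrands(fns, domains, N):
    # fns: list of K callables; domains: (K, 2) tensor of 1-D domains.
    # One shared reference grid, rescaled onto every domain.
    ref = torch.linspace(0.0, 1.0, N)
    grids = domains[:, :1] + (domains[:, 1:] - domains[:, :1]) * ref  # (K, N)
    # The function evaluations still have to be computed and stored...
    fvals = torch.stack([fn(g) for fn, g in zip(fns, grids)])  # (K, N)
    # ...but the composite trapezoid rule is applied to the whole tensor at once.
    hs = (domains[:, 1] - domains[:, 0]) / (N - 1)
    return hs / 2.0 * (fvals[:, :-1] + fvals[:, 1:]).sum(dim=-1)
```

The evaluation loop over fns remains in Python here; if all integrands shared the same callable with different parameters, that stack could be vectorized away as well.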
Hello, I would also be interested. Concretely, I want to be able to replicate what one can do with:

```python
import torch
import numpy as np
from torchquad import set_up_backend, Simpson
import torchquad

torchquad._deployment_test()
set_up_backend("torch", data_type="float32")  # data_type value assumed
torchquad.set_log_level("ERROR")

grid_np = np.array([np.arange(15) for i in range(10)])
grid_torch = torch.from_numpy(grid_np).cuda()
flattened_grid_torch = grid_torch.flatten()

def f_torch(t, p):
    return torch.exp(-t) * p

def f_np(t):
    return np.einsum("i,jk->ijk", np.exp(-t), grid_np).transpose(1, 2, 0)

def torchquad_integral(try_reuse_sample_points):
    integrator = Simpson()
    dim = 1
    N = 37
    domain = torch.Tensor([[0, 10]])
    if try_reuse_sample_points:
        grid_points, hs, n_per_dim = integrator.calculate_grid(N, domain)
        integrands = [
            integrator.evaluate_integrand(lambda s: f_torch(s, p), grid_points)[0]
            for p in flattened_grid_torch
        ]
        integral_value = torch.tensor(
            [integrator.calculate_result(int_val, dim, n_per_dim, hs) for int_val in integrands]
        )
        return integral_value.reshape(grid_torch.shape)
    integrate_jit_compiled_parts = integrator.get_jit_compiled_integrate(
        dim, N, backend="torch"
    )
    integral_value = torch.tensor(
        [integrate_jit_compiled_parts(lambda s: f_torch(s, p), domain) for p in flattened_grid_torch]
    )
    return integral_value.reshape(grid_torch.shape)
```

I would be willing to make a PR if I could have some guidance and you think that it's possible to beat scipy.
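For reference, the scipy baseline being compared against is elided above; presumably it uses the otherwise-unused f_np roughly like this (my guess, not the original code):

```python
from scipy.integrate import simpson

t = np.linspace(0, 10, 37)
# f_np(t) has shape (10, 15, 37): integrate over t along the last axis
# for all 150 grid entries in a single vectorized call.
scipy_result = simpson(f_np(t), x=t, axis=-1)  # shape (10, 15)
```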
Apologies, accidentally threw a different integrator in there. Updated now.
@ilan-gold First, thanks for the feedback and willingness to help!
In general, yes, but it depends a bit on the scale of the different parameters, I think. The GPU will only benefit you if there is enough parallelism, so for small N or grids I don't think there is much hope, but for larger ones certainly.
Looks good to me! Performance-wise, the only way with the current version of torchquad where I could see more speedup happening is if you were to reformulate your integration problem into a higher-dimensional one and thus treat the grid of integrands as one big integrand (the points you are evaluating would basically become the integration grid in that dimension?). But this may not be possible in all applications.
That's great! So, the question is a bit how we set it up, I think. We could do something similar to what I described above. This would be pretty generic, but the speedup may be limited. Alternatively, maybe we could have some solver for composite integrals, allowing one to integrate g(f(x), x) or something like that (maybe rather g(f(x), y) for simplicity)? Btw., seeing that you are at TUM: part of torchquad was already developed as a master's thesis at TUM, supervised by Prof. Bungartz. He has a dual affiliation with math, so if you are interested in this, it could e.g. be an interdisciplinary project, guided research, or a master's thesis. (@FG-TUM may be able to give more details if of interest :) )
I always like me some open source :)
Just so I can make sure I understand:
I don't know here. My use-case is evaluating this integral over a pair of matrices to produce a single complex matrix, which I can then 2D inverse Fourier transform (to give you a sense of why I am interested in repeated integration on a "grid" in the integrand). So the integral needs to be computed once for each pair of (i,j) entries from the two matrices. I would need to think about whether I could reformulate this. It seems like the "bottleneck" is the fact that each of those integrals is currently computed one at a time in a Python loop.
Unfortunately this is for my thesis haha. But I am happy to help. I am curious if you know why the scipy implementation is so much faster.
To dig a bit deeper: torchquad makes heavy use of vectorization to attain speedups (see e.g. here for some more info).
The larger the tensors in these vectorized operations are, the better your scaling will be, as you will be increasingly compute-bound.
Uh, fun :D What are the dimensionalities of your matrices? So, the key point is here:
This is similar to problems faced in many applications, and the key point to remedy it would be to build an integrator that, for i <= K, j <= L, instead of running the above computation-heavy line K x L times, merges them into one computation. Another perspective on this, besides the composite integration I mentioned before, would be a more computer-science one: we are basically doing a "batch" integration. More concretely, let's take the trapezoid rule as an example for simplicity. Implementation-wise it could be, e.g., trapezoid_batch.py or such. What would need to change is here, where we apply the trapezoid rule; the code currently is

```python
for cur_dim in range(dim):
    cur_dim_areas = (
        hs[cur_dim] / 2.0 * (cur_dim_areas[..., 0:-1] + cur_dim_areas[..., 1:])
    )
    cur_dim_areas = anp.sum(cur_dim_areas, axis=dim - cur_dim - 1)
```

Now, off the top of my head, practically, it could become

```python
for cur_dim in range(dim - 1):
    cur_dim_areas = (
        hs[cur_dim] / 2.0 * (cur_dim_areas[..., 0:-1] + cur_dim_areas[..., 1:])
    )
    cur_dim_areas = anp.sum(cur_dim_areas, axis=dim - cur_dim - 1)
```

So the first dimension is never collapsed / summed over and instead then contains the values for your i,j (in a flattened way, for simplicity?). This should give a considerable speedup because, instead of performing the operation K x L times over small tensors, you do it once over a big one. Memory consumption would of course increase. Makes sense?
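A self-contained toy version of this batching idea for the 1-D case might look like the following sketch (assumed names, not torchquad code); the leading axis holds the flattened i,j values and is never summed over:

```python
import torch

def trapezoid_batch(fvals, h):
    # fvals: (batch, N) sample values, one row per integrand; h: step size.
    return h / 2.0 * (fvals[:, :-1] + fvals[:, 1:]).sum(dim=-1)

# Integrate exp(-t) * p over t in [0, 10] for 150 values of p at once.
N = 101
t = torch.linspace(0, 10, N)
p = torch.arange(150, dtype=torch.float32)
fvals = torch.exp(-t)[None, :] * p[:, None]      # shape (150, N)
result = trapezoid_batch(fvals, 10.0 / (N - 1))  # shape (150,)
```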
Haha, okay :). Which chair are you doing it at, if you don't mind me asking? Just curious.
I don't know what the scipy implementation does internally. But, in practice, in your example you are using a fairly low N=37 and dim=1, so there is not much to vectorize over. This would, however, change if we implement this, because then you vectorize over a K x L x 37 x dim tensor. See also https://github.com/esa/torchquad#performance (the continuous lines are GPU): from roughly 3e5 evaluated points on, the GPU starts outscaling the CPU.
Ok yes, this makes a lot of sense - thanks for the concrete code you point to. I will have a closer look. I also probably need a primer on numerical integration, so I will read the Wikipedia pages more closely. If you have any decent sources, I'd welcome them as well (something concise, ideally, but enough to give me the nuts and bolts). Is what you're getting at here that the solution might be simpler than the one you proposed previously? Just curious; I'm going to need to spend some time understanding things first anyway, since I really don't know much about numerical integration (but fortunately do have some experience with GPU programming).
I think mathematical biology is the answer to your question.
It depends. It could be small, like...
It also sounds like you're fairly optimistic this will yield a performance gain, so I think I will probably devote time to this in the near-ish future.
I also agree with your assessment, given what little I know, that the speedup would probably be large.
*Hopefully large relative to both scipy and the current state, not just the current state...
There are some nice YouTube tutorials. We can also have a quick call and I can walk you through the code if that would help.
I think it's quite similar, but I formulated it a bit more concretely now (it only covers the bullet point cases, though, which yours falls under). In general, if you think about it, torchquad only covers a subset of the possible integration problems one could solve, given that we only support n-cube integration domains (see https://github.com/sigma-py/quadpy for an impression of what one could implement), functions mapping to single real (or maybe complex, I forgot 🤔) values, etc. So there are, overall, many things one could implement. But, I think, the feature described above is eminently useful: it would solve your problem and a related one on another project I am working on (https://github.com/darioizzo/geodesyNets/, computing the gravitational potential many times for different points), and others have also asked me via email for this feature, so it's a good thing to go for.
Okay, yes, for these you should definitely see a huge speedup, given that instead of performing a small matrix addition 1e3 - 1e6 times, you will be performing one big one.
Great, I will get to work soon then. I think a call could be helpful once I give this an initial shot and do a bit more reading. Definitely in the new year! Thanks so much!
Perfect! Just ping me when you wanna have a call. Happy holidays until then!
@gomezzz Sorry, a related question, and a bit more open-ended. One thing I want to do is take repeated versions of this grid-integral over increasing time horizons, i.e. nested domains with the same starting point but different endpoints (but still uniformly over the matrix, so again different from, yet simpler than, the OP of this issue). Do you have any suggestions? Is https://torchquad.readthedocs.io/en/main/tutorial.html#speedups-for-repeated-quadrature the best way, do you think (once this feature is done)?
But the evaluated function doesn't change, right? (If it does, then you need to reevaluate most of it, I think.)
So, the matrix is 10x10 corresponding to a [0,1] function interval, and then 15x10 (or 15x15 or sth) for [0,1.5] over the same function?
@gomezzz no, not exactly, sorry I wasn't clear - more like a 10x10 matrix on the [0,1] interval and then a 10x10 on [0,1.5].
Ok, so no hope of keeping any matrix values, I guess. I think your best bet would then be something like:

```python
integrator = Boole()

# Precompute a grid on what you expect to be the maximum domain
integration_domain = [[0, 10]]
grid_points, hs, n_per_dim = integrator.calculate_grid(N, integration_domain)

# Sample the function one time on the large domain
function_values, _ = integrator.evaluate_integrand(integrand1, grid_points)

# Now subsample the function values for the subdomain and apply any
# other terms you are currently interested in
subdomain_function_values = ...  # you need to code this :P
subdomain_n_per_dim = ...  # get this from the subsampling above

# Compute the integral for that subdomain
integral1 = integrator.calculate_result(
    subdomain_function_values, dim, subdomain_n_per_dim, hs
)
```

You could get more speedup if you store more parts of the computations. It also depends on where your computational bottleneck is: whether the function evaluations are expensive, or the calculate_result at the end, etc. Helps?
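For one dimension, a hypothetical way to fill in the two placeholders could be the following (the shapes and the step handling are my assumptions, not tested torchquad usage): take the first k+1 grid points so the subgrid ends at the subdomain boundary while keeping Boole's point-count requirement (points per dimension ≡ 1 mod 4).

```python
import torch
from torchquad import Boole, set_up_backend

set_up_backend("torch")
integrator = Boole()
dim, N = 1, 401
b_max, b_sub = 10.0, 1.5  # largest expected domain and current subdomain

# Grid over the largest domain, computed once.
grid_points, hs, n_per_dim = integrator.calculate_grid(
    N, torch.tensor([[0.0, b_max]])
)
function_values, _ = integrator.evaluate_integrand(
    lambda t: torch.exp(-t[:, 0]), grid_points
)

# The first k + 1 points cover [0, b_max * k / (n_per_dim - 1)].
k = int(round(b_sub / b_max * (n_per_dim - 1)))
k -= k % 4  # keep (k + 1) % 4 == 1, as Boole's rule requires
subdomain_function_values = function_values[: k + 1]
# The point spacing is unchanged, so hs can be reused as-is.
integral1 = integrator.calculate_result(
    subdomain_function_values, dim, k + 1, hs
)
```

With N=401, b_sub=1.5 gives k=60, so the 61-point subgrid ends exactly at 1.5; for boundaries that don't align with grid points, the covered subdomain is only approximate.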
Right, the integrand should be fairly inexpensive (just some exponential sums/quotients), so this looks like the move. Thanks!
I got the Newton-Cotes to work, it seems. I will need to add some tests to be sure, but it looked good. It seems like there's a floor on how long things take - it's dependent on the size of the grid and the size of N.
This lines up right with what you said previously as well, @gomezzz.
Is there any rule of thumb here, @gomezzz? Like, when would...
Cool! Very nice! Feel free to open a (draft) PR as soon as you want any feedback. Maybe there are further optimization opportunities. ✌️ In general, N=37 is also quite small; not sure how much accuracy you need? You could also try whether Trapezoid, with a larger N chosen so that you get similar errors as with Simpson, gives you better scaling.
So, it all depends on devices and problem complexity.
And now the devices come in, because on CPU the vectorization doesn't matter as much here, I think. On GPU, if your device is quick, it may be memory-bound for small problems. If your integrand is something expensive containing sins, sqrts, etc., the evaluations may dominate.
For a deeper dive into the topic, also see the mentioned master's thesis: https://mediatum.ub.tum.de/604993?query=Hofmeier&show_id=1694577 :)
Yes, I will definitely need your opinion on some API stuff. I think I want to do the non-deterministic ones as well and then open one PR for both.
Hmmm, yes, I think I will have to look closer. I don't think accuracy is crazy important, certainly not to a ridiculous degree, especially since the function is basically constant over most of the bounds of integration. Thanks for the feedback!
Also, yes, it's possible my integrand is expensive, since it involves a sum/quotient of complex exponentials (which are maybe evaluated as sin/cos?), but I really don't know. So this sort of caching could be big, in theory.
Issue (how-to)
Problem Description
I need to perform numerical integration on each element of a tuple of tensors. The tuple holds the parameters of a normal distribution, and the integration domain can be determined from these as well. From my understanding, torchquad performs one integration at a time. I could repeatedly launch an integration in a for-loop fashion for each element of the tensor; that, however, runs counter to the practice of tensor libraries. Is there a way for me to do it properly?
Here is a gist with the code for the problem described: https://gist.github.com/sebastienwood/a8aafd4b4480e12c7dae3923d89d7a4c
Let me know if I can provide more information!
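For readers without access to the gist, the loop in question presumably looks something like this reconstruction (the integrand, domains, and shapes are assumed for illustration, not taken from the gist):

```python
import torch
from torchquad import Simpson

# Elementwise parameters of a normal distribution (assumed shapes).
mu = torch.randn(100)
sigma = torch.rand(100) + 0.1
integrator = Simpson()

# One integration per element: the Python for loop this issue asks to avoid.
results = torch.stack([
    integrator.integrate(
        lambda x, m=m, s=s: torch.exp(-0.5 * ((x[:, 0] - m) / s) ** 2),
        dim=1,
        N=101,
        integration_domain=[[float(m - 3 * s), float(m + 3 * s)]],
        backend="torch",
    )
    for m, s in zip(mu, sigma)
])
```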