
Implement GPU solver for reactor simulations using SUNDIALS 3.1 or higher interface #33

Open
bryanwweber opened this issue Feb 22, 2020 · 16 comments
Labels: feature-request (New feature request)

Comments

@bryanwweber (Member) commented Feb 22, 2020

Idea

The systems of equations representing reactor networks in Cantera are generally solved using SUNDIALS. Newer versions of SUNDIALS have a better interface for integrating with heterogeneous computing environments. This project will add GPU support to Cantera for reactor network integration.

Difficulty

Medium

Required Knowledge

C++, CUDA/OpenCL

Mentors

@kyleniemeyer

References

@skrsna commented Mar 10, 2020

Hello Bryan,
I have been exploring the use of automatic differentiation to solve systems of ODEs, and @rwest suggested posting about it here. Specifically, I'm interested in JAX (https://github.com/google/jax), which supports forward- and reverse-mode automatic differentiation. JAX has an experimental module for ODE integration similar to odeint or solve_ivp from scipy.integrate. However, I'm having some difficulty setting up the system, or using a Cantera Solution object as-is inside JAX, because of the way JAX handles external classes. This is just a heads-up that I'm exploring this side of the field, as I have little experience with SUNDIALS or C++. Another useful application of JAX is sensitivity or uncertainty analysis of large systems using autodiff instead of brute-force methods, by implementing some of the methods from this paper: https://arxiv.org/pdf/1812.01892.pdf

@bryanwweber (Member, Author) commented Mar 10, 2020

cc @kyleniemeyer

@skrsna commented Apr 16, 2020

Following up on my previous comment: implementing autodiff for Cantera through JAX turns out to be complex because of the way JAX traces (builds computational graphs for) custom Python classes, so I stopped spending time on that. However, I was able to compile Cantera against SUNDIALS 5.1.0 with CUDA support, and I'm starting to use NVECTOR_CUDA for the Reactor setup in Cantera. Feedback, suggestions, and comments are highly appreciated, since this is my first time dealing with a large C++ codebase.
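
For reference, here is a minimal sketch of what wiring CVODES to the CUDA vector module can look like. It is a hedged example assuming the SUNDIALS 5.x API (6.x adds a SUNContext argument to most constructors); the RHS body, problem size, and tolerances are placeholders rather than Cantera's actual ReactorNet code:

```cpp
#include <cvodes/cvodes.h>
#include <nvector/nvector_cuda.h>
#include <sunlinsol/sunlinsol_spgmr.h>

// Placeholder RHS: a real integration would evaluate the reactor governing
// equations here (e.g. by calling back into Cantera on the host for now).
static int rhs(realtype t, N_Vector y, N_Vector ydot, void* user_data) {
    N_VCopyFromDevice_Cuda(y);                        // device -> host
    realtype* yh  = N_VGetHostArrayPointer_Cuda(y);
    realtype* ydh = N_VGetHostArrayPointer_Cuda(ydot);
    // ... fill ydh from yh on the host ...
    (void)yh; (void)ydh;
    N_VCopyToDevice_Cuda(ydot);                       // host -> device
    return 0;
}

int main() {
    const sunindextype N = 100;            // placeholder state size
    N_Vector y = N_VNew_Cuda(N);           // solution data lives on the GPU
    realtype* yh = N_VGetHostArrayPointer_Cuda(y);
    for (sunindextype i = 0; i < N; i++) yh[i] = 1.0;
    N_VCopyToDevice_Cuda(y);

    void* mem = CVodeCreate(CV_BDF);       // error checks omitted for brevity
    CVodeInit(mem, rhs, 0.0, y);
    CVodeSStolerances(mem, 1e-9, 1e-15);

    // A matrix-free Krylov solver keeps the linear-solver vector ops on the GPU.
    SUNLinearSolver LS = SUNLinSol_SPGMR(y, PREC_NONE, 0);
    CVodeSetLinearSolver(mem, LS, NULL);

    realtype t;
    CVode(mem, 1.0e-3, y, &t, CV_NORMAL);

    SUNLinSolFree(LS);
    N_VDestroy(y);
    CVodeFree(&mem);
    return 0;
}
```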

@bryanwweber (Member, Author)

@skrsna Please feel free to make a pull request or provide a link to a branch on your fork.

@skrsna commented Apr 23, 2020

Hi @jiweiqi, can you please elaborate on what you meant by "learn reaction kinetic models"? Are you trying to apply Neural Ordinary Differential Equations (https://arxiv.org/abs/1806.07366) to kinetic models, or just using the PyTorch autograd backend to obtain Jacobians via automatic differentiation?

@skrsna commented Jun 19, 2020

Thanks @jiweiqi for the awesome ReacTorch package. We just released our code, JAX-reactor. It's similar to ReacTorch, but the backend is JAX, which provides JIT compilation, vectorization, parallelization, and automatic differentiation out of the box. I believe JAX's autodiff API is more mature than PyTorch's for scientific computing, and we can also write custom differentiation rules, which will be useful for sensitivity analysis and uncertainty quantification. Profiling the JAX code revealed that the bottleneck is copying data between the host and the device, since I'm using SciPy to solve the ODE system. I'm also looking into writing a BDF ODE solver in JAX to get around that. Feedback and suggestions are welcome.

@ischoegl (Member) commented Mar 19, 2021

@bryanwweber ... while the discussion here is definitely useful, it doesn't necessarily shed light on the actual topic, i.e. 'Implement GPU solver for reactor simulations using SUNDIALS 3.1 or higher interface'. If I understand correctly, both of the approaches advertised above bypass the Cantera core and mainly leverage YAML for alternative implementations (another reason for #83, I guess). I'd be quite interested to hear current thoughts on how GPU calculations could be implemented.

@bryanwweber (Member, Author)

@ischoegl I have no idea! I was just copying GSoC ideas from our wiki page into issues 😄

@speth (Member) commented Mar 20, 2021

I worked on an implementation of this a while back with @athlonshi, where we modified an old version of SUNDIALS to use MAGMA to factorize the Jacobian and do the linear solves on the GPU. It worked quite well for large mechanisms. If I recall correctly, we were getting something like a factor of 10 speedup for mechanisms with ~1000 species on just a normal consumer (gaming-targeted) GPU. I wouldn't expect the stiffness to affect the speedup, as you're using the same BDF method either way.

We didn't get much further than a proof-of-concept, because at the time I think SUNDIALS didn't have some of the flexibility for specifying the linear solver steps that it gained in more recent versions. I think that is the origin of the reference to "SUNDIALS 3.1" or higher in the initial description of this idea.
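
To illustrate the flexibility being referred to: with the SUNLinearSolver interface, a user-defined solver object can be attached to CVODES through CVodeSetLinearSolver, and its setup/solve steps can dispatch to a GPU library such as MAGMA. The skeleton below is a hedged sketch against the generic SUNDIALS 5.x interface (SUNLinSolNewEmpty gains a SUNContext argument in 6.x); MagmaContent and the commented factorization steps are illustrative placeholders, not actual Cantera code or SUNDIALS MAGMA bindings:

```cpp
#include <sundials/sundials_linearsolver.h>

// Hypothetical content struct: device copy of the Jacobian, pivots, MAGMA queue, ...
struct MagmaContent { /* ... */ };

static SUNLinearSolver_Type GetType(SUNLinearSolver S) { return SUNLINEARSOLVER_DIRECT; }
static int Initialize(SUNLinearSolver S) { return 0; }

static int Setup(SUNLinearSolver S, SUNMatrix A) {
    // Copy A to the device and LU-factorize it there (e.g. with a MAGMA
    // getrf-style routine); keep the factors in the content struct.
    return 0;
}

static int Solve(SUNLinearSolver S, SUNMatrix A, N_Vector x, N_Vector b,
                 realtype tol) {
    // Back-substitute on the device using the stored factors and write the
    // result into x (which could be an NVECTOR_CUDA vector).
    return 0;
}

static int Free(SUNLinearSolver S) {
    delete static_cast<MagmaContent*>(S->content);
    S->content = NULL;
    SUNLinSolFreeEmpty(S);
    return 0;
}

SUNLinearSolver MakeGpuDenseSolver() {
    SUNLinearSolver S  = SUNLinSolNewEmpty();
    S->content         = new MagmaContent{};
    S->ops->gettype    = GetType;
    S->ops->initialize = Initialize;
    S->ops->setup      = Setup;
    S->ops->solve      = Solve;
    S->ops->free       = Free;
    return S;   // attach with CVodeSetLinearSolver(cvode_mem, S, A)
}
```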

@dcmvdbekerom

Hi all, I was wondering what the status is of this issue - I'd be willing to contribute to this, though not as a GSoC participant.
I have experience with CUDA C++ and chemical kinetics.

@bryanwweber added the feature-request (New feature request) label and removed the gsoc-idea label on Mar 23, 2023
@bryanwweber (Member, Author)

@dcmvdbekerom Please feel free; as far as I know, the status update from two years ago represents the most recent progress on this issue.

@speth (Member) commented Mar 23, 2023

Hi @dcmvdbekerom, any contributions in this area would be greatly appreciated.

One significant thing that has changed since my last update is that we are now able to use sparse iterative solvers (GMRES) for the reactor network problem, thanks to the implementation of approximate Jacobians for certain reactor types, which can be used to construct a preconditioner. This has provided large speed-ups for big mechanisms on the problems where it can be applied (@anthony-walker is working on extending the set of problems where these can be used).

As a result, using the GPU just for dense factorization and a direct linear solve is probably much less of a clear benefit. MAGMA does appear to have implementations of algorithms for sparse matrices, including incomplete LU factorization, so I think we could still apply it to the same steps of the integration process. However, with the sparse solver, the linear algebra operations make up a smaller portion of the overall runtime than the evaluation of the chemical kinetics / governing equations, so to get the biggest benefits we may want to explore how to move more of those calculations onto the GPU.
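
As a concrete illustration of where such a factorization plugs in, here is a hedged sketch of the CVODES preconditioner hooks used together with the GMRES linear solver (SUNDIALS 5.x callback signatures). The callback bodies are placeholders for forming and factorizing M = I - gamma*J_approx from the approximate Jacobian, whether on the CPU or with a GPU routine such as an incomplete LU from MAGMA or cuSPARSE; this is not Cantera's actual preconditioner implementation:

```cpp
#include <cvodes/cvodes.h>
#include <sunlinsol/sunlinsol_spgmr.h>

// Called occasionally by CVODES: form and factorize M = I - gamma * J_approx,
// where J_approx is the approximate (sparse) Jacobian of the reactor system.
// A GPU incomplete-LU factorization could be performed here.
static int PSetup(realtype t, N_Vector y, N_Vector fy, booleantype jok,
                  booleantype* jcurPtr, realtype gamma, void* user_data) {
    if (!jok) {
        // ... rebuild the approximate Jacobian ...
        *jcurPtr = SUNTRUE;
    } else {
        *jcurPtr = SUNFALSE;
    }
    // ... factorize I - gamma * J_approx (CPU or GPU) ...
    return 0;
}

// Called on each GMRES iteration: solve M * z = r using the stored factors.
static int PSolve(realtype t, N_Vector y, N_Vector fy, N_Vector r, N_Vector z,
                  realtype gamma, realtype delta, int lr, void* user_data) {
    // ... apply the factorization; as a trivial stand-in, copy r into z ...
    N_VScale(1.0, r, z);
    return 0;
}

// Attach to an existing CVODES instance `mem` with state vector `y`.
void attach_preconditioned_gmres(void* mem, N_Vector y) {
    SUNLinearSolver LS = SUNLinSol_SPGMR(y, PREC_LEFT, 0);
    CVodeSetLinearSolver(mem, LS, NULL);
    CVodeSetPreconditioner(mem, PSetup, PSolve);
}
```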

If you have any questions, please feel free to ask here or in a "discussion" at https://github.com/Cantera/enhancements/discussions.

@speth (Member) commented Mar 23, 2023

I remembered that I had done a bit of profiling for the sparse preconditioned solver a while back. The results can be seen here: precon-profiling.pdf. This was for several ignition delay calculations with mechanisms of different sizes. The overall results mostly reflect the behavior on the largest mechanism (~7000 species). The key results are:

  • 65% of the total time is spent evaluating the reactor governing equations.
    • 75% of this time is spent on reaction rate evaluations
    • 14% of this time is spent on evaluating thermodynamic properties
  • 16% of the total time is spent on preconditioner setup
    • 75% of this time is spent on preconditioner factorization
  • 10% of the total time is spent on solves using the factorized preconditioner

Only these last two pieces will benefit from just applying MAGMA to the linear solver operations done by CVODES. If we want to see large benefits of using the GPU, we need to be able to move the evaluation of the reaction rates / net species production rates there.
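
To make that last point concrete, the kind of kernel involved might look like the sketch below: a reaction-parallel evaluation of Arrhenius rate constants and rates of progress, scattered into net species production rates with atomic adds. This is an illustrative CUDA sketch with a made-up dense data layout and irreversible mass-action kinetics only; it is not Cantera's kinetics implementation:

```cpp
#include <cuda_runtime.h>
#include <math.h>

// One thread per reaction: evaluate kf = A * T^b * exp(-Ea / (R*T)), form the
// forward rate of progress from reactant concentrations, and accumulate net
// species production rates. Dense nu arrays are an illustrative layout only.
__global__ void production_rates(int n_reactions, int n_species, double T,
                                 const double* __restrict__ conc,     // [n_species]
                                 const double* __restrict__ A,        // [n_reactions]
                                 const double* __restrict__ b,        // [n_reactions]
                                 const double* __restrict__ Ea,       // [n_reactions], J/kmol
                                 const double* __restrict__ nu_reac,  // [n_reactions * n_species]
                                 const double* __restrict__ nu_net,   // [n_reactions * n_species]
                                 double* __restrict__ wdot)           // [n_species], zeroed beforehand
{
    const double R = 8314.462618;  // J/(kmol*K)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_reactions) return;

    double kf = A[i] * pow(T, b[i]) * exp(-Ea[i] / (R * T));

    // Forward rate of progress (irreversible, mass-action only in this sketch)
    double rop = kf;
    for (int k = 0; k < n_species; k++) {
        double nu = nu_reac[i * n_species + k];
        if (nu != 0.0) rop *= pow(conc[k], nu);
    }

    // Scatter into the net production rates; double-precision atomicAdd
    // requires compute capability 6.0 or newer.
    for (int k = 0; k < n_species; k++) {
        double nu = nu_net[i * n_species + k];
        if (nu != 0.0) atomicAdd(&wdot[k], nu * rop);
    }
}

// Launch example (wdot zeroed first, e.g. with cudaMemset):
// production_rates<<<(n_reactions + 255) / 256, 256>>>(/* ... */);
```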

@dcmvdbekerom

Thanks for the warm welcome,

After looking a bit more into the current code and reading your comment, I fear I'm in way over my head.
Moreover, I keep forgetting that consumer GPUs, like the one I have, have only a few FP64 cores. In my other applications (spectroscopy) it was not a problem to use FP32, but for chemical kinetics that would probably be prohibitively inaccurate.

I'm still interested to see where this could go, but I realize now that the problem is much harder than I initially thought. I naively assumed that a 1D problem could be simulated by solving N reactors in parallel, each interacting only with its neighbors at every timestep. It seems it is not so simple, however.

I also have to get more familiar with Cantera itself - what would be a good place for some newbie questions?

@speth (Member) commented Mar 24, 2023

If you want to look at a 1D solver that does calculate the solution as you describe, you might look at Ember (https://github.com/speth/ember), which uses Strang splitting. That code is already parallelized across the different reactors using the Intel TBB library (on the CPU, not on the GPU).
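
For a rough idea of that structure, a Strang-split step parallelized across reactors (here with TBB on the CPU, in the spirit of what Ember does; this is a schematic, not Ember's actual code, and the types and helper functions are hypothetical) might look like:

```cpp
#include <tbb/parallel_for.h>
#include <vector>

// Hypothetical per-cell reactor state and sub-step routines.
struct CellState { /* T, Y_k, ... */ };
void integrate_chemistry(CellState& cell, double dt);               // stiff ODE per cell
void advance_transport(std::vector<CellState>& cells, double dt);   // diffusion/convection

// One Strang-split step: half transport, full chemistry (independent per cell,
// so it parallelizes trivially across reactors), then half transport again.
void strang_step(std::vector<CellState>& cells, double dt) {
    advance_transport(cells, 0.5 * dt);
    tbb::parallel_for(std::size_t(0), cells.size(), [&](std::size_t i) {
        integrate_chemistry(cells[i], dt);
    });
    advance_transport(cells, 0.5 * dt);
}
```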

For general Cantera questions, I'd suggest asking on the Users' Group, or for a discussion of potential new features, the "Discussions" section here in the Enhancements repo is also an option.

@speth (Member) commented Jul 31, 2024

Ginkgo looks like an interesting option for sparse systems, providing parallel implementations of factorization/solve algorithms for both CPU and GPU architectures. Recent versions of SUNDIALS (6.4 and newer) also provide some features for working with it.
