Benchmark Tests for FNO and DeepONets #17

Open
wants to merge 9 commits into
base: main
from

Conversation

ayushinav
Contributor

@ayushinav ayushinav commented Jul 26, 2024

To fix #13

I ran into some CUDA issues while installing the torch CUDA toolkit.

@ayushinav
Contributor Author

ayushinav commented Jul 26, 2024

I guess a more appropriate place for this would be SciMLBenchmarks? @avik-pal

@avik-pal
Member

Yes, the CPU ones can go to SciMLBenchmarks. How are the Julia-native ones so slow? Did you run the profiler to see where the bottlenecks are?

@ayushinav
Contributor Author

For DeepONet, most of the time goes to the layers, if I understand correctly
image

For FNOs, I guess the FFTs are a bit expensive
image

@avik-pal
Member

Couple of things to check:

  1. Load MKL.jl before calling the neural networks
  2. Is the number of threads consistent between julia and python?
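
One way to keep the thread counts consistent is to launch both benchmark processes with the same thread-related environment variables. The sketch below is only an illustration, not part of this PR: the helper name `pinned_thread_env` is hypothetical, and it assumes the benchmarks are started as subprocesses. The variable names themselves (`JULIA_NUM_THREADS`, `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`) are the standard ones read by Julia, OpenMP, MKL, and OpenBLAS.

```python
import os

# Environment variables that common threading backends read at startup.
# JULIA_NUM_THREADS sizes Julia's thread pool; the others control
# OpenMP, MKL, and OpenBLAS, which back many numerical libraries.
THREAD_ENV_VARS = (
    "JULIA_NUM_THREADS",
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
)


def pinned_thread_env(n_threads, base_env=None):
    """Return a copy of the environment with every thread variable set to n_threads.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    for name in THREAD_ENV_VARS:
        env[name] = str(n_threads)
    return env


# Build a single-threaded environment to pass to both benchmark processes.
env = pinned_thread_env(1, base_env={})
print(sorted(env.items()))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting each benchmark script. Inside an already running process, the thread count may still need to be set explicitly, for example with `torch.set_num_threads` on the PyTorch side or `BLAS.set_num_threads` on the Julia side.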

@avik-pal
Member

Also that compile function might be compiling the torch model with dynamo. It is pretty much impossible to beat that with Lux running in eager mode. You could try and compile the DeepONet model with EnzymeAD/Reactant.jl#55 and see the performance.

@avik-pal
Member

Also, looking at the plots, you are on quite an old version of LuxLib. Update it, and that should address some of the performance issues.

@avik-pal
Member

@ayushinav can you install LuxDL/LuxLib.jl#111 and let me know how the performance is?

@avik-pal
Member

avik-pal commented Aug 3, 2024

I checked the recent lux releases. The current problems are

  1. __project isn't fast enough. This needs to be rewritten to use LoopVectorization on CPU
  2. batched_mul -- I am looking into this in LuxLib, this is quite easy to fix
  3. allocations are expensive. Nothing much we can do there honestly.

The current numbers for Lux on this PR are single-threaded, whereas PyTorch uses all cores by default.

@ChrisRackauckas
Member

Can we make this into a SciMLBenchmarks script? That will be easier to maintain in the long run.

We can make that support GPU

@avik-pal
Member

avik-pal commented Aug 4, 2024

Haven't looked into the FNO version much, but that will most likely need LuxDL/LuxLib.jl#118 for performance. To summarize the issue there:

  1. gelu and friends are surprisingly expensive operations, and fusing them into the fused_dense operations slows down the entire loop. But that is not true in general. For example, tanh/relu/abs, etc. can and should be fused into the main loop for performance. So we need to extend the current implementation with additional traits to see which activations can be fused for performance.
  2. But the main perf gain will come from better gradient ops -- even for gelu. Currently, we fail to use LoopVectorization in that case but if we hardcode some of the common cases the performance will significantly improve.

@avik-pal avik-pal force-pushed the bm_docs branch 2 times, most recently from dd5b486 to df49692 Compare August 4, 2024 21:59
@avik-pal
Member

avik-pal commented Aug 7, 2024

@ayushinav can we get this finished?

@avik-pal
Member

avik-pal commented Aug 8, 2024

@ayushinav bump

@ayushinav
Contributor Author

@avik-pal
Yes, I'm on this. I need to write the gradients for the new __projects using LoopVectorization. The projection becomes faster then, but there isn't a significant speedup for the overall network.

@avik-pal
Member

avik-pal commented Aug 8, 2024

Do that in a separate PR; let's get the benchmarks aligned first. The PyTorch ones seem to use a different size.

@ayushinav
Contributor Author

ayushinav commented Aug 9, 2024

The sizes are now aligned. The Python variant of DeepONet only supports 1 eval point (in the unaligned case), and the Flux variant doesn't support batching. To have the same input size, I set the batch size equal to the number of eval points so that the comparison works with both variants.

The Flux variant of FNO only supported a fixed length of kernels, which is fixed here.

The remaining difference in size is because Python uses (batch_size, N) whereas Julia uses (N, batch_size).
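
To make the layout difference concrete, here is a minimal sketch in plain Python (nested lists rather than tensors, so it is only an illustration and not the benchmark code). The helper names `shape` and `to_feature_first` are hypothetical. It shows that a batch-first array of shape (batch_size, N) and its feature-first counterpart of shape (N, batch_size) hold the same elements, just indexed in the opposite order.

```python
def shape(nested):
    """Return (rows, cols) of a rectangular list of lists."""
    return (len(nested), len(nested[0]))


def to_feature_first(batch_first):
    """Transpose rows of shape (batch_size, N) into shape (N, batch_size)."""
    return [list(column) for column in zip(*batch_first)]


# Two samples (batch_size = 2), each with three values (N = 3).
x_batch_first = [[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]]
x_feature_first = to_feature_first(x_batch_first)

# Element [b][n] in the batch-first layout is element [n][b] after the transpose.
print(shape(x_batch_first), shape(x_feature_first))
```

Because Julia arrays are column-major and Python tensors are typically row-major, the two conventions usually place the values of one sample contiguously in memory in both cases, so the reported shapes differ while the amount of data is the same.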

@avik-pal
Member

avik-pal commented Aug 9, 2024

I am guessing the overhead in FNO is currently from fft?

@ayushinav
Contributor Author

A fair share goes to fft and matmuladd, though the share of the latter has increased

image

@avik-pal
Member

avik-pal commented Aug 9, 2024

Can you also profile the backward pass for the FNO? I am surprised it is that bad

@ayushinav
Contributor Author

For MSE loss, I profiled the backward pass as gradient(ps -> loss(first(model(x, ps, st)), y), ps)

image

@avik-pal
Member

Using the permuted formulation, it is now all just FFT time in the forward and backward passes. It is quite surprising that our FFT is so much slower than PyTorch's.

@ayushinav might be worth giving https://github.com/Taaitaaiger/RustFFT.jl a shot and checking the performance on CPUs

@avik-pal avik-pal force-pushed the bm_docs branch 2 times, most recently from b613796 to c2439c6 Compare August 26, 2024 16:25
@avik-pal avik-pal force-pushed the bm_docs branch 2 times, most recently from 5450203 to f10c0fb Compare September 28, 2024 01:53
dependabot bot and others added 8 commits September 27, 2024 21:58
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.5...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[skip ci] [skip docs]
Successfully merging this pull request may close these issues.

Performance Benchmarks
3 participants