performance issues on a single MPI process #789
Comments
Using the script above, but with
However, if I replace
Interesting that for numpy it doesn't make a difference. I'll think about how/if to simplify the
Some single-process test runs (all arrays
What is your data size for the benchmarks?
Hey Daniel, I'm running the exact same example above for different operations; the only difference is I'm setting
Okay, I was just curious. Or maybe we just don't use binary op in pow.
Hm, the difference I see is not just a few %; it's a factor of 3 slower than numpy and 10x slower than torch. Something else might be going on. @ClaudiaComito If you give me your code and tell me which versions of numpy, torch, heat etc. you use (or a conda env yaml file), I could generate the same data on one or two of our machines and also compare different Python envs.
I can't speak for Claudia, but I did a bit of testing, and with a minor rewrite of the pow function I was able to close the gap a bit:
original:
new:
These are measurements of the full operation. There is still some room for improvement, but there are some checks we have to perform at the top of these operations to make sure that everything is correct. If anyone wants to play around with the new code, you can find it on this branch: bug/789-pow-binary-op-performance. These optimizations will only take effect if one of the inputs is a DNDarray.
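For illustration only (this is not the code on the branch above): a minimal sketch of the kind of scalar fast path described here, where a plain-scalar exponent is applied directly to the process-local torch tensor instead of going through the general binary-op machinery. The helper name is hypothetical, and the use of larray to access the local tensor is an assumption about heat's internals that may differ between versions.

```python
# Illustrative sketch only -- NOT the code on bug/789-pow-binary-op-performance.
# Idea: when the exponent is a plain scalar, call torch.pow on the process-local
# tensor and re-wrap the result, skipping the generic binary-op sanitation.
# `pow_fast_path` is a hypothetical helper; `larray` as the local-tensor attribute
# is an assumption about heat's internals.
import numbers

import torch
import heat as ht


def pow_fast_path(x: ht.DNDarray, exponent):
    """Elementwise power of a DNDarray, short-circuiting the scalar-exponent case."""
    if isinstance(exponent, numbers.Number):
        local = torch.pow(x.larray, exponent)      # operate on the local chunk only
        return ht.array(local, is_split=x.split)   # re-wrap, keeping the split axis
    # Fall back to the general implementation for DNDarray ** DNDarray.
    return ht.pow(x, exponent)
```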
@fschlimb at last here's my code, which I ran on the current main branch, and my .yaml. Looking forward to your tests. Have a nice weekend!
@ClaudiaComito Thanks. For some reason the numbers look reasonable today, not only in the env that uses similar package versions as yours, but also in my old one (though I have updated HeAT). Not sure what happened. Let's see if it comes up again. Just FYI: the script did not report correct times because the
I also made some basic scalability experiments (2 nodes). By default, there is barely any speedup when adding another node. Something seems to be going wrong with OpenMP. In my case the default number of threads should be 4; if I set it to 4 on a single node, the performance is indeed similar to not setting OMP_NUM_THREADS. It scales well with OMP_NUM_THREADS set (to 1, 2 and 4) on each node, but not without. I see this when running MPI on one node as well as on two nodes. OpenMP scalability seems fine on a single node as well. Any idea?
The OpenMP issue is interesting. We do not do anything with threading and rely on torch to handle this for us. During some internal testing, we found that the optimum configuration for our testing cluster was 2 MPI processes, each with 12 OMP threads; however, this is likely unique to our system. Using this setup, I have not been able to reproduce the speedup issues when adding nodes. If the split axis is set in the example above, I found that (with and without the efficiency fixes done in #793) the calculation has a speedup proportional to the number of MPI processes.
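For anyone experimenting with the MPI-processes-versus-OpenMP-threads trade-off, here is a small sketch of pinning the per-rank thread count explicitly; the value 4 just mirrors the default mentioned above and is otherwise arbitrary.

```python
# Sketch: pin the intra-op thread count per MPI rank before benchmarking.
# The value 4 mirrors the default mentioned above and is otherwise arbitrary;
# adjust it to cores_per_node / ranks_per_node for your system.
import os

# Must be set before torch creates its thread pool, e.g. via
#   OMP_NUM_THREADS=4 mpirun -np 2 python bench.py
# or here at the very top of the script.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import torch
from mpi4py import MPI

torch.set_num_threads(4)  # intra-op parallelism used by torch (and hence by heat)
print(f"rank {MPI.COMM_WORLD.Get_rank()}: torch intra-op threads = {torch.get_num_threads()}")
```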
Description
Simple operations show a massive slowdown compared to numpy/torch when run on a single MPI process.
To Reproduce
Steps to reproduce the behavior:
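The original reproduction script is not preserved in this thread, so below is a minimal sketch (not the original code) of the kind of single-process comparison being discussed; the array size, dtype and repeat count are assumptions.

```python
# Minimal sketch of a single-process comparison of a ** 2 across numpy, torch and heat.
# Array size, dtype and repeat count are illustrative assumptions, not the original script.
import time

import numpy as np
import torch
import heat as ht

N = 10_000_000  # assumed problem size
REPEATS = 10    # assumed number of repetitions


def bench(label, a, square):
    t0 = time.perf_counter()
    for _ in range(REPEATS):
        square(a)
    print(f"{label}: {time.perf_counter() - t0:.4f} s")


bench("numpy", np.arange(N, dtype=np.float32), lambda a: a ** 2)
bench("torch", torch.arange(N, dtype=torch.float32), lambda a: a ** 2)
bench("heat", ht.arange(N, dtype=ht.float32), lambda a: a ** 2)
```

Run with something like `mpirun -np 1 python bench.py` to match the single-process case.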
Expected behavior
The times of all three packages should be similar.
Illustrative timings:
1 process heat: 2.7430931525974303
1 process torch: 1.3330379242979689
With a bit more digging I can see that this might be happening within the a ** 2 operation.
Version Info
Current master (184f112)
Additional comments
Thank you to @fschlimb for pointing this out.