
performance issues on a single MPI process #789

Closed
coquelin77 opened this issue Jun 8, 2021 · 12 comments · Fixed by #1141
Labels: arithmetics, bug

Comments

@coquelin77
Member

Description
Simple operations show a massive slowdown versus numpy/torch when run on a single MPI process.

To Reproduce
Steps to reproduce the behavior:

  1. Which module/class/function is affected?
    • So far it seems to affect only the binary ops; however, it may affect other things as well.
  2. What are the circumstances under which the bug appears?
    • Single MPI process runs on the CPU show the behavior most clearly.
  3. What is the exact error message / erroneous behavior?
    • The runtimes of all of the packages should be very similar, but heat is much slower.

Expected behavior
The times of all three packages should be similar.

Illustrative example

  import heat as ht
  import numpy as np
  import time
  arrsize = 100000000
  times = []
  for _ in range(10):
      a = ht.ones(arrsize)#, split=0)
      b = ht.ones(arrsize)#, split=0)
      t1 = time.perf_counter()
      c = a * b + a ** 2 + b ** 2 + a - b
      t2 = time.perf_counter()
      times.append(t2 - t1)
  print(np.mean(times))

1 process heat: 2.7430931525974303
1 process torch: 1.3330379242979689
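
(For comparison, the torch timing above presumably comes from the analogous pure-torch loop; the exact torch script isn't shown here, so this is only a sketch:)

  import torch
  import numpy as np
  import time
  arrsize = 100000000
  times = []
  for _ in range(10):
      a = torch.ones(arrsize)
      b = torch.ones(arrsize)
      t1 = time.perf_counter()
      c = a * b + a ** 2 + b ** 2 + a - b
      t2 = time.perf_counter()
      times.append(t2 - t1)
  print(np.mean(times))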

With a bit more digging, I can see that this might be happening within the a ** 2 operation.

Version Info
Current master (184f112)

Additional comments
Thank you to @fschlimb for pointing this out

@ClaudiaComito ClaudiaComito self-assigned this Jun 9, 2021
@ClaudiaComito
Contributor

ClaudiaComito commented Jun 9, 2021

Torch binary operations seem to be much more efficient on scalars than on torch tensors. EDIT: torch.pow() seems to be a lot more efficient when the exponent is a scalar than when it is a tensor.
However, Heat always casts scalars to DNDarrays, hence internally Heat's binary ops are always between torch tensors.
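
A minimal standalone sketch of that scalar-vs-tensor exponent effect in plain torch (timings will of course vary by machine):

  import torch
  import time
  a = torch.ones(100000000)
  for exponent in (2, torch.tensor(2)):  # scalar exponent vs. tensor exponent
      t1 = time.perf_counter()
      c = torch.pow(a, exponent)
      t2 = time.perf_counter()
      print(type(exponent).__name__, t2 - t1)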

Using the script above but with c = a**2, on my laptop with mpirun -n 1 I get the following:

heat mean time:  0.8049912449333334
numpy mean time:  0.70067716212
torch mean time:  0.5644471529142856

However, if I replace a**2 with a**torch.tensor(2) (or a**np.array(2) for numpy) I get the following:

heat mean time:  0.814409041133333
numpy mean time:  0.7144210714799999
torch mean time:  0.7410358746

Interesting that for numpy it doesn't make a difference.

I'll think about how/if to simplify the __binary_op() checks and handling of scalars, or @coquelin77 do you want to take over?

@ClaudiaComito
Contributor

ClaudiaComito commented Jun 10, 2021

Some single-process test runs (all arrays float32). (EDIT: values corrected after fixing a mistake in the test code, see below; thanks @fschlimb.)

| Operation | Heat runtime (s) | NumPy runtime (s) | PyTorch runtime (s) | Heat/PyTorch |
| --- | --- | --- | --- | --- |
| a**2 | 0.84 | 0.24 | 0.24 | 3.5 |
| a**array(2) (*) | 0.83 | 0.24 | 0.83 | 1 |
| a*a | 0.23 | 0.23 | 0.23 | 1 |
| a*2 | 0.24 | 0.23 | 0.23 | 1 |
| a+b | 0.25 | 0.26 | 0.25 | 1 |
| a+2 | 0.26 | 0.23 | 0.24 | 1.08 |
| a*b + a**2 + b**2 + a-b | 2.87 | 0.91 | 1.70 | 1.69 |
| a*b + a*a + b*b + a-b | 1.81 | 0.92 | 1.70 | 1.06 |

(*) a**np.array(2), a**torch.tensor(2)

@coquelin77
Member Author

what is your data size for the benchmarks?

@ClaudiaComito
Contributor

> what is your data size for the benchmarks?

Hey Daniel, I'm running the exact same example above for different operations; the only difference is that I'm setting the dtype.

@coquelin77
Member Author

Okay, I was just curious. The ** case is very bad, but it is a direct call to the binary op, and I'm not sure why it would take so much longer since it should just call torch directly. The only way I can think of to really increase the speed would be to put a try-except block at the top of the binary op and attempt it without the isinstance checks or adjustments; another form of early out might work too.
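
A minimal sketch of that early-out idea (the function names and signature are stand-ins here, not Heat's actual __binary_op internals):

  import torch

  def binary_op_fast(operation, t1, t2, slow_path):
      # early out: try the torch-level call directly, skipping the
      # isinstance checks and dtype/shape adjustments
      try:
          return operation(t1, t2)
      except TypeError:
          # fall back to the full checked-and-casted path
          return slow_path(operation, t1, t2)

  # usage sketch with a trivial fallback that just casts the scalar
  a = torch.ones(10)
  print(binary_op_fast(torch.mul, a, 2, lambda op, x, y: op(x, torch.as_tensor(y))))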

@coquelin77
Member Author

Or maybe we just don't use the binary op in pow at all.
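
For instance, a scalar fast path that bypasses the binary-op machinery could look roughly like this (pow_sketch is a stand-in, not the actual Heat pow):

  import torch

  def pow_sketch(a, exponent):
      # if the exponent is a plain Python scalar, call torch.pow directly
      # instead of routing it through the generic binary-op machinery
      if isinstance(exponent, (int, float)):
          return torch.pow(a, exponent)
      # otherwise take the general tensor ** tensor route
      return torch.pow(a, torch.as_tensor(exponent))

  x = torch.ones(5)
  print(pow_sketch(x, 2))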

@fschlimb
Contributor

Hm, the difference I see is not just a few %, it's a factor of 3 slower than numpy and 10x slower than torch. Something else might be going on.

@ClaudiaComito If you give me your code and tell me which versions of numpy, torch, heat etc you use (or a conda env yaml file) I could generate the same data for one or 2 of our machines and also compare different python envs.

@coquelin77
Member Author

I can't speak for Claudia, but I did a bit of testing, and with a minor rewrite of the pow function I was able to close the gap a bit:

Original

heat 19.422133721411228
np 4.610380478855222
torch 5.885436344053597

New

heat 6.7831931237131355
np 4.600037970021367
torch 5.761834943573922

These are measurements of the full operation (c = a * b + a ** 2 + b ** 2 + a - b) with arrsize=1_000_000_000 to make the differences more obvious. This is with torch 1.7.0, numpy 1.19.1, and a branch off the current master for heat. I am not sure if this is the Intel numpy.

There is still some room for improvement, but there are some checks that we have to perform at the top of these operations to make sure that everything is correct.

If anyone wants to play around with the new code, you can find it on this branch: bug/789-pow-binary-op-performance

These optimizations will only take effect if one of the inputs is a DNDarray.

@ClaudiaComito
Contributor

> Hm, the difference I see is not just a few %, it's a factor of 3 slower than numpy and 10x slower than torch. Something else might be going on.
>
> @ClaudiaComito If you give me your code and tell me which versions of numpy, torch, heat etc you use (or a conda env yaml file) I could generate the same data for one or 2 of our machines and also compare different python envs.

@fschlimb at last here's my code, which I ran on the current main branch, and my .yaml. Looking forward to your tests. Have a nice weekend!

binop_test.zip

@fschlimb
Contributor

@ClaudiaComito Thanks. For some reason the numbers look reasonable today, not only in the env that uses package versions similar to yours, but also in my old one (though I have updated HeAT). Not sure what happened. Let's see if it comes up again.

Just fyi: the script did not report correct times because the times list does not get reset between array implementations.
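
The fix is just to re-initialize the list for each library; the actual benchmark script is in the zip above, so the loop structure here is only assumed:

  import time
  import numpy as np
  import torch
  import heat as ht

  arrsize = 100000000
  for name, ones in [("heat", ht.ones), ("numpy", np.ones), ("torch", torch.ones)]:
      times = []  # reset here, otherwise earlier libraries' timings leak into the mean
      for _ in range(10):
          a = ones(arrsize)
          b = ones(arrsize)
          t1 = time.perf_counter()
          c = a * b + a ** 2 + b ** 2 + a - b
          t2 = time.perf_counter()
          times.append(t2 - t1)
      print(name, "mean time:", np.mean(times))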

I also ran some basic scalability experiments (2 nodes). By default, there is barely any speedup when adding another node. Something seems to be going wrong with OpenMP. In my case the default numthreads should be 4; if I set it to 4 on a single node, the performance is indeed similar to not setting OMP_NUM_THREADS. It scales well with OMP_NUM_THREADS set (to 1, 2, and 4) on each node, but not without. I see this when running MPI on one as well as on two nodes. OpenMP scalability seems fine on a single node as well. Any idea?
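
For what it's worth, a quick way to check what each rank actually sees (with Open MPI the variable can be exported via -x; other launchers spell this differently):

  # e.g.: mpirun -n 2 -x OMP_NUM_THREADS=4 python check_threads.py
  import os
  import torch

  print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
  print("torch intra-op threads =", torch.get_num_threads())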

@coquelin77
Member Author

The OpenMP issue is interesting. We do not do anything with threading and rely on torch to handle this for us. During some internal testing, we found that the optimum configuration for our testing cluster was 2 MPI processes, each with 12 OMP threads; however, this is likely unique to our system.

Using this setup, I have not been able to reproduce the speedup issues when adding nodes. If the split axis is set in the example above, I found that (with and without the efficiency fixes done in #793) the calculation has a speedup proportional to the number of MPI processes. If split=None, then the calculation is run on all processes concurrently and no speedup is intended.
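
For reference, the difference is just the commented-out split argument in the example at the top; a minimal sketch:

  import heat as ht

  arrsize = 100000000
  # split=0 distributes the first axis across the MPI processes, so the
  # elementwise work is divided among them
  a = ht.ones(arrsize, split=0)
  b = ht.ones(arrsize, split=0)
  # with split=None (the default) every process holds a full copy and does
  # all of the work itself, so no speedup is expected from adding processes
  c = a * b + a ** 2 + b ** 2 + a - b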

@ClaudiaComito
Contributor

ClaudiaComito commented Apr 26, 2023

Performance improvement in (upcoming) version 1.3. Speedup w.r.t. torch.pow() (1 MPI process).

[Attached plots: pow, pow_zoomin]
