
performance issues on a single MPI process #789

Closed
coquelin77 opened this issue Jun 8, 2021 · 12 comments · Fixed by #1141
Labels: arithmetics, bug

Comments

@coquelin77
Member

Description
Simple operations show a massive slowdown versus numpy/torch when run on a single MPI process.

To Reproduce
Steps to reproduce the behavior:

  1. Which module/class/function is affected?
    • So far it seems to affect only the binary ops; however, it may affect other things as well.
  2. What are the circumstances under which the bug appears?
    • Single MPI process runs on the CPU show the behavior most clearly.
  3. What is the exact error message / erroneous behavior?
    • The runtimes of all of the packages should be very similar, but heat is much slower.

Expected behavior
The times of all three packages should be similar.

Illustrative example

  import heat as ht
  import numpy as np
  import time
  arrsize = 100000000
  times = []
  for _ in range(10):
      a = ht.ones(arrsize)#, split=0)
      b = ht.ones(arrsize)#, split=0)
      t1 = time.perf_counter()
      c = a * b + a ** 2 + b ** 2 + a - b
      t2 = time.perf_counter()
      times.append(t2 - t1)
  print(np.mean(times))

1 process heat: 2.7430931525974303
1 process torch: 1.3330379242979689
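
(For comparison, the torch timing above presumably comes from the analogous pure-torch loop; the exact torch script isn't shown here, so this is only a sketch:)

  import torch
  import numpy as np
  import time
  arrsize = 100000000
  times = []
  for _ in range(10):
      a = torch.ones(arrsize)
      b = torch.ones(arrsize)
      t1 = time.perf_counter()
      c = a * b + a ** 2 + b ** 2 + a - b
      t2 = time.perf_counter()
      times.append(t2 - t1)
  print(np.mean(times))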

With a bit more digging, I can see that this might be happening within the a ** 2 operation.

Version Info
Current master (184f112)

Additional comments
Thank you to @fschlimb for pointing this out

@ClaudiaComito ClaudiaComito self-assigned this Jun 9, 2021
@ClaudiaComito
Contributor

ClaudiaComito commented Jun 9, 2021

Torch binary operations seem to be much more efficient on scalars than on torch tensors. EDIT: torch.pow() seems to be a lot more efficient when the exponent is a scalar than when it is a tensor.
However, Heat always casts scalars to DNDarrays, hence internally Heat's binary ops are always between torch tensors.
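
A minimal standalone sketch of that scalar-vs-tensor exponent effect in plain torch (timings will of course vary by machine):

  import torch
  import time
  a = torch.ones(100000000)
  for exponent in (2, torch.tensor(2)):  # scalar exponent vs. tensor exponent
      t1 = time.perf_counter()
      c = torch.pow(a, exponent)
      t2 = time.perf_counter()
      print(type(exponent).__name__, t2 - t1)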

Using the script above but with c = a**2, on my laptop with mpirun -n 1 I get the following:

heat mean time:  0.8049912449333334
numpy mean time:  0.70067716212
torch mean time:  0.5644471529142856

However, if I replace a**2 with a**torch.tensor(2) (or a**np.array(2) for numpy) I get the following:

heat mean time:  0.814409041133333
numpy mean time:  0.7144210714799999
torch mean time:  0.7410358746

Interesting that for numpy it doesn't make a difference.

I'll think about how/if to simplify the __binary_op() checks and handling of scalars, or @coquelin77 do you want to take over?

@ClaudiaComito
Contributor

ClaudiaComito commented Jun 10, 2021

Some single-process test runs (all arrays float32). (EDIT: values corrected after fixing a mistake in the test code, see below; thanks @fschlimb.)

| Operation | Heat runtime (s) | NumPy runtime (s) | PyTorch runtime (s) | Heat/PyTorch |
| --- | --- | --- | --- | --- |
| a**2 | 0.84 | 0.24 | 0.24 | 3.5 |
| a**array(2) (*) | 0.83 | 0.24 | 0.83 | 1 |
| a*a | 0.23 | 0.23 | 0.23 | 1 |
| a*2 | 0.24 | 0.23 | 0.23 | 1 |
| a+b | 0.25 | 0.26 | 0.25 | 1 |
| a+2 | 0.26 | 0.23 | 0.24 | 1.08 |
| a*b + a**2 + b**2 + a-b | 2.87 | 0.91 | 1.70 | 1.69 |
| a*b + a*a + b*b + a-b | 1.81 | 0.92 | 1.70 | 1.06 |

(*) a**np.array(2), a**torch.tensor(2)

@coquelin77
Member Author

what is your data size for the benchmarks?

@ClaudiaComito
Contributor

> what is your data size for the benchmarks?

Hey Daniel, I'm running the exact same example above for different operations; the only difference is that I'm setting the dtype.

@coquelin77
Member Author

Okay, I was just curious. The ** case is very bad, but it is a direct call to the binary op, and I'm not sure why it would take so much longer since it should just call torch directly. The only way I can think of to really increase the speed would be to put a try-except block at the top of the binary op and attempt it without the isinstance checks or adjustments; another form of early out might work too.
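
A minimal sketch of that early-out idea (the function names and signature are stand-ins here, not Heat's actual __binary_op internals):

  import torch

  def binary_op_fast(operation, t1, t2, slow_path):
      # early out: try the torch-level call directly, skipping the
      # isinstance checks and dtype/shape adjustments
      try:
          return operation(t1, t2)
      except TypeError:
          # fall back to the full checked-and-casted path
          return slow_path(operation, t1, t2)

  # usage sketch with a trivial fallback that just casts the scalar
  a = torch.ones(10)
  print(binary_op_fast(torch.mul, a, 2, lambda op, x, y: op(x, torch.as_tensor(y))))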

@coquelin77
Member Author

Or maybe we just don't use the binary op in pow at all.
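
For instance, a scalar fast path that bypasses the binary-op machinery could look roughly like this (pow_sketch is a stand-in, not the actual Heat pow):

  import torch

  def pow_sketch(a, exponent):
      # if the exponent is a plain Python scalar, call torch.pow directly
      # instead of routing it through the generic binary-op machinery
      if isinstance(exponent, (int, float)):
          return torch.pow(a, exponent)
      # otherwise take the general tensor ** tensor route
      return torch.pow(a, torch.as_tensor(exponent))

  x = torch.ones(5)
  print(pow_sketch(x, 2))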

@fschlimb
Contributor

Hm, the difference I see is not just a few %, it's a factor of 3 slower than numpy and 10x slower than torch. Something else might be going on.

@ClaudiaComito If you give me your code and tell me which versions of numpy, torch, heat etc you use (or a conda env yaml file) I could generate the same data for one or 2 of our machines and also compare different python envs.

@coquelin77
Member Author

I can't speak for Claudia, but I did a bit of testing, and with a minor rewrite of the pow function I was able to close the gap a bit:

Original

heat 19.422133721411228
np 4.610380478855222
torch 5.885436344053597

New

heat 6.7831931237131355
np 4.600037970021367
torch 5.761834943573922

These are measurements of the full operation (c = a * b + a ** 2 + b ** 2 + a - b) with arrsize=1_000_000_000 to make the differences more obvious. This is with torch 1.7.0, numpy 1.19.1, and a branch off the current master for heat. I am not sure if this is the Intel numpy.

There is still some room for improvement, but there are some checks that we have to perform at the top of these operations to make sure that everything is correct.

If anyone wants to play around with the new code, you can find it on this branch: bug/789-pow-binary-op-performance

These optimizations will only take effect if one of the inputs is a DNDarray.

@ClaudiaComito
Contributor

> Hm, the difference I see is not just a few %, it's a factor of 3 slower than numpy and 10x slower than torch. Something else might be going on.
>
> @ClaudiaComito If you give me your code and tell me which versions of numpy, torch, heat etc you use (or a conda env yaml file) I could generate the same data for one or 2 of our machines and also compare different python envs.

@fschlimb at last here's my code, which I ran on the current main branch, and my .yaml. Looking forward to your tests. Have a nice weekend!

binop_test.zip

@fschlimb
Contributor

@ClaudiaComito Thanks. For some reason the numbers look reasonable today, not only in the env that uses package versions similar to yours, but also in my old one (though I have updated HeAT). Not sure what happened. Let's see if it comes up again.

Just fyi: the script did not report correct times because the times list does not get reset between array implementations.
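
The fix is just to re-initialize the list for each library; the actual benchmark script is in the zip above, so the loop structure here is only assumed:

  import time
  import numpy as np
  import torch
  import heat as ht

  arrsize = 100000000
  for name, ones in [("heat", ht.ones), ("numpy", np.ones), ("torch", torch.ones)]:
      times = []  # reset here, otherwise earlier libraries' timings leak into the mean
      for _ in range(10):
          a = ones(arrsize)
          b = ones(arrsize)
          t1 = time.perf_counter()
          c = a * b + a ** 2 + b ** 2 + a - b
          t2 = time.perf_counter()
          times.append(t2 - t1)
      print(name, "mean time:", np.mean(times))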

I also ran some basic scalability experiments (2 nodes). By default, there is barely any speedup when adding another node. Something seems to be going wrong with OpenMP. In my case the default numthreads should be 4; if I set it to 4 on a single node, the performance is indeed similar to not setting OMP_NUM_THREADS. It scales well with OMP_NUM_THREADS set (to 1, 2, and 4) on each node, but not without. I see this when running MPI on one as well as on two nodes. OpenMP scalability seems fine on a single node as well. Any idea?
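
For what it's worth, a quick way to check what each rank actually sees (with Open MPI the variable can be exported via -x; other launchers spell this differently):

  # e.g.: mpirun -n 2 -x OMP_NUM_THREADS=4 python check_threads.py
  import os
  import torch

  print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
  print("torch intra-op threads =", torch.get_num_threads())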

@coquelin77
Member Author

The OpenMP issue is interesting. We do not do anything with threading and rely on torch to handle this for us. During some internal testing, we found that the optimum configuration for our testing cluster was 2 MPI processes, each with 12 OMP threads; however, this is likely unique to our system.

Using this setup, I have not been able to reproduce the speedup issues when adding nodes. If the split axis is set in the example above, I found that (with and without the efficiency fixes done in #793) the calculation has a speedup proportional to the number of MPI processes. If split=None, then the calculation is run on all processes concurrently and no speedup is intended.
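
For reference, the difference is just the commented-out split argument in the example at the top; a minimal sketch:

  import heat as ht

  arrsize = 100000000
  # split=0 distributes the first axis across the MPI processes, so the
  # elementwise work is divided among them
  a = ht.ones(arrsize, split=0)
  b = ht.ones(arrsize, split=0)
  # with split=None (the default) every process holds a full copy and does
  # all of the work itself, so no speedup is expected from adding processes
  c = a * b + a ** 2 + b ** 2 + a - b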

@ClaudiaComito
Contributor

ClaudiaComito commented Apr 26, 2023

Performance improvement in (upcoming) version 1.3. Speedup w.r.t. torch.pow() (1 MPI process).

[Attached plots: pow, pow_zoomin]
