
Performance loss without export OMP_NUM_THREADS=1 #596

Open
remy-abergel opened this issue Dec 2, 2024 · 16 comments

@remy-abergel

remy-abergel commented Dec 2, 2024

Hi,

I am doing some benchmarks on a new laptop and I have isolated some particular settings leading to significant performance losses depending on whether export OMP_NUM_THREADS=1 is set before launching Python. Could you please try to reproduce the following?

Install FINUFFT in a fresh virtual environment using pip

# create fresh environment
python3 -m venv ~/.venv/finufft
source ~/.venv/finufft/bin/activate

# install finufft
pip install finufft   
pip install packaging # needed otherwise `import finufft` will fail (this is a bug right?)

Notice the comment mentioning a potential missing dependency related to the finufft installation with pip (this is not really what brought me here, though).

Benchmarking issue

Open a Python console from the terminal and copy-paste the following.

import numpy as np
import time
import finufft

M = 50000
N1, N2, N3 = 50, 25, 25
dtype = np.float32
nrepeat = 10

# test nufft3d1
for id in range(nrepeat):
    
    # generate nodes
    x = np.linspace(-np.pi, 0., M, dtype=dtype)
    y = np.linspace(-np.pi, np.pi, M, dtype=dtype)
    z = np.linspace(-np.pi, np.pi, M, dtype=dtype)
    c_re = np.linspace(-1, 1, M, dtype=dtype)
    c_im = np.linspace(-1, 1, M, dtype=dtype)
    c = c_re + 1J * c_im
    
    # compute 3D NUFFT (type 1)
    start_cpu = time.perf_counter()
    out = finufft.nufft3d1(x, y, z, c, n_modes=(N1, N2, N3), out=None, eps=1e-5)
    end_cpu = time.perf_counter()
    etime_ms = (end_cpu - start_cpu) * 1000.
    print("pass %d : elapsed time = %g ms" % (id, etime_ms))

Here is what I get (it is very slow except for pass 1 in this run; the observed execution time can change quite a lot from one pass to another):

pass 0 : elapsed time = 7868.64 ms
pass 1 : elapsed time = 17.4045 ms
pass 2 : elapsed time = 7656.45 ms
pass 3 : elapsed time = 7750.34 ms
pass 4 : elapsed time = 7584.46 ms
pass 5 : elapsed time = 7802.8 ms
pass 6 : elapsed time = 7915.14 ms
pass 7 : elapsed time = 2921.09 ms
pass 8 : elapsed time = 3109.38 ms
pass 9 : elapsed time = 6330.15 ms

Now open a terminal and type

export OMP_NUM_THREADS=1
python

then copy-paste the benchmarking code above again. Here is what I get (much faster and more repeatable execution times):

pass 0 : elapsed time = 4.60164 ms
pass 1 : elapsed time = 4.32843 ms
pass 2 : elapsed time = 4.23474 ms
pass 3 : elapsed time = 4.25191 ms
pass 4 : elapsed time = 4.6779 ms
pass 5 : elapsed time = 5.35747 ms
pass 6 : elapsed time = 5.08332 ms
pass 7 : elapsed time = 4.82096 ms
pass 8 : elapsed time = 5.05432 ms
pass 9 : elapsed time = 4.17692 ms

Additional comments

  • Without export OMP_NUM_THREADS=1, changing dtype = np.float32 into dtype = np.float64 drops the computation time to roughly 40 ms per pass.
  • The observed behavior seems dependent on the choice of (N1, N2, N3): for instance, N1 = N2 = N3 = 50 leads to roughly 40 ms per pass with dtype = np.float32 and without export OMP_NUM_THREADS=1 (a sketch of such a sweep is given below).
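
For reference, a small sketch of such a sweep, reusing the benchmark code above (it simply varies dtype and the mode sizes; run it with and without export OMP_NUM_THREADS=1):

import numpy as np
import time
import finufft

M = 50000
for dtype in (np.float32, np.float64):
    for n in (25, 50):
        # same synthetic nodes and strengths as in the benchmark above
        x = np.linspace(-np.pi, 0., M, dtype=dtype)
        y = np.linspace(-np.pi, np.pi, M, dtype=dtype)
        z = np.linspace(-np.pi, np.pi, M, dtype=dtype)
        c = np.linspace(-1, 1, M, dtype=dtype) + 1J * np.linspace(-1, 1, M, dtype=dtype)
        start = time.perf_counter()
        finufft.nufft3d1(x, y, z, c, n_modes=(n, n, n), eps=1e-5)
        etime_ms = (time.perf_counter() - start) * 1000.
        print("dtype=%s, n_modes=(%d, %d, %d) : %g ms" % (dtype.__name__, n, n, n, etime_ms))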

Environment

  • OS: Ubuntu 24.04.1 LTS
  • Python: 3.12.3
  • FINUFFT: 2.3.0
  • CPU Type: Intel Core i9-13950H (24 cores with HT, 36 MB cache)
  • GPU: NVIDIA RTX 4000 Ada 12 GB GDDR6
@lu1and10
Member

lu1and10 commented Dec 2, 2024

Hi Remy,

Thanks for the investigation.
PR #586 might already fix the packaging issue; the fix should be in the next release.

I get a slowdown with HT on. I have 12 physical CPU cores; with OMP_NUM_THREADS=12 it's good.
Could you try different values of OMP_NUM_THREADS, say 2, 4, 8, 12, 24, etc. (see the sketch below)?

Not sure how OMP threads behave on the Intel Core i9-13950H; it seems to have 8 performance cores and 16 efficient cores.
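
For example, a quick sketch of such a sweep (assuming the benchmark above is saved as a hypothetical bench.py; each value needs a fresh process, since the OpenMP runtime reads OMP_NUM_THREADS when it starts):

import os
import subprocess
import sys

# run the hypothetical bench.py once per thread-count setting
for n in (1, 2, 4, 8, 12, 24, 32):
    env = dict(os.environ, OMP_NUM_THREADS=str(n))
    print("--- OMP_NUM_THREADS =", n)
    subprocess.run([sys.executable, "bench.py"], env=env, check=True)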

@remy-abergel
Author

Thank you very much for your feedback. Here is what I get on my system:

OMP_NUM_THREADS   execution time
1 to 24           2 ms to 10 ms
25                4 ms to 24 ms
26                9 ms to 17 ms
27                9 ms to 17 ms
28                9 ms to 649 ms
29                10 ms to 4387 ms

I see no significant difference between OMP_NUM_THREADS <= 8 and 9 <= OMP_NUM_THREADS <= 24.

Should I try to install FINUFFT from the GitHub sources until the next release?

PS: is packaging a missing dependency for pip installation? (see comment in the first post)

@lu1and10
Member

lu1and10 commented Dec 2, 2024

Should I try to install FINUFFT from the GitHub sources until the next release?

I'm not sure whether the current master fixes the OMP hyper-threading slowdown. Could you try installing finufft==2.2.0 to see if you get the slowdown with threads >= 28? You could also install finufft from the GitHub master branch source to see if you still get the slowdown. If finufft==2.2.0, finufft==2.3.0, and the GitHub master branch all give you a similar slowdown with the default (hyper-threading) number of OMP threads, then for now you may need to disable hyper-threading and choose your best number of OMP threads (maybe just set the OMP thread count to the number of physical cores, say 24 in your case).

PS: is packaging a missing dependency for pip installation? (see comment in the first post)

PR #586 might already fix the packaging issue, see commit 53515f7; it's not in 2.3.0 and should be in the next 2.3.1.

@remy-abergel
Author

remy-abergel commented Dec 2, 2024

PR #586 might already fix the packaging issue, see commit 53515f7; it's not in 2.3.0 and should be in the next 2.3.1.

Sorry, I misunderstood your first post; I thought this PR was related to an upcoming fix for the OMP threads issue.

I'm not sure whether the current master fixes the OMP hyper-threading slowdown. Could you try installing finufft==2.2.0 to see if you get the slowdown with threads >= 28? You could also install finufft from the GitHub master branch source to see if you still get the slowdown.

I still observe the OMP slowdown issue with finufft==2.2.0 and finufft==2.3.0. I tried to install from the GitHub master branch with

pip install .

from the python/finufft directory of the repo; the installation works but the finufft import from the Python console fails, so I could not test further:

>>> import finufft
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/remy/softwares/finufft/python/finufft/finufft/__init__.py", line 15, in <module>
    from finufft._interfaces import Plan
  File "/home/remy/softwares/finufft/python/finufft/finufft/_interfaces.py", line 20, in <module>
    import finufft._finufft as _finufft
  File "/home/remy/softwares/finufft/python/finufft/finufft/_finufft.py", line 65, in <module>
    raise ImportError('Failed to find a suitable finufft library. '
ImportError: Failed to find a suitable finufft library. Please check your installation, finufft does not seem to be installed correctly.

If finufft==2.2.0, finufft==2.3.0, and the GitHub master branch all give you a similar slowdown with the default (hyper-threading) number of OMP threads, then for now you may need to disable hyper-threading and choose your best number of OMP threads (maybe just set the OMP thread count to the number of physical cores, say 24 in your case).

Disabling hyper-threading and/or manually setting the OMP_NUM_THREADS variable is OK for me, but I am using finufft as a dependency of a package that I developed (see here), so I am looking for a more automatic way to do that until a fix is available. I am thinking about manually setting OMP_NUM_THREADS by adding

# temporary fix for FINUFFT issue #596: force OMP_NUM_THREADS=1
# (this must run before finufft is imported, since the OpenMP runtime
# reads the variable when it initializes)
import os
os.environ["OMP_NUM_THREADS"] = "1"

at the beginning of the __init__.py file of my package. I just tested it and it seems to work. I can also improve this by automatically setting OMP_NUM_THREADS to the number of physical cores using psutil, which seems platform independent:

# temporary fix for FINUFFT issue #596: force OMP_NUM_THREADS = number of physical cores
import os
import psutil
os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=False))

Do you have a better idea?

@mreineck
Collaborator

mreineck commented Dec 2, 2024

You should be able to pass an nthreads argument to all functions of the Finufft Python interface. Perhaps that could solve the issue more elegantly?
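
For example, a minimal self-contained sketch based on the benchmark above (here nthreads=1 caps the thread count for that single call):

import numpy as np
import finufft

M = 50000
x = np.linspace(-np.pi, 0., M, dtype=np.float32)
y = np.linspace(-np.pi, np.pi, M, dtype=np.float32)
z = np.linspace(-np.pi, np.pi, M, dtype=np.float32)
c = (np.linspace(-1, 1, M) + 1J * np.linspace(-1, 1, M)).astype(np.complex64)
# nthreads=1 overrides the default thread count for this call only
out = finufft.nufft3d1(x, y, z, c, n_modes=(50, 25, 25), eps=1e-5, nthreads=1)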

One small remark: the test case you are showing is quite small, so it's perhaps not surprising that it doesn't show a big speedup when increasing the number of threads from 1 to 24. Multithreading can only really show benefits for transforms with larger grids / larger numbers of nonuniform points.

@lu1and10
Member

lu1and10 commented Dec 2, 2024

You should be able to pass an nthreads argument to all functions of the Finufft Python interface. Perhaps that could solve the issue more elegantly?

Yes, agreed. The Python interface kwargs should support nthreads.

@lu1and10
Member

lu1and10 commented Dec 2, 2024

from the python/finufft directory of the repo; the installation works but the finufft import from the Python console fails, so I could not test further:

I tried pip install python/finufft from the finufft root directory; it works for me.

@remy-abergel
Author

I tried pip install python/finufft from the finufft root directory; it works for me.

In my case the installation works fine but the import raises the error mentioned above.

You should be able to pass an nthreads argument to all functions of the Finufft Python interface. Perhaps that could solve the issue more elegantly?

One small remark: the test case you are showing is quite small, so it's perhaps not surprising that it doesn't show a big speedup when increasing the number of threads from 1 to 24. Multithreading can only really show benefits for transforms with larger grids / larger numbers of nonuniform points.

Yes, agreed. The Python interface kwargs should support nthreads.

Thank you very much for this suggestion. I will use a decorator to change the default value of the nthreads keyword argument of the finufft functions according to the number of physical cores (or the OMP_NUM_THREADS environment variable if set); a sketch is given below.
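
For reference, here is a minimal sketch of the decorator I have in mind (the names with_default_nthreads and _default_nthreads are just illustrative):

import functools
import os

import psutil
import finufft

def _default_nthreads():
    # honor OMP_NUM_THREADS if the user set it, otherwise use physical cores
    env = os.environ.get("OMP_NUM_THREADS")
    return int(env) if env else psutil.cpu_count(logical=False)

def with_default_nthreads(func):
    # inject a default nthreads without overriding an explicit caller choice
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        kwargs.setdefault("nthreads", _default_nthreads())
        return func(*args, **kwargs)
    return wrapper

# wrapped versions to be exposed by my package instead of the raw functions
nufft3d1 = with_default_nthreads(finufft.nufft3d1)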

Please let me know when a fix is done in a future release so that I can remove my temporary fix.

Thank you again.

@lu1and10
Member

lu1and10 commented Dec 2, 2024

In my case the installation works fine but the import raises the error mentioned above.

Most probably it's because you are in the python/finufft directory; Python tries to load the finufft package from the current working directory.
Could you switch to another directory, try to import finufft, and see what happens?

@remy-abergel
Author

Most probably it's because you are in the python/finufft directory; Python tries to load the finufft package from the current working directory.

You were right, thank you! I could install from the GitHub master branch. I still observe a slight slowdown with OMP_NUM_THREADS > 24 (~40 ms / pass) but this is much better than what I described above with finufft v2.3.0 (~8000 ms / pass). My temporary fix still leads to better performance for the moment.

@lu1and10
Member

lu1and10 commented Dec 2, 2024

In my case the installation works fine but the import raises the error mentioned above.

Most probably it's because you are in the python/finufft directory; Python tries to load the finufft package from the current working directory. Could you switch to another directory, try to import finufft, and see what happens?

Glad the master branch slowdown on hyper-threading is alleviated; I'm not sure what the improvement is, though. Most of the recent CPU-side work is @mreineck's code templatizing and cleanup. Maybe @mreineck also improved the OpenMP scheduling?

@DiamonDinoia
Collaborator

It could be that it is using a different, more recent OpenMP library. Also, compiling from source enables -march=native, which can further tailor the code to the system used. As per the docs, we recommend installing from source for best performance :)

@DiamonDinoia
Collaborator

@lu1and10 it seems that even in the CPU code, now that everything is vectorized, we are memory bound. Hence, hyper-threading might actually impair performance as it stresses the memory controller more. It might be worth using only physical cores with OpenMP?

Also, we might need to tweak the auto-nthreads heuristic, as it seems too aggressive on the number of threads.

@lu1and10
Member

lu1and10 commented Dec 4, 2024

@lu1and10 it seems that even in the CPU code, now that everything is vectorized, we are memory bound. Hence, hyper-threading might actually impair performance as it stresses the memory controller more. It might be worth using only physical cores with OpenMP?

Also, we might need to tweak the auto-nthreads heuristic, as it seems too aggressive on the number of threads.

Agreed. I wonder for what problem sizes hyper-threading actually helps; if none, maybe simply defaulting to the number of physical cores is fine.

@mreineck
Collaborator

mreineck commented Dec 4, 2024

I'm not aware of changing anything related to multithreading, so it's a pleasant surprise that things are working better on the master branch :-)
Concerning whether we are memory-bound: I think this is only really the case for very accurate / high-dimensional transforms. In 1D we should definitely be CPU-bound, in 3D we are probably memory-bound, in 2D I don't know.

@remy-abergel
Author

I'm not aware of changing anything related to multithreading, so it's a pleasant surprise that things are working better on the master branch :-)

Indeed, it's a good surprise :-) However, I found that the issue reported above is actually very dependent on the values of N1, N2, N3, M, dtype and the machine used for the test. I will run a bunch of tests (without my temporary fix) and give you feedback soon.
