Inference performance decreases with the number of CPUs used. #5193
Comments
Depends on how much it increased. For a thread pool, the more threads you have, the higher the scheduling cost is.
I am doing batched inference with batch_size=64. Using plain torch on 12 CPUs takes 6 s, and this time increases when using fewer CPUs (which is expected). With onnxruntime, the computation takes 5 s on a single CPU but 12 s on 12 CPUs. If it's overhead, I find it strange that torch does not suffer from it.
Oh I see, then we probably parallelized some code that shouldn't be parallelized.
Do you have an update on this?
How many threads do you have in total? It sounds like you created many more threads than there are CPU cores. You can make 24 parallel calls to 'run' on separate threads, but you need to disable the multithreading inside onnxruntime first.
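For illustration, a minimal sketch of limiting onnxruntime's internal thread pools through the Python API so that your own parallel calls to run() don't oversubscribe the cores. The model path, input name, and shape below are placeholders, and if the installed wheel was built with OpenMP, the OMP_NUM_THREADS environment variable may control the intra-op threads instead of these options:

```python
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1   # threads used within a single operator
sess_options.inter_op_num_threads = 1   # threads used across independent operators
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# "model.onnx" and the input name/shape are placeholders for this sketch.
session = ort.InferenceSession("model.onnx", sess_options)
dummy = np.random.rand(1, 1, 64, 64, 64).astype(np.float32)
outputs = session.run(None, {"input": dummy})
```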
I have 24 threads on my machine. I am not using any multiprocessing outside of onnx; I am doing batched inference on a single model using onnxruntime. The odd behaviour is that using taskset to restrict the number of threads used by the Python process decreases the inference time! The fewer threads, the faster the inference. That is without touching the onnxruntime configuration at any point.

I did a second experiment where I do use a multithreading library to run inferences in parallel: each worker uses onnxruntime on a small batch of images. Doing this yields the best performance for the inference on my 64 images. (In this configuration, I set OMP_NUM_THREADS to total_number_of_threads / number_of_parallel_threads to prevent oversubscribing the CPUs.) It seems to me that batching the inference into a single onnxruntime call should be faster than this manual parallelization. That's what I'm used to with PyTorch.
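A rough sketch of what that second experiment might look like, assuming a hypothetical model file, input name, and input shape. onnxruntime's run() releases the GIL, so a thread pool is enough to get real parallelism; whether OMP_NUM_THREADS is honoured depends on whether the installed wheel was built with OpenMP:

```python
import os
os.environ["OMP_NUM_THREADS"] = "2"   # e.g. 24 hardware threads / 12 workers; set before importing onnxruntime

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("unet.onnx")                     # hypothetical model file
batch = np.random.rand(64, 1, 64, 64, 64).astype(np.float32)    # 64 dummy 3D volumes
n_workers = 12

def run_chunk(chunk):
    # session.run() is thread-safe, so all workers can share one session
    return session.run(None, {"input": chunk})[0]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(run_chunk, np.array_split(batch, n_workers)))

outputs = np.concatenate(results)
```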
Could you please share information that would allow me to reproduce your experiment? For example, the model and a full Python script, so that I can debug what happens inside.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Hi, I'm also noticing that sess.run() uses multiple threads in the Python API. Is there any way to disable this? I'd similarly like to call sess.run() in parallel (i.e. call it from different threads, as I'm trying to run parallel simulations for Monte Carlo analysis).
Hey,
I am running inference on a large 3D U-Net. I export the net from PyTorch to ONNX and then run it with onnxruntime (a representative sketch of both steps is included after this message).
I installed onnx and onnxruntime using pip, and I have OpenMP on my Ubuntu 18 machine. The machine has 12 cores / 24 threads, but the inference time per image increases with the number of threads (I measure that by running under taskset with a variable number of CPUs).
I was not expecting this behaviour, but I don't fully understand how this library works. Is this expected?
In production, I have 200 predictions to make every time, and I am tempted to run them in a multi-threaded way (24 parallel calls to 'run' on separate threads). Do you recommend this approach over using a single inference session on all CPUs (and potentially tuning a batch size)? How can I make this procedure thread-safe and disable OpenMP multi-threading?
Thanks in advance!
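For context, a representative sketch of the two steps described in the message above, using a stand-in layer instead of the actual 3D U-Net and placeholder file name, input name, and shape (none of this is the original code):

```python
import numpy as np
import torch
import onnxruntime as ort

# Stand-in for the real 3D U-Net; the input shape below is also a placeholder.
model = torch.nn.Conv3d(1, 1, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 1, 64, 64, 64)

# Export the PyTorch model to ONNX.
torch.onnx.export(model, dummy, "unet.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)

# Run inference with onnxruntime.
session = ort.InferenceSession("unet.onnx")
image = np.random.rand(1, 1, 64, 64, 64).astype(np.float32)
prediction = session.run(None, {"input": image})[0]
```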