gemm execution time is unstable in multi process system #4651
Comments
I'm not sure I understand your test setup correctly: "six threads each in six processes" reads as if the total number of threads exceeds the capacity of your CPU. That said, current OpenBLAS does not specifically recognize the Cortex A78(?) of the Jetson Orin (it treats it as a generic ARMV8 CPU), so GEMM parameters tuned to cache size will probably be wrong.
Thanks for your reply. Only two processes were created and assigned to CPUs 1 and 2, and the number of threads used for cblas_sgemm() in each process was set to two with openblas_set_num_threads(2).
According to the build results, it seems to recognize ARM64.
In our current code, we change the number of threads repeatedly at runtime. Will repeatedly changing the number of OpenBLAS threads at runtime in this way affect the execution time in certain processes?
Hello,
We are using the NVIDIA Jetson Orin platform. In a multi-process system, each process is assigned to a specific CPU and, in turn, parallelizes cblas_sgemm() across the remaining cores.
When the calculation is performed with only one process, the execution time of cblas_sgemm() is almost constant, but as the number of processes increases, the calculation time becomes unstable.
Although cblas_sgemm() with a single thread takes about 25 ms, running it with 6 threads in each of 6 processes makes the running time fluctuate between 10 ms and 60 ms.
A detailed description of the situation
We have 11 CPU cores, numbered 1 to 11.
We create six processes, assign them to CPUs 1 to 6, synchronize their start time, and run them simultaneously.
Partway through, each process takes a semaphore lock and then, one process at a time, runs cblas_sgemm() in parallel with a total of 6 threads: the calling thread plus 5 threads on the remaining cores, CPUs 7 to 11.
In this case, cblas_sgemm() in the processes assigned to cores 1 to 5 takes about 11 ms, and the execution time is constant.
However, in the last process, assigned to core 6, cblas_sgemm() is unstable and takes 30 ms to 60 ms, which is slower than running with one thread.
The same code is executed for cores 1 to 6.
The code currently in use is as follows.
openblas_thread = 6;
openblas_set_num_threads(openblas_thread); // set the number of OpenBLAS threads to 6

CPU_ZERO(&cpuset);
CPU_SET(sched_getcpu(), &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset); // re-pin this thread to its originally assigned CPU (1 to 6, respectively)

for (int k = 0; k < openblas_thread - 1; k++) { // pin the OpenBLAS threads to CPU cores 7 to 11, excluding the main process (thread)
    CPU_ZERO(&cpuset);
    CPU_SET(11 - k, &cpuset);
    openblas_setaffinity(k, sizeof(cpuset), &cpuset);
}

for (int i = 0; i < 30; i++) {
    gemm_wrapper_func();
}
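One thing worth ruling out with a setup like this is whether the affinity requests actually took effect. The following is a minimal sketch and not part of the code in use: `pin_and_verify` is a hypothetical helper name, and it uses the Linux call `sched_setaffinity(0, ...)`, which applies to the calling thread, in place of `pthread_setaffinity_np`. It pins the caller to one CPU and reads the mask back from the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one CPU, then read the
 * affinity mask back. Returns 0 only if the kernel reports exactly that
 * one CPU, and -1 otherwise. */
int pin_and_verify(int cpu)
{
    cpu_set_t want, got;

    /* An out-of-range CPU number is a programming error. */
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    CPU_ZERO(&want);
    CPU_SET(cpu, &want);
    if (sched_setaffinity(0, sizeof(want), &want) != 0)
        return -1; /* rejected, e.g. CPU offline or not allowed for this process */

    CPU_ZERO(&got);
    if (sched_getaffinity(0, sizeof(got), &got) != 0)
        return -1;

    return (CPU_COUNT(&got) == 1 && CPU_ISSET(cpu, &got)) ? 0 : -1;
}
```

Calling `pin_and_verify(sched_getcpu())` at the start of each process would mirror the first three affinity lines above; a nonzero return means the mask is not what was requested.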
When I measured the execution time inside the OpenBLAS code, I confirmed that the calculation time becomes unstable while processing the cblas_sgemm() call made in the wrapper function.
Is the code above the correct way to use OpenBLAS?
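For reference, the per-call measurement described above could be collected with something like the sketch below. It is an illustration under assumptions, not the measurement code actually used: `now_ms`, `summarize`, `time_calls` and `struct latency_stats` are hypothetical names, and the timed function is assumed to take no arguments and return nothing, as `gemm_wrapper_func()` appears to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock time in milliseconds. */
double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Hypothetical summary of per-call latencies. */
struct latency_stats {
    double min_ms;
    double max_ms;
    double mean_ms;
};

/* Compute min, max and mean over n samples (n must be at least 1). */
struct latency_stats summarize(const double *samples_ms, size_t n)
{
    assert(n > 0);
    struct latency_stats s = { samples_ms[0], samples_ms[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (samples_ms[i] < s.min_ms) s.min_ms = samples_ms[i];
        if (samples_ms[i] > s.max_ms) s.max_ms = samples_ms[i];
        sum += samples_ms[i];
    }
    s.mean_ms = sum / (double)n;
    return s;
}

/* Time n calls of fn, store each duration in samples_ms, and summarize. */
struct latency_stats time_calls(void (*fn)(void), double *samples_ms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t0 = now_ms();
        fn();
        samples_ms[i] = now_ms() - t0;
    }
    return summarize(samples_ms, n);
}
```

In the loop above this would be used as `time_calls(gemm_wrapper_func, samples, 30)` with a `double samples[30]` buffer, so that the spread between `min_ms` and `max_ms` can be compared across the six processes.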