I am not sure if there are future plans to expand OpenMP support in Nim, but here is something I'd like to have:
Context
Nim's code generator emits all for and while loops as while (true) loops in the generated C.
The only exception is Nim's || iterator (doc), which produces #pragma omp parallel for followed by a normal C for loop.
However, there are more OpenMP constructs that require an actual C for loop than #pragma omp parallel for, for example #pragma omp for, #pragma omp simd, #pragma omp taskloop and #pragma omp distribute, see the OpenMP quicksheet.
Proposed solution
The || iterator should be changed so that it only produces #pragma omp plus a true C for loop.
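Here is a rough sketch of how this could look. It is hypothetical: it assumes the annotation string passed to || would be forwarded verbatim after #pragma omp and that a real C for loop is emitted; n, a, b, c and process are just placeholders.

# Hypothetical usage under the proposed change: the annotation replaces
# "parallel for" entirely in the emitted pragma.
for i in `||`(0, n - 1, "simd"):        # would emit: #pragma omp simd
  c[i] = a[i] + b[i]

for i in `||`(0, n - 1, "taskloop"):    # would emit: #pragma omp taskloop
  process(a[i])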
Applications
One example application is a sum reduction, since all threads have to share the result variable.
Here are the main concerns with reductions:
1. We have to update a common variable from all threads (a naive version is sketched below).
2. To avoid locking, each thread should keep track of its own partial sum.
3. Afterwards, we merge all partial sums into the final result.
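For reference, a minimal sketch of the naive approach that point 1 rules out: every thread updates the shared accumulator directly, which is a data race (the proc name is made up for illustration).

proc reduction_naive(s: seq[int]): int =
  for i in 0||(s.len-1):
    # `result` is shared by all OpenMP threads here:
    # concurrent `+=` is a data race and gives wrong results.
    result += s[i]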
Overview of OpenMP solutions
Solution 1: Reference serial code
func reduction_serial(s: seq[int]): int =
  for val in s:
    result += val
Solution 2: using OpenMP reduction clause
OpenMP offers a reduction clause, but it is very restrictive in the operations it supports: +, *, min, max. It also requires the use of emit to avoid Nim's addInt call in the generated C.
Furthermore, due to #9365, we need our own name mangling proc.
The big advantage is that it does all steps 1, 2, 3 for us.
proc reduction_reduction_clause(s: seq[int]): int =
  var sum {.exportc: "sum_" & omp_suffix(genNew = true).} = 0
  const omp_annotation = "reduction(+:sum_" & omp_suffix(genNew = false) & ')'
  # very restrictive in terms of ops supported (add, mul, min, max)
  for i in `||`(0, s.len - 1, omp_annotation):
    let si = s[i]
    {.emit: "`sum` += `si`;".}
  return sum
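For reference, the generated C should contain roughly #pragma omp parallel for reduction(+:sum_<suffix>) above a real for loop. This is also why the accumulator is exportc'd under a predictable name: the reduction clause has to refer to the variable by its C identifier. (This describes the expected codegen, not verbatim compiler output.)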
Solution 3: using padding
To be less limited, we can use a shared array or seq to store the partial sums; the main issue is determining its size. We can base it on the maximum number of hardware threads, but it might be that within the OpenMP parallel section not all threads are used.
For example, assume we allocate newSeq[int](4) on a machine with 4 threads but OpenMP only uses 2 at runtime. If we do a sum it is fine to add the extra zeroes, but if we are doing a product/min/max it is not. And omp_get_num_threads() is always 1 when not in a parallel section, while we can't use it in an omp parallel for section without calling it from each thread at each iteration.
Additionally, to avoid false sharing, we must pad the partial_sums array or seq so that each partial sum is separated by at least a cache line (64 bytes).
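As a quick sanity check on the padding arithmetic (a sketch, assuming 8-byte ints and a 64-byte cache line):

const
  CacheLineSize = 64
  Padding = max(1, CacheLineSize div sizeof(int))  # 64 div 8 = 8 ints per slot
# Thread t writes partial_sums[t * Padding], i.e. indices 0, 8, 16, ...
# so two threads' slots are at least one cache line apart.
doAssert Padding * sizeof(int) >= CacheLineSize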
proc reduction_padding(s: seq[int]): int =
  const
    CacheLineSize = 64 # To avoid false sharing, partial_sums must be padded by a CPU cache line size
    Padding = max(1, CacheLineSize div sizeof(int))
  var partial_sums = newSeq[int](omp_get_max_threads() * Padding)
  # For sum, 0 is a neutral element, but if we wanted to do "product"
  # the seq would have to be initialized with ones in case OpenMP schedules fewer
  # threads than omp_get_max_threads.
  doAssert omp_get_num_threads() == 1 # Outside of a parallel section
  for i in 0||(s.len-1):
    # There is no way to get the number of threads used
    # except in the inner loop that will be repeated `s.len` times
    partial_sums[omp_get_thread_num() * Padding] += s[i]
  for i in 0 ..< omp_get_max_threads():
    # We assume all threads were used, but it might not be true
    result += partial_sums[i * Padding]
Solution 4: most flexible but not working at the moment
By using a local var, we can avoid the padding, omp_get_num_threads and extra allocation issues, at the cost of using a mutex (an OpenMP critical section) to merge the partial sums.
The main benefit is that we can now split the parallel work into:

omp_parallel:
  initialization_code
  actual_for_loop
  finalization_code / merging
proc reduction_localvar(s: seq[int]): int =
  # var num_threads_used: int
  omp_parallel:
    ### Initialization
    # # We have a proper way to get the number of threads used
    # # without setting them repeatedly in the for loop
    # omp_master:
    #   num_threads_used = omp_get_num_threads()
    var local_sum = 0 # Variables declared in a parallel region are automatically private to each thread
                      # http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-data.html#Default

    ### For loop
    for i in 0||(s.len-1): # Unfortunately this does "omp parallel for" instead of just "omp for"
      local_sum += s[i]

    ### Finalization
    omp_critical: # This will use a mutex and can accept any C code (omp atomic would not work with Nim)
      result += local_sum
But here the nested #pragma omp parallel for, instead of just #pragma omp for, will share local_sum incorrectly due to the creation of another omp parallel scope. It would work with just #pragma omp for.
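For completeness, here is a hypothetical sketch of how Solution 4 could be written if the proposed || change existed and the annotation were forwarded verbatim, so that "for" yields a plain worksharing #pragma omp for inside the enclosing parallel region (the proc name is made up; this does not work today).

proc reduction_localvar_fixed(s: seq[int]): int =
  # Hypothetical: relies on the proposed `||` behaviour.
  omp_parallel:
    var local_sum = 0                      # private to each thread
    for i in `||`(0, s.len - 1, "for"):    # would emit: #pragma omp for
      local_sum += s[i]
    omp_critical:                          # merge the partial sums under a mutex
      result += local_sum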
Full demo
# Compile with
# `nim c --stackTraces:off -r -d:openmp omp_reduction.nim`
#
# On mac, the default Clang doesn't support OpenMP, use
# `nim c --stackTraces:off --cc:gcc --gcc.exe:"/usr/local/bin/gcc-7" --gcc.linkerexe:"/usr/local/bin/gcc-7" -r -d:openmp omp_reduction.nim`

import sequtils

when defined(openmp):
  {.passC: "-fopenmp".}
  {.passL: "-fopenmp".}

  {.pragma: omp, header:"omp.h".}

  proc omp_set_num_threads*(x: cint) {.omp.}
  proc omp_get_num_threads*(): cint {.omp.}
  proc omp_get_max_threads*(): cint {.omp.} # This takes hyperthreading into account
  proc omp_get_thread_num*(): cint {.omp.}
else:
  template omp_set_num_threads*(x: cint) = discard
  template omp_get_num_threads*(): cint = 1
  template omp_get_max_threads*(): cint = 1
  template omp_get_thread_num*(): cint = 0

template omp_parallel*(body: untyped): untyped =
  {.emit: "#pragma omp parallel".}
  block:
    body

template omp_critical*(body: untyped): untyped =
  {.emit: "#pragma omp critical".}
  block:
    body

template omp_master*(body: untyped): untyped =
  {.emit: "#pragma omp master".}
  block:
    body

# #########################################
# OMP mangling, workaround for https://github.com/nim-lang/Nim/issues/9365

import random
from strutils import toHex

var mangling_rng {.compileTime.} = initRand(0x1337DEADBEEF)
var current_suffix {.compileTime.} = ""

proc omp_suffix(genNew: static bool = false): static string =
  if genNew:
    current_suffix = mangling_rng.rand(high(uint32)).toHex
  result = current_suffix

# #########################################

func reduction_serial(s: seq[int]): int =
  for val in s:
    result += val

proc reduction_reduction_clause(s: seq[int]): int =
  var sum {.exportc: "sum_" & omp_suffix(genNew = true).} = 0
  const omp_annotation = "reduction(+:sum_" & omp_suffix(genNew = false) & ')'
  # very restrictive in terms of ops supported (add, mul, min, max)
  for i in `||`(0, s.len - 1, omp_annotation):
    let si = s[i]
    {.emit: "`sum` += `si`;".}
  return sum

proc reduction_padding(s: seq[int]): int =
  const
    CacheLineSize = 64 # To avoid false sharing, partial_sums must be padded by a CPU cache line size
    Padding = max(1, CacheLineSize div sizeof(int))
  var partial_sums = newSeq[int](omp_get_max_threads() * Padding)
  # For sum, 0 is a neutral element, but if we wanted to do "product"
  # the seq would have to be initialized with ones in case OpenMP schedules fewer
  # threads than omp_get_max_threads
  doAssert omp_get_num_threads() == 1 # Outside of a parallel section
  for i in 0||(s.len-1):
    # There is no way to get the number of threads used
    # except in the inner loop that will be repeated `s.len` times
    partial_sums[omp_get_thread_num() * Padding] += s[i]
  for i in 0 ..< omp_get_max_threads():
    # We assume all threads were used, but it might not be true
    result += partial_sums[i * Padding]

proc reduction_localvar(s: seq[int]): int =
  # var num_threads_used: int
  omp_parallel:
    ### Initialization
    # # We have a proper way to get the number of threads used
    # # without setting them repeatedly in the for loop
    # omp_master:
    #   num_threads_used = omp_get_num_threads()
    var local_sum = 0 # Variables declared in a parallel region are automatically private to each thread
                      # http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-data.html#Default

    ### For loop
    for i in 0||(s.len-1): # Unfortunately this does "omp parallel for" instead of just "omp for"
      local_sum += s[i]

    ### Finalization
    omp_critical: # This will use a mutex and can accept any C code (omp atomic would not work with Nim)
      result += local_sum

let s = toSeq(1..100)
let expected = 100 * (100+1) div 2

echo "Expected:             ", expected                       # 5050
echo "Serial:               ", s.reduction_serial             # 5050
echo "OMP reduction clause: ", s.reduction_reduction_clause   # 5050
echo "OMP padding:          ", s.reduction_padding            # 5050
echo "OMP localvar:         ", s.reduction_localvar           # 20200 on a dual-core with hyper-threading