NUMA-Aware 2d-stencil #5134
-
We have 2d-stencil examples here: https://github.com/STEllAR-GROUP/tutorials. I also know that @NK-Nikunj has worked on porting something similar to ARM architectures with wide vectorization. He might be able to elaborate.
-
@topkanoguzhan, from what I understand, your partition_data stores a collection of data points. The performance gap between non-NUMA-aware and NUMA-aware containers is significant (30-40%, sometimes even more), and closing it should make up for the difference between HPX and the other runtime you mention. You can find a NUMA-aware implementation of 2D stencil codes here.

Also, given the pattern of updates, I would not suggest explicitly vectorizing the code (see this), as it won't give you more than a 10-15% difference. If you still want to squeeze out that extra performance, you can look into my implementation of the 2D stencil with explicit vectorization. Initializing a different data layout and then maintaining the halo regions can get tricky, and for the hours invested the extra performance boost isn't really worth it.
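Roughly, the container-level approach looks like the sketch below. It follows the pattern of the NUMA-aware stencil in the tutorials (numa_domains, block_allocator and block_executor from hpx::compute::host); exact headers and namespaces differ between HPX versions, so treat it as an illustration rather than drop-in code:

```cpp
// Illustrative sketch only: NUMA-aware allocation + first-touch in HPX.
// Header paths and namespaces vary across HPX versions.
#include <hpx/hpx_main.hpp>
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_for_each.hpp>

#include <cstddef>

int main()
{
    std::size_t const nx = 4096, ny = 4096;

    // One target per NUMA domain the process is bound to
    // (run with a suitable binding, e.g. --hpx:bind=balanced).
    auto numa_domains = hpx::compute::host::numa_domains();

    // Allocator that distributes (and first-touches) the memory
    // block-wise across those domains.
    using allocator_type = hpx::compute::host::block_allocator<double>;
    using data_type = hpx::compute::vector<double, allocator_type>;

    allocator_type alloc(numa_domains);
    data_type grid(nx * ny, 0.0, alloc);

    // Executor scheduling parallel loops on the same domains, so the
    // threads touching a block run close to where it lives.
    hpx::compute::host::block_executor<> exec(numa_domains);

    hpx::for_each(hpx::execution::par.on(exec),
        grid.begin(), grid.end(), [](double& v) { v = 1.0; });

    return 0;
}
```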
-
Hello everyone,
My name is Oguzhan and I wanted to ask for some advice on a little project I am working on.
It is similar in structure to the 4th stage of the 1d-stencil heat simulation from:
https://hpx-docs.stellar-group.org/latest/html/examples/1d_stencil.html
The space is 2-dimensional, but imagine there are 3 grids A, B and C, where A and B are
simulated over multiple iterations while C remains static and is involved in the
computations of A and B.
In every time step:
- A[it], B[it] and C are used to compute B[it+1]
- B[it+1], A[it] and C are used to compute A[it+1]
For each point in the grid, neighboring elements are also required.
The grids are decomposed similarly to the example in the documentation,
but the space is now a vector of partition_data futures that represent
2-dimensional partitions.
The for_each parallel algorithm is used to initialize the grids in parallel: it iterates
over the rows of blocks, each iteration initializing that row's partitions of A, B and C.
Inside a loop over the timesteps, the tasks are then generated in the order described above.
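For concreteness, the per-timestep wiring is roughly like this simplified sketch (placeholder types and kernels in the style of 1d_stencil_4, not my actual code):

```cpp
// Simplified sketch of the task wiring per timestep (placeholder types
// and kernels, following the 1d_stencil_4 pattern from the HPX docs).
#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/util.hpp>

#include <cstddef>
#include <utility>
#include <vector>

// One 2D block of the grid (row-major); the real partition_data also
// carries the halo handling.
struct partition_data
{
    std::vector<double> values;
};

using partition = hpx::shared_future<partition_data>;
using space = std::vector<partition>;    // one entry per block

// Placeholder kernels; the real ones also take the neighbouring blocks.
partition_data update_B(partition_data const& a, partition_data const& b,
    partition_data const& c)
{
    return b;    // ... combine A[it], B[it] and C into B[it+1] ...
}

partition_data update_A(partition_data const& b_next,
    partition_data const& a, partition_data const& c)
{
    return a;    // ... combine B[it+1], A[it] and C into A[it+1] ...
}

int main()
{
    std::size_t const num_blocks = 8;
    std::size_t const nt = 100;

    // A and B evolve over time, C stays static.
    space A(num_blocks), B(num_blocks), C(num_blocks);
    for (std::size_t i = 0; i != num_blocks; ++i)
    {
        A[i] = hpx::make_ready_future(partition_data{});
        B[i] = hpx::make_ready_future(partition_data{});
        C[i] = hpx::make_ready_future(partition_data{});
    }

    auto Op_B = hpx::unwrapping(&update_B);
    auto Op_A = hpx::unwrapping(&update_A);

    for (std::size_t t = 0; t != nt; ++t)
    {
        space A_next(num_blocks), B_next(num_blocks);
        for (std::size_t i = 0; i != num_blocks; ++i)
        {
            // B[it+1] depends on A[it], B[it] and C ...
            B_next[i] =
                hpx::dataflow(hpx::launch::async, Op_B, A[i], B[i], C[i]);
            // ... and A[it+1] depends on B[it+1], A[it] and C.
            A_next[i] = hpx::dataflow(
                hpx::launch::async, Op_A, B_next[i], A[i], C[i]);
        }
        std::swap(A, A_next);
        std::swap(B, B_next);
    }

    hpx::wait_all(A);
    hpx::wait_all(B);
    return 0;
}
```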
This version is about 10% faster than a naive OpenMP solution, but a solution from
another task-parallel runtime system that I am trying to compete with achieves a
~25-30% faster runtime than OpenMP.
The question I had was how I could better achieve NUMA awareness in my solution.
On a setup with 2 NUMA nodes, my aim is to place the upper half of each grid on one node
and the lower half on the other, and to schedule the tasks so that the expensive memory
loads from one NUMA node to the other are limited to the 2 middle rows of partitions,
where only vertical neighbours need to be loaded.
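In pseudo-code, what I am hoping for is something along these lines (purely illustrative, using the hpx::compute::host names as placeholders for whatever the right mechanism turns out to be):

```cpp
// Illustrative only: bind the initialization (first touch) of the upper
// and lower halves of the block rows to different NUMA domains.
// block_row and init_row stand in for my actual partition types.
#include <hpx/hpx_main.hpp>
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_for_each.hpp>

#include <cstddef>
#include <vector>

struct block_row { std::vector<double> values; };

void init_row(block_row& row) { row.values.assign(1024 * 1024, 0.0); }

int main()
{
    std::vector<block_row> rows(16);

    auto domains = hpx::compute::host::numa_domains();    // expect 2 here

    // One executor restricted to each NUMA domain.
    hpx::compute::host::block_executor<> top_exec(
        std::vector<hpx::compute::host::target>{domains[0]});
    hpx::compute::host::block_executor<> bottom_exec(
        std::vector<hpx::compute::host::target>{domains[1]});

    std::size_t const mid = rows.size() / 2;

    // First-touch the upper half of the block rows on NUMA node 0 ...
    hpx::for_each(hpx::execution::par.on(top_exec),
        rows.begin(), rows.begin() + mid, &init_row);

    // ... and the lower half on NUMA node 1, so that only the two middle
    // rows need cross-node loads for their vertical neighbours.
    hpx::for_each(hpx::execution::par.on(bottom_exec),
        rows.begin() + mid, rows.end(), &init_row);

    return 0;
}
```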
Are there any tools or mechanisms available to achieve this?
I tried to look for something in the GitHub guide to HPX,
but many of the examples have been deleted.
Thank you very much in advance