GPU Interop #195
Conversation
Out of curiosity, regarding data alignment in type_.py, what are the types that particularly need the alignment?
@@ -765,6 +922,107 @@ def davidson(mcdc):

@njit(cache=caching)
def generate_precursor_particle(DNP, particle_idx, seed_work, prog):
If I understand this correctly, we are changing this from being wholly within the for loop on line 1062 to a separate function.
You are correct. This is being extracted out so that it may also be called in make_work on line 1104 without duplicating the logic. This way, if we change this function, the logic carries over for both CPU and GPU.
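A minimal sketch of the extraction pattern being described, with placeholder bodies; none of this is MC/DC's actual kernel logic, and cpu_precursor_loop and gpu_make_work are illustrative names:

```python
# The per-precursor logic lives in exactly one function, so the CPU loop and
# the GPU program's work factory share it rather than duplicating it.

def generate_precursor_particle(DNP, particle_idx, seed_work, prog):
    # Placeholder body: in MC/DC this samples a particle from the delayed
    # neutron precursor and banks it; here we just return a record.
    return {"precursor": DNP, "idx": particle_idx, "seed": seed_work}

def cpu_precursor_loop(precursor_bank, seed_work, prog):
    # CPU path: the plain for loop the function was extracted from.
    return [
        generate_precursor_particle(dnp, i, seed_work, prog)
        for i, dnp in enumerate(precursor_bank)
    ]

def gpu_make_work(precursor_bank, seed_work, prog):
    # GPU path: a stand-in for the make_work hook, yielding work items that
    # the asynchronous GPU program consumes.
    for i, dnp in enumerate(precursor_bank):
        yield generate_precursor_particle(dnp, i, seed_work, prog)
```

Any change to generate_precursor_particle then automatically applies to both targets.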
There are a couple of issues that I would like to bring up before this is merged:
All types need to be aligned, but whether or not something needs to be done to align them is context dependent. For the sake of alignment, structs need to be laid out assuming that the base address is divisible by the largest alignment size we care about. From there, fields are laid out in sequence, in the order they appear in the list, laying out sub-structs recursively. By default, Numba packs all fields next to each other, with no additional alignment considerations.

An example of a case where padding is needed is an 8-byte field (A), followed by a 1-byte field (B), followed by an 8-byte field (C). This is how Numba would lay it out in memory (each letter representing a byte):

AAAAAAAABCCCCCCCC

This seems sensible, but then you notice that A and C cannot both be aligned to a base address divisible by 8. To ensure both are aligned, some padding must be provided (shown here as dashes):

AAAAAAAAB-------CCCCCCCC

Padding like this (though of differing amounts) would be necessary for any combination of A and C with sizes greater than 1 byte. Technically speaking, 1-byte types could be considered types that "don't care about alignment", but it would be more accurate to say that it is impossible to make them not aligned, since all addresses are divisible by 1.
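As a concrete illustration of the padding rule described above, here is a hedged sketch in plain NumPy; pad_fields and its pad-to-own-size rule are illustrative assumptions, not MC/DC's actual align function:

```python
import numpy as np

# Illustrative helper: pad each field so its offset is divisible by its own
# size, assuming the struct's base address is aligned to the largest field.
def pad_fields(fields):
    padded = []
    offset = 0
    pad_id = 0
    for name, dtype in fields:
        size = np.dtype(dtype).itemsize
        gap = (-offset) % size  # bytes needed to reach the next multiple of size
        if gap:
            padded.append(("pad_%d" % pad_id, np.uint8, (gap,)))
            pad_id += 1
            offset += gap
        padded.append((name, dtype))
        offset += size
    return padded

# The A/B/C example above: 8-byte A, 1-byte B, 8-byte C.
fields = [("A", np.float64), ("B", np.uint8), ("C", np.float64)]
packed = np.dtype(fields)               # Numba-style packing: C lands at offset 9
aligned = np.dtype(pad_fields(fields))  # 7 pad bytes push C to offset 16
print(packed.fields["C"][1], aligned.fields["C"][1])  # -> 9 16
```

For comparison, np.dtype(fields, align=True) asks NumPy itself to insert padding, though the layout it chooses may not be the one the GPU side expects.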
No rush. Just wanted to unblock it from my end, since Kayla gave the go-ahead and nobody in the Slack seemed opposed.
That is because some information, including how it is presented/structured, is relevant only for the input interface, while other information is relevant only in the simulation global state, and vice versa. The reconciliation particularly happens in
Do we set up a GitHub workflow to do the GPU regression test in this PR? If not, or if that is not possible, what is the plan? @braxtoncuneo @jpmorgan98
    )
elif srun > 1:
    os.system(
-       "srun -n %i python input.py --mode=%s --output=output --no-progress-bar > tmp 2>&1"
        % (srun, mode)
+       "srun -n %i python input.py --mode=%s --target=%s --output=output --no-progress-bar > tmp 2>&1"
This is a really good idea!
I am setting up a GitHub local runner on the CEMeNT dev machine we have at OSU. I might need admin privileges to get the host installed, which will slow me down a bit, but I don't think OSU COE IT should have too much of a problem helping me out. From there, I think we can run whatever we want (CPU and Nvidia GPU runs) directly from the GitHub page. I was thinking we could do some light performance testing per PR to make sure that a given PR won't slow down the code for GPUs or CPUs too much.
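For reference, a hedged sketch of how the suggested --target flag might be threaded through the harness command quoted above; run_under_srun and the example argument values are hypothetical, not the PR's actual test script:

```python
import os

# Hypothetical wrapper mirroring the elif branch in the diff above: build the
# srun command with the --target flag included and shell it out.
def run_under_srun(srun, mode, target):
    assert srun > 1  # this path is only taken for multi-rank runs
    cmd = (
        "srun -n %i python input.py --mode=%s --target=%s "
        "--output=output --no-progress-bar > tmp 2>&1" % (srun, mode, target)
    )
    return os.system(cmd)

# e.g. run_under_srun(2, "numba", "gpu")  # illustrative argument values
```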
type_roster = {}

def copy_fn_for(kind, name):
Do we use/need type_roster and copy_fn_for? @braxtoncuneo
OK, I got the runner up and going. I am going to try to get Harmonize to auto-configure with MC/DC via the install script, add the proper runner, then add a commit to this PR.
Strangely, Ilham's latest commit is failing in the CEMeNT repo but passing in the fork. I'm going to run the regression tests locally to try to figure out a cause.
Incorporates GPU interop via Harmonize, achieved through the following changes:

- loop_source and loop_precursor_source must be decomposed into their constituent components:
  - generate_source_particle: does as it says, separated out so that it may be called separately within the make_work function of the GPU program
  - step_particle: essentially the body of the while loop in loop_particle, separated out so that the looping functionality can be alternately handled as async calls
  - loop_particle: handles only the calls to the particle setup/teardown logic, as well as iteratively calling the step_particle functionality
  - exhaust_active_bank: as it says, loops particles from the active bank until it is exhausted
  - source_closeout: handles closeout of the loop_source function. This functionality is called only once by the GPU analog to loop_source
  - source_dd_resolution: resolves domain decomposition
  - loop_source: as it was, but with components separated
  - gpu_sources_spec: generates a Harmonize program specification analogous to the exhaust_active_bank function
  - gpu_loop_source: an alternate version of loop_source that uses the GPU program generated by gpu_sources_spec to perform transport on GPU
  - generate_precursor_particle: analogous to generate_source_particle, but for the loop_precursor_particle functionality
  - source_precursor_closeout: analogous to source_closeout, but for loop_precursor_particle
  - loop_source_precursor: as it was, but with components separated
  - gpu_precursor_spec: analogous to gpu_sources_spec, but for loop_precursor_particle
  - gpu_loop_spec: analogous to gpu_loop_source, but for loop_precursor_particle
- An align function was added, which ensures the field lists fed to np.dtype fulfill the alignment requirements discussed above.
- for_cpu, for_gpu, and toggle decorators were created. The first two register the decorated functions for either CPU or GPU jit targets in Numba. The last one replaces a function with an empty-bodied stand-in if the input toggle is set to False, thus avoiding compilation issues for functions that will only be called on CPU but which use functionality that would be an error for the GPU jit compiler (see the sketch after this list).
- add_active and add_census were defined.
- local_particle and local_group_array are used to make local structs/arrays.
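As a rough illustration of the toggle decorator described above, here is a minimal sketch; the decorator body, the running_on_cpu flag, and the example function are assumptions, not the PR's actual implementation:

```python
import functools

def toggle(flag):
    """Keep the decorated function if flag is True; otherwise swap in an
    empty-bodied stand-in so the GPU jit compiler never sees its body."""
    def decorator(func):
        if flag:
            return func

        @functools.wraps(func)
        def stub(*args, **kwargs):
            pass  # intentionally does nothing

        return stub

    return decorator

# Hypothetical usage: progress-bar printing is CPU-only functionality that
# would be an error under the GPU jit target, so it is compiled away there.
running_on_cpu = True

@toggle(running_on_cpu)
def print_progress(percent):
    print("progress: %d%%" % percent)
```

Presumably for_cpu and for_gpu are thin registration decorators in the same spirit, recording which Numba jit target each function is intended for.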