GPU Interop #195

Merged
merged 41 commits into from
May 8, 2024
Conversation

braxtoncuneo
Collaborator

Incorporates GPU interop via Harmonize, achieved through the following changes:

  • Restructuring of loop.py:
    • In order to execute particle simulation in parallel, the functionality that needs to be evaluated in parallel needs to be separated from the code that executes this functionality in the typical (serial) fashion. Hence, loop_source and loop_precursor_source must be decomposed into their constituent components.
      • generate_source_particle : does as it says, separated out so that it may be called separately within the make_work function of the GPU program
      • step_particle : essentially is the body of the while loop in loop_particle, separated out so that the looping functionality can be alternately handled as async calls
      • loop_particle : handles only the calls to the particle setup/teardown logic, as well as iteratively calling the step_particle functionality
      • exhaust_active_bank : as it says, loops particles from the active bank until it is exhausted
      • source_closeout : handles closeout of the loop_source function. This functionality is called only once by the gpu analog to loop_source
      • source_dd_resolution: resolves domain decomposition
      • loop_source : as it was, but with components separated
      • gpu_sources_spec : generates a Harmonize program specification analogous to the exhaust_active_bank function
      • gpu_loop_source: an alternate version of loop_source that uses the GPU program generated by gpu_sources_spec to perform transport on GPU
      • generate_precursor_particle: analogous to generate_source_particle, but for the loop_precursor_particle functionality
      • source_precursor_closeout: analogous to source_closeout, but for loop_precursor_particle
      • loop_source_precursor: as it was, but with components separated
      • gpu_precursor_spec: analogous to gpu_sources_spec, but for loop_precursor_particle
      • gpu_loop_spec: analogous to gpu_loop_source, but for loop_precursor_particle
  • Alignment of types:
    • While CPU execution can robustly handle all sorts of Numba types, GPU execution requires structs to follow some of the basic properties expected of C-style structs with standard layout:
      • Every primitive field is aligned by its size, and padding is inserted between fields to ensure alignment in arrays and nested data structures
      • Every field has a unique address
    • If these rules are violated, memory accesses made on GPUs may encounter problems. For example, when an access is made at an address that is not aligned to the size of the accessed type, a segfault or similar fault will occur, or information will be lost. These issues were fixed by providing a function, align, which ensures that the field lists fed to np.dtype fulfill these requirements. This function does the following:
      • Tracks the cumulative offset of fields as they appear in the input list.
      • Inserts additional padding fields to ensure that primitive fields are aligned by their size
      • Re-sizes arrays to have at least one element (this ensures they have a non-zero size, and hence cannot share a base address with other fields).
  • Adapters:
    • The same sort of functionality may need to be implemented differently for CPU vs GPU. To this end, the for_cpu, for_gpu, and toggle decorators were created. The first two decorators register their inputs for either the CPU or the GPU jit target in Numba. The last one replaces a function with an empty-bodied stand-in if the input toggle is set to False, thus avoiding compilation issues for functions that will only be called on CPU but that use functionality that would be an error for the GPU jit compiler.
    • What happens to particles when they are "banked" changes depending upon which platform is being targeted, and upon the bank itself. Notably, adding to the active bank on GPU corresponds with the re-scheduling of the input particles for additional execution, rather than simply storing the particles into a bank. To enable automatic switching between the two, functions such as add_active and add_census were defined.
    • Both GPUs and CPUs may have arrays as local variables, but they require calls to different functions to be created, and using the function for a CPU array is a compiler error for the GPU jit compiler (and vice-versa). To avoid this problem, functions such as local_particle and local_group_array are used to make local structs/arrays.
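
As a rough illustration of the adapter idea described above, here is a minimal plain-Python sketch. It is not the MC/DC implementation: the real decorators work with Numba jit targets, whereas the registry, the `dispatch` helper, the `TARGET` switch, the boolean argument to `toggle`, and the `write_progress` function below are all assumptions made for the example.

```python
# Hypothetical sketch: the names for_cpu, for_gpu, toggle and add_active come
# from the PR description, but these bodies are plain-Python stand-ins.

TARGET = "cpu"  # assumed build-time switch; "gpu" would select the other set

_implementations = {"cpu": {}, "gpu": {}}


def for_cpu(func):
    """Register `func` as the CPU implementation of its name."""
    _implementations["cpu"][func.__name__] = func
    return func


def for_gpu(func):
    """Register `func` as the GPU implementation of its name."""
    _implementations["gpu"][func.__name__] = func
    return func


def dispatch(name, target=None):
    """Look up the implementation of `name` for the given (or global) target."""
    return _implementations[target or TARGET][name]


def toggle(enabled):
    """Keep the function when `enabled` is true, else swap in an empty stand-in."""

    def decorator(func):
        if enabled:
            return func

        def stand_in(*args, **kwargs):
            return None  # empty body: nothing target-specific left to compile

        stand_in.__name__ = func.__name__
        return stand_in

    return decorator


@for_cpu
def add_active(particle, state):
    # On CPU, banking simply stores the particle.
    state["bank_active"].append(particle)


@for_gpu
def add_active(particle, state):
    # On GPU, banking to the active bank means re-scheduling the particle.
    state["scheduled"].append(particle)


@toggle(TARGET == "cpu")
def write_progress(message, log):
    # Stands in for CPU-only functionality the GPU compiler would reject.
    log.append(message)
```

Calling `dispatch("add_active")` returns whichever version matches the active target, so the transport code can call one name on both platforms.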

Member

@ilhamv ilhamv left a comment

Out of curiosity, regarding data alignment in type_.py, what are the types that particularly need the alignment?

@@ -765,6 +922,107 @@ def davidson(mcdc):


@njit(cache=caching)
def generate_precursor_particle(DNP, particle_idx, seed_work, prog):
Collaborator

If I understand this correctly, we are moving this from wholly within the for loop on line 1062 into a separate function.

Collaborator Author

You are correct. This is being extracted out, so that it may also be called in make_work on line 1104 as well without duplicating the logic. This way, if we change this function, the logic carries over for both CPU and GPU.
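
To make the benefit of the extraction concrete, here is a small plain-Python sketch of the pattern. It is only an illustration under stated assumptions: the particle is a dict, the "physics" in `step_particle` is a toy rule, and `async_loop_source` with its queue is a hypothetical stand-in for the scheduling that Harmonize performs on GPU. Only the function names `generate_source_particle`, `step_particle`, `loop_particle`, `loop_source`, and `make_work` are taken from this PR.

```python
from collections import deque


def generate_source_particle(index, seed):
    """Build one particle; shared by the serial and the async driver."""
    return {"id": index, "seed": seed + index, "alive": True, "steps": 0}


def step_particle(particle):
    """One iteration of what used to be the body of the while loop."""
    particle["steps"] += 1
    # Toy rule (assumption): particle i survives (i % 3) + 1 steps.
    if particle["steps"] > particle["id"] % 3:
        particle["alive"] = False


def loop_particle(particle):
    """Serial driver: step one particle until it terminates."""
    while particle["alive"]:
        step_particle(particle)


def loop_source(n_particles, seed):
    """Serial path: generate each particle and loop it to completion."""
    finished = []
    for index in range(n_particles):
        particle = generate_source_particle(index, seed)
        loop_particle(particle)
        finished.append(particle)
    return finished


def make_work(queue, next_index, n_particles, seed, batch=4):
    """Enqueue up to `batch` new particles; return the next unused index."""
    stop = min(next_index + batch, n_particles)
    for index in range(next_index, stop):
        queue.append(generate_source_particle(index, seed))
    return stop


def async_loop_source(n_particles, seed):
    """Queue-driven path: each step is a separate unit of work."""
    queue, finished, next_index = deque(), [], 0
    while True:
        if not queue:
            next_index = make_work(queue, next_index, n_particles, seed)
            if not queue:
                break  # no work left to generate
        particle = queue.popleft()
        step_particle(particle)
        # A surviving particle is re-scheduled instead of looped in place.
        (queue if particle["alive"] else finished).append(particle)
    return finished
```

Both drivers call the same `generate_source_particle` and `step_particle`, so a change to either function carries over to both execution paths.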

@jpmorgan98 jpmorgan98 self-assigned this May 6, 2024
@jpmorgan98 jpmorgan98 added enhancement New feature or request hpc Issues relating to HPC deployments labels May 6, 2024
@braxtoncuneo
Collaborator Author

There are a couple of issues that I would like to bring up before this is merged:

  • It looks like some of the components of the global state struct have mismatching dimensions compared to the input deck's data. This is not an issue with GPU, but something pre-existing. Still, I wanted to bring it up since I've added in code that checks and reports some of these mismatches.
  • While domain decomposition still works with this refactor, it is now mysteriously slower vs the normal Python version. There's no telling how long hunting down the source of this slowdown will take, and I don't want to pause merges to the dev branch forever. If you are fine with merging this branch in now, I'll further investigate the slowdown in compiled DD afterwards.

@braxtoncuneo
Collaborator Author

Out of curiosity, regarding data alignment in type_.py, what are the types that particularly need the alignment?

All types need to be aligned, but whether or not something needs to be done to align them is context dependent. For the sake of alignment, structs need to be laid out assuming that the base address is divisible by the largest alignment size we care about. From there, fields are laid out in sequence, in the order they appear in the list, laying out sub-structs recursively. By default, Numba packs all fields next to each other, with no additional alignment considerations.

An example of a case where padding is needed is an 8-byte field (A), followed by a 1-byte field (B), followed by an 8-byte field (C).

This is how Numba would lay it out in memory (each letter representing a byte):
AAAAAAAABCCCCCCCC

This seems sensible, but then you notice that A and C cannot both be aligned to a base address divisible by 8. To ensure both are aligned, some padding must be provided:
AAAAAAAAB.......CCCCCCCC

Padding like this (though of differing amounts) would be necessary for any combination of A and C with sizes greater than 1 byte.

Technically speaking 1-byte types could be considered as types that "don't care about alignment", but it would be more accurate to say that it is impossible to make them not aligned, since all addresses are divisible by 1.
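
The padding rule described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `align` function from this PR: the helper name `align_fields` and the `padding_N` field names are made up for the example, and it handles only primitive fields given as `(name, dtype)` or `(name, dtype, shape)`, with no nested structs.

```python
import numpy as np


def align_fields(fields):
    """Return a copy of `fields` with explicit padding fields inserted.

    Hypothetical helper: primitive fields only, no nested structs.
    """
    result = []
    offset = 0     # cumulative byte offset of the next field
    max_align = 1  # largest primitive size seen so far
    pad_id = 0

    for field in fields:
        name, kind = field[0], np.dtype(field[1])
        shape = field[2] if len(field) > 2 else ()
        if isinstance(shape, int):
            shape = (shape,)
        # Zero-length arrays get one element so the field has a unique address.
        shape = tuple(max(1, n) for n in shape)

        size = kind.itemsize
        max_align = max(max_align, size)

        # Pad up to the next multiple of the primitive's size.
        pad = (-offset) % size
        if pad:
            result.append((f"padding_{pad_id}", np.uint8, (pad,)))
            pad_id += 1
            offset += pad

        count = 1
        for n in shape:
            count *= n
        result.append((name, kind, shape) if shape else (name, kind))
        offset += size * count

    # Trailing padding keeps every element of an array of structs aligned.
    pad = (-offset) % max_align
    if pad:
        result.append((f"padding_{pad_id}", np.uint8, (pad,)))
    return result


fields = [("A", np.float64), ("B", np.uint8), ("C", np.float64)]
packed = np.dtype(fields)                 # fields packed back to back
aligned = np.dtype(align_fields(fields))  # 7 padding bytes between B and C
```

For the A/B/C example, `packed` places C at byte offset 9, while `aligned` places it at offset 16. NumPy's own `np.dtype(fields, align=True)` produces the same C-style padding for this case, though it does not resize zero-length arrays.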

@braxtoncuneo braxtoncuneo marked this pull request as ready for review May 7, 2024 02:12
@braxtoncuneo
Collaborator Author

No rush. Just wanted to unblock it from my end, since Kayla gave the go-ahead and nobody in the slack seemed opposed.

@ilhamv
Member

ilhamv commented May 7, 2024

It looks like some of the components of the global state struct have mismatching dimensions compared to the input deck's data. This is not an issue with GPU, but something pre-existing. Still, I wanted to bring it up since I've added in code that checks and reports some of these mismatches.

That is because some information, including how it is presented/structured, is relevant for the input interface, while others are relevant only in the simulation global state, and vice versa. The reconciliation particularly happens in prepare() in main.py.

@ilhamv
Member

ilhamv commented May 7, 2024

Do we set a GitHub workflow to do the GPU regression test in this PR? If not, or not possible, what is the plan? @braxtoncuneo @jpmorgan98

)
elif srun > 1:
os.system(
"srun -n %i python input.py --mode=%s --output=output --no-progress-bar > tmp 2>&1"
% (srun, mode)
"srun -n %i python input.py --mode=%s --target=%s --output=output --no-progress-bar > tmp 2>&1"
Member

This is a really good idea!

@jpmorgan98
Collaborator

I am setting up a github local runner on the CEMeNT dev machine we have at OSU. I might need admin privileges to get the host installed which will slow me down a bit but I don't think OSU COE IT should have too much of a problem helping me out. From there I think we can run whatever we want (CPU and Nvidia GPU runs) directly from the Github page.

I was thinking we could do some light performance testing per PR to make sure that a given PR won't slow down the code for GPUs or CPUs too much.

type_roster = {}


def copy_fn_for(kind, name):
Member

Do we use/need type_roster and copy_fn_for? @braxtoncuneo

@jpmorgan98
Collaborator

I am setting up a github local runner on the CEMeNT dev machine we have at OSU. I might need admin privileges to get the host installed which will slow me down a bit but I don't think OSU COE IT should have too much of a problem helping me out. From there I think we can run whatever we want (CPU and Nvidia GPU runs) directly from the Github page.

I was thinking we could do some light performance testing per PR to make sure that a given PR won't slow down the code for GPUs or CPUs too much.

OK, I got the runner up and going. I am going to try to get Harmonize to auto-configure with MC/DC via the install script, add the proper runner, and then add a commit to this PR.

@braxtoncuneo
Collaborator Author

Strangely, Ilham's latest commit is failing in the CEMeNT repo but passing in the fork. I'm going to run the regression tests locally to try to figure out a cause.

@jpmorgan98
Collaborator

jpmorgan98 commented May 8, 2024

GPU regression testing is waiting on #196 to be resolved on the OSU CI machine. We should be able to run regression tests manually and locally for upcoming PRs @ilhamv

@ilhamv ilhamv merged commit 55ad1a9 into CEMeNT-PSAAP:dev May 8, 2024
6 checks passed