GPU Interop #195
Conversation
Out of curiosity, regarding data alignment in type_.py, what are the types that particularly need the alignment?
@@ -765,6 +922,107 @@ def davidson(mcdc):

@njit(cache=caching)
def generate_precursor_particle(DNP, particle_idx, seed_work, prog):
If I understand this correctly, we are changing this from being wholly within the for loop on line 1062 to a separate function.
You are correct. This is being extracted out so that it may also be called in make_work on line 1104 without duplicating the logic. This way, if we change this function, the logic carries over for both CPU and GPU.
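A minimal sketch of the extraction pattern being described, with placeholder bodies; none of this is MC/DC's actual kernel logic, and cpu_precursor_loop and gpu_make_work are illustrative names:

```python
# The per-precursor logic lives in exactly one function, so the CPU loop and
# the GPU program's work factory share it rather than duplicating it.

def generate_precursor_particle(DNP, particle_idx, seed_work, prog):
    # Placeholder body: in MC/DC this samples a particle from the delayed
    # neutron precursor and banks it; here we just return a record.
    return {"precursor": DNP, "idx": particle_idx, "seed": seed_work}

def cpu_precursor_loop(precursor_bank, seed_work, prog):
    # CPU path: the plain for loop the function was extracted from.
    return [
        generate_precursor_particle(dnp, i, seed_work, prog)
        for i, dnp in enumerate(precursor_bank)
    ]

def gpu_make_work(precursor_bank, seed_work, prog):
    # GPU path: a stand-in for the make_work hook, yielding work items that
    # the asynchronous GPU program consumes.
    for i, dnp in enumerate(precursor_bank):
        yield generate_precursor_particle(dnp, i, seed_work, prog)
```

Any change to generate_precursor_particle then automatically applies to both targets.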
There are a couple of issues that I would like to bring up before this is merged:
All types need to be aligned, but whether or not something needs to be done to align them is context dependent. For the sake of alignment, structs need to be laid out assuming that the base address is divisible by the largest alignment size we care about. From there, fields are laid out in sequence, in the order they appear in the list, laying out sub-structs recursively. By default, Numba packs all fields next to each other, with no additional alignment considerations.

An example of a case where padding is needed is an 8-byte field (A), followed by a 1-byte field (B), followed by an 8-byte field (C). This is how Numba would lay it out in memory (each letter representing a byte):

AAAAAAAABCCCCCCCC

This seems sensible, but then you notice that A and C cannot both be aligned to a base address divisible by 8. To ensure both are aligned, some padding must be provided (shown here as dashes):

AAAAAAAAB-------CCCCCCCC

Padding like this (though of differing amounts) would be necessary for any combination of A and C with sizes greater than 1 byte. Technically speaking, 1-byte types could be considered types that "don't care about alignment", but it would be more accurate to say that it is impossible to make them not aligned, since all addresses are divisible by 1.
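As a concrete illustration of the padding rule described above, here is a hedged sketch in plain NumPy; pad_fields and its pad-to-own-size rule are illustrative assumptions, not MC/DC's actual align function:

```python
import numpy as np

# Illustrative helper: pad each field so its offset is divisible by its own
# size, assuming the struct's base address is aligned to the largest field.
def pad_fields(fields):
    padded = []
    offset = 0
    pad_id = 0
    for name, dtype in fields:
        size = np.dtype(dtype).itemsize
        gap = (-offset) % size  # bytes needed to reach the next multiple of size
        if gap:
            padded.append(("pad_%d" % pad_id, np.uint8, (gap,)))
            pad_id += 1
            offset += gap
        padded.append((name, dtype))
        offset += size
    return padded

# The A/B/C example above: 8-byte A, 1-byte B, 8-byte C.
fields = [("A", np.float64), ("B", np.uint8), ("C", np.float64)]
packed = np.dtype(fields)               # Numba-style packing: C lands at offset 9
aligned = np.dtype(pad_fields(fields))  # 7 pad bytes push C to offset 16
print(packed.fields["C"][1], aligned.fields["C"][1])  # -> 9 16
```

For comparison, np.dtype(fields, align=True) asks NumPy itself to insert padding, though the layout it chooses may not be the one the GPU side expects.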
No rush. Just wanted to unblock it from my end, since Kayla gave the go-ahead and nobody in the Slack seemed opposed.
That is because some information, including how it is presented/structured, is relevant only for the input interface, while other information is relevant only in the simulation global state, and vice versa. The reconciliation particularly happens in
Do we set up a GitHub workflow to do the GPU regression test in this PR? If not, or if that is not possible, what is the plan? @braxtoncuneo @jpmorgan98
    )
elif srun > 1:
    os.system(
-       "srun -n %i python input.py --mode=%s --output=output --no-progress-bar > tmp 2>&1"
        % (srun, mode)
+       "srun -n %i python input.py --mode=%s --target=%s --output=output --no-progress-bar > tmp 2>&1"
This is a really good idea!
I am setting up a GitHub local runner on the CEMeNT dev machine we have at OSU. I might need admin privileges to get the host installed, which will slow me down a bit, but I don't think OSU COE IT should have too much of a problem helping me out. From there, I think we can run whatever we want (CPU and Nvidia GPU runs) directly from the GitHub page. I was thinking we could do some light performance testing per PR to make sure that a given PR won't slow down the code for GPUs or CPUs too much.
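For reference, a hedged sketch of how the suggested --target flag might be threaded through the harness command quoted above; run_under_srun and the example argument values are hypothetical, not the PR's actual test script:

```python
import os

# Hypothetical wrapper mirroring the elif branch in the diff above: build the
# srun command with the --target flag included and shell it out.
def run_under_srun(srun, mode, target):
    assert srun > 1  # this path is only taken for multi-rank runs
    cmd = (
        "srun -n %i python input.py --mode=%s --target=%s "
        "--output=output --no-progress-bar > tmp 2>&1" % (srun, mode, target)
    )
    return os.system(cmd)

# e.g. run_under_srun(2, "numba", "gpu")  # illustrative argument values
```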
type_roster = {}

def copy_fn_for(kind, name):
Do we use/need type_roster and copy_fn_for? @braxtoncuneo
OK, I got the runner up and going. I am going to try to get Harmonize to auto-configure with MC/DC via the install script, add the proper runner, then add a commit to this PR.
Strangely, Ilham's latest commit is failing in the CEMeNT repo but passing in the fork. I'm going to run the regression tests locally to try to figure out a cause.
Incorporates GPU interop via Harmonize, achieved through the following changes:

- loop_source and loop_precursor_source must be decomposed into their constituent components:
  - generate_source_particle: does as it says, separated out so that it may be called separately within the make_work function of the GPU program
  - step_particle: essentially the body of the while loop in loop_particle, separated out so that the looping functionality can be alternately handled as async calls
  - loop_particle: handles only the calls to the particle setup/teardown logic, as well as iteratively calling the step_particle functionality
  - exhaust_active_bank: as it says, loops particles from the active bank until it is exhausted
  - source_closeout: handles closeout of the loop_source function. This functionality is called only once by the GPU analog to loop_source
  - source_dd_resolution: resolves domain decomposition
  - loop_source: as it was, but with components separated
  - gpu_sources_spec: generates a Harmonize program specification analogous to the exhaust_active_bank function
  - gpu_loop_source: an alternate version of loop_source that uses the GPU program generated by gpu_sources_spec to perform transport on GPU
  - generate_precursor_particle: analogous to generate_source_particle, but for the loop_precursor_particle functionality
  - source_precursor_closeout: analogous to source_closeout, but for loop_precursor_particle
  - loop_source_precursor: as it was, but with components separated
  - gpu_precursor_spec: analogous to gpu_sources_spec, but for loop_precursor_particle
  - gpu_loop_spec: analogous to gpu_loop_source, but for loop_precursor_particle
- An align function was added, which ensures the field lists fed to np.dtype fulfill the alignment requirements discussed above.
- for_cpu, for_gpu, and toggle decorators were created. The first two register the decorated functions for either CPU or GPU jit targets in Numba. The last one replaces a function with an empty-bodied stand-in if the input toggle is set to False, thus avoiding compilation issues for functions that will only be called on CPU but which use functionality that would be an error for the GPU jit compiler (see the sketch after this list).
- add_active and add_census were defined.
- local_particle and local_group_array are used to make local structs/arrays.
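As a rough illustration of the toggle decorator described above, here is a minimal sketch; the decorator body, the running_on_cpu flag, and the example function are assumptions, not the PR's actual implementation:

```python
import functools

def toggle(flag):
    """Keep the decorated function if flag is True; otherwise swap in an
    empty-bodied stand-in so the GPU jit compiler never sees its body."""
    def decorator(func):
        if flag:
            return func

        @functools.wraps(func)
        def stub(*args, **kwargs):
            pass  # intentionally does nothing

        return stub

    return decorator

# Hypothetical usage: progress-bar printing is CPU-only functionality that
# would be an error under the GPU jit target, so it is compiled away there.
running_on_cpu = True

@toggle(running_on_cpu)
def print_progress(percent):
    print("progress: %d%%" % percent)
```

Presumably for_cpu and for_gpu are thin registration decorators in the same spirit, recording which Numba jit target each function is intended for.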