Is your feature request related to a problem? Please describe.
Large fields may not fit in device memory. A well-known solution when running out of device memory is to block the data offload and device kernels in a loop and operate on smaller chunks. However, there is currently no way to partially offload Field API fields onto the device without allocating the entire field on the device. This makes it impossible to block device offloads when using Field API without introducing new fields for the blocks, which leads to unnecessary allocations.
Describe the solution you'd like
I propose that we introduce partial offloading of fields to the device. This will allow blocking of device kernels without running out of device memory for large fields. An implementation of such blocking using OpenACC and Fortran arrays has been demonstrated to work on dwarf-p-cloudsc, with both increased performance and the ability to handle much larger problem sizes than before without running out of device memory (see more below).
I suggest a series of three pull requests to introduce partial offloading of fields to Field API:
PR 1: Make the lower bounds of the DEVPTR member of fields always start from one, instead of inheriting the lower and upper bounds of the HST pointer member as they do today. This change is a first step towards decoupling the bounds of the device pointer from the bounds of the host pointer. Note that it does not change the user-facing API.
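As a purely illustrative sketch of this change (the array names and bounds below are hypothetical, not actual Field API code), consider a field wrapping host data with a non-default lower bound:

REAL(KIND=JPRB), POINTER :: HST(:,:), DEVPTR(:,:)
! Suppose the field wraps host data with bounds HST(0:NPROMA-1, 1:NGPBLKS).
! Before this change: LBOUND(DEVPTR) would be [0, 1], inherited from HST.
! After this change:  LBOUND(DEVPTR) is always [1, 1], i.e. DEVPTR(1:NPROMA, 1:NGPBLKS).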
PR 2: Introduce the ability to offload blocks of fields onto the device and allocate only the block size on the device. With the use case in mind, I think it makes sense to only allow blocking in the outermost dimension. This could be achieved by introducing an optional dummy argument, BLK_BOUNDS, to the copy methods of fields. This would be an array of two integers holding the lower and upper index of the block. An example of how the updated signature could look in the GET_DEVICE_DATA method is shown below:
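A possible shape of the extended interface is sketched below. Only the optional BLK_BOUNDS argument is part of the proposal; the argument names and the FIELD_2D, JPRB and JPIM identifiers are stand-ins for the concrete field class and kinds used by Field API.

SUBROUTINE GET_DEVICE_DATA_RDWR(SELF, PTR, BLK_BOUNDS)
  CLASS(FIELD_2D), INTENT(INOUT) :: SELF
  REAL(KIND=JPRB), POINTER, INTENT(OUT) :: PTR(:,:)
  ! Optional lower and upper index of the block in the outermost dimension
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2)
END SUBROUTINE GET_DEVICE_DATA_RDWR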
The following is a toy example of how blocked offloading and kernel invocations could look with the suggested API.
NUM_BLKS=10
DO BLK=1,10
BLK_BOUNDS = [(BLK-1)*NUM_BLKS+1, BLK*NUM_BLKS]
CALL W%GET_DEVICE_DATA_RDWR(W_DEVPTR, BLK_BOUNDS=BLK_BOUNDS)
! kernel parallelised over BLK_BOUNDS(1) to BLK_BOUNDS(2)
CALL W%SYNC_HOST_RDWR(BLK_BOUNDS=BLK_BOUNDS)
END DO
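For completeness, the kernel placeholder in the loop above could look roughly as follows. This is only a sketch: W is assumed here to be a two-dimensional field with NPROMA points per block, and the loop structure, PRESENT clause and update are purely illustrative.

!$ACC PARALLEL LOOP GANG PRESENT(W_DEVPTR)
DO JBLK = BLK_BOUNDS(1), BLK_BOUNDS(2)
  !$ACC LOOP VECTOR
  DO JL = 1, NPROMA
    ! Illustrative update that touches only the current block
    W_DEVPTR(JL, JBLK) = 2.0_JPRB * W_DEVPTR(JL, JBLK)
  END DO
END DO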
PR 3: Add support for overlapping communication and computation by asynchronous batching of partial offloads and kernel launches. In order to use asynchronous data offloads with multiple blocks of the same field, we need to decide how the blocks should be allocated on the device. Furthermore, in order to avoid race conditions when creating device data, I suggest that we expose a CREATE_DEVICE_DATA method to the user. Then an optional dummy argument, NQUEUES, representing the total number of streams, can be added to the copy methods.
An example of how the updated signature would look in the GET_DEVICE_DATA method is shown below:
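A possible shape of the interface extended for PR 3 is sketched below. As before, only the optional BLK_BOUNDS, QUEUE and NQUEUES arguments are part of the proposal, and the remaining names, types and kinds are illustrative stand-ins.

SUBROUTINE GET_DEVICE_DATA_RDWR(SELF, PTR, BLK_BOUNDS, QUEUE, NQUEUES)
  CLASS(FIELD_2D), INTENT(INOUT) :: SELF
  REAL(KIND=JPRB), POINTER, INTENT(OUT) :: PTR(:,:)
  ! Optional lower and upper index of the block in the outermost dimension
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2)
  ! Optional asynchronous queue (stream) on which to perform the transfer
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE
  ! Optional total number of queues used for the field
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: NQUEUES
END SUBROUTINE GET_DEVICE_DATA_RDWR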
The following is a toy example with 2 streams that shows how blocked offloading and kernel invocations could look with the suggested API.
NUM_BLKS=10
NQUEUES=2
DO BLK=1,10
QUEUE=MOD(BLK, NQUEUES)
BLK_BOUNDS = [(BLK-1)*NUM_BLKS+1, BLK*NUM_BLKS]
CALL W%GET_DEVICE_DATA_RDWR(W_DEVPTR, BLK_BOUNDS=BLK_BOUNDS, QUEUE=QUEUE, NQUEUES=NQUEUES)
! asynchronous kernel parallelised over BLK_BOUNDS(1) to BLK_BOUNDS(2) on stream QUEUE
CALL W%SYNC_HOST_RDWR(BLK_BOUNDS=BLK_BOUNDS, QUEUE=QUEUE, NQUEUES=NQUEUES)
END DO
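The proposed CREATE_DEVICE_DATA method would be called once before entering the loop above, so that the asynchronous per-block updates do not race on the device allocation. A usage sketch is shown below; the empty argument list and the final OpenACC wait directive are illustrative assumptions, since the proposal leaves the exact device layout of the blocks open.

CALL W%CREATE_DEVICE_DATA()   ! allocate device storage up front, before any asynchronous updates
! ... blocked, asynchronous loop as above ...
!$ACC WAIT                    ! synchronise all queues before the data is used on the host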
The suggested modifications would not affect existing code that uses the public API, but would still add the functionality required to do blocked offloading of fields.
Describe alternatives you've considered
No response
Additional context
Example from dwarf-p-cloudsc
The concept of a double-blocked driver loop has already been investigated using dwarf-p-cloudsc, a mini-application based on parts of the cloud microphysics scheme from the IFS. The implementation was done using OpenACC for offloading Fortran arrays, and it was shown to maintain good scaling with increased problem sizes and even achieve better performance than the non-blocked original version. The figure below shows the results from measurements done on an NVIDIA A100 on ECMWF's ATOS supercomputer. Note that the non-blocked versions SCC and SCC-k-caching are unable to execute with problem sizes larger than ~360K and ~500K grid points respectively, whilst the blocked versions SCC-BLK and SCC-dblk-k-caching are able to process much larger problem sizes.
Figure: Performance measurement of blocked and non-blocked driver loops on an NVIDIA A100 GPU. (Higher is better.)
Organisation
ECMWF