Is your feature request related to a problem? Please describe.
Large fields may not fit in device memory. A well-known solution when running out of device memory is to block the data offload and device kernels in a loop and operate on smaller chunks. However, there is currently no way to partially offload Field API fields onto the device without allocating the entire field on the device. This makes it impossible to block device offloads when using Field API without introducing new fields for the blocks, which leads to unnecessary allocations.
Describe the solution you'd like
I propose that we introduce partial offloading of fields to the device. This will allow blocking of device kernels without running out of device memory for large fields. An implementation of such blocking using OpenACC and Fortran arrays has been demonstrated to work on dwarf-p-cloudsc, with both increased performance and the ability to handle much larger problem sizes than before without running out of device memory (see more below).
I suggest a series of three pull requests to introduce partial offloading of fields to Field API:
PR 1: Make the lower bounds of the DEVPTR member of fields always start from one, instead of inheriting the lower and upper bounds of the HST pointer member as they do today. This change is a first step towards decoupling the bounds of the device pointer from the bounds of the host pointer. Note that it does not change the user-facing API.
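As a purely illustrative sketch of this change (the array names and bounds below are hypothetical, not actual Field API code), consider a field wrapping host data with a non-default lower bound:

REAL(KIND=JPRB), POINTER :: HST(:,:), DEVPTR(:,:)
! Suppose the field wraps host data with bounds HST(0:NPROMA-1, 1:NGPBLKS).
! Before this change: LBOUND(DEVPTR) would be [0, 1], inherited from HST.
! After this change:  LBOUND(DEVPTR) is always [1, 1], i.e. DEVPTR(1:NPROMA, 1:NGPBLKS).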
PR 2: Introduce the ability to offload blocks of fields onto the device and allocate only the block size on the device. With the use case in mind, I think it makes sense to only allow blocking in the outermost dimension. This could be achieved by introducing an optional dummy argument, BLK_BOUNDS, to the copy methods of fields. This would be an array of two integers holding the lower and upper index of the block. An example of how the updated signature could look in the GET_DEVICE_DATA method is shown below:
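A possible shape of the extended interface is sketched below. Only the optional BLK_BOUNDS argument is part of the proposal; the argument names and the FIELD_2D, JPRB and JPIM identifiers are stand-ins for the concrete field class and kinds used by Field API.

SUBROUTINE GET_DEVICE_DATA_RDWR(SELF, PTR, BLK_BOUNDS)
  CLASS(FIELD_2D), INTENT(INOUT) :: SELF
  REAL(KIND=JPRB), POINTER, INTENT(OUT) :: PTR(:,:)
  ! Optional lower and upper index of the block in the outermost dimension
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2)
END SUBROUTINE GET_DEVICE_DATA_RDWR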
The following is a toy example of how blocked offloading and kernel invocations could look with the suggested API.
NUM_BLKS=10
DO BLK=1,10
BLK_BOUNDS = [(BLK-1)*NUM_BLKS+1, BLK*NUM_BLKS]
CALL W%GET_DEVICE_DATA_RDWR(W_DEVPTR, BLK_BOUNDS=BLK_BOUNDS)
! kernel parallelised over BLK_BOUNDS(1) to BLK_BOUNDS(2)
CALL W%SYNC_HOST_RDWR(BLK_BOUNDS=BLK_BOUNDS)
END DO
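For completeness, the kernel placeholder in the loop above could look roughly as follows. This is only a sketch: W is assumed here to be a two-dimensional field with NPROMA points per block, and the loop structure, PRESENT clause and update are purely illustrative.

!$ACC PARALLEL LOOP GANG PRESENT(W_DEVPTR)
DO JBLK = BLK_BOUNDS(1), BLK_BOUNDS(2)
  !$ACC LOOP VECTOR
  DO JL = 1, NPROMA
    ! Illustrative update that touches only the current block
    W_DEVPTR(JL, JBLK) = 2.0_JPRB * W_DEVPTR(JL, JBLK)
  END DO
END DO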
PR 3: Add support for overlapping communication and computation by asynchronous batching of partial offloads and kernel launches. In order to use asynchronous data offloads with multiple blocks of the same field, we need to decide how the blocks should be allocated on the device. Furthermore, in order to avoid race conditions when creating device data, I suggest that we expose a CREATE_DEVICE_DATA method to the user. Then an optional dummy argument, NQUEUES, representing the total number of streams, can be added to the copy methods.
An example of how the updated signature would look in the GET_DEVICE_DATA method is shown below:
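A possible shape of the interface extended for PR 3 is sketched below. As before, only the optional BLK_BOUNDS, QUEUE and NQUEUES arguments are part of the proposal, and the remaining names, types and kinds are illustrative stand-ins.

SUBROUTINE GET_DEVICE_DATA_RDWR(SELF, PTR, BLK_BOUNDS, QUEUE, NQUEUES)
  CLASS(FIELD_2D), INTENT(INOUT) :: SELF
  REAL(KIND=JPRB), POINTER, INTENT(OUT) :: PTR(:,:)
  ! Optional lower and upper index of the block in the outermost dimension
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2)
  ! Optional asynchronous queue (stream) on which to perform the transfer
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE
  ! Optional total number of queues used for the field
  INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: NQUEUES
END SUBROUTINE GET_DEVICE_DATA_RDWR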
The following is a toy example with 2 streams that shows how blocked offloading and kernel invocations could look with the suggested API.
NUM_BLKS=10
NQUEUES=2
DO BLK=1,10
QUEUE=MOD(BLK, NQUEUES)
BLK_BOUNDS = [(BLK-1)*NUM_BLKS+1, BLK*NUM_BLKS]
CALL W%GET_DEVICE_DATA_RDWR(W_DEVPTR, BLK_BOUNDS=BLK_BOUNDS, QUEUE=QUEUE, NQUEUES=NQUEUES)
! asynchronous kernel parallelised over BLK_BOUNDS(1) to BLK_BOUNDS(2) on stream QUEUE
CALL W%SYNC_HOST_RDWR(BLK_BOUNDS=BLK_BOUNDS, QUEUE=QUEUE, NQUEUES=NQUEUES)
END DO
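The proposed CREATE_DEVICE_DATA method would be called once before entering the loop above, so that the asynchronous per-block updates do not race on the device allocation. A usage sketch is shown below; the empty argument list and the final OpenACC wait directive are illustrative assumptions, since the proposal leaves the exact device layout of the blocks open.

CALL W%CREATE_DEVICE_DATA()   ! allocate device storage up front, before any asynchronous updates
! ... blocked, asynchronous loop as above ...
!$ACC WAIT                    ! synchronise all queues before the data is used on the host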
The suggested modifications would not affect existing code that uses the public API, but would still add the functionality required to do blocked offloading of fields.
Describe alternatives you've considered
No response
Additional context
Example from dwarf-p-cloudsc
The concept of a double-blocked driver loop has already been investigated using dwarf-p-cloudsc, a mini-application based on parts of the cloud microphysics scheme from the IFS. The implementation was done using OpenACC for offloading Fortran arrays, and it was shown to maintain good scaling with increased problem sizes and even achieve better performance than the non-blocked original version. The figure below shows the results from measurements done on an NVIDIA A100 on ECMWF's ATOS supercomputer. Note that the non-blocked versions SCC and SCC-k-caching are unable to execute with problem sizes larger than ~360K and ~500K grid points respectively, whilst the blocked versions SCC-BLK and SCC-dblk-k-caching are able to process much larger problem sizes.
Figure: Performance measurement of blocked and non-blocked driver loops on an NVIDIA A100 GPU. (Higher is better.)
Organisation
ECMWF