Allow specification for GPU device index #96

Merged Mar 28, 2024 (43 commits)

jwallwork23
Contributor

Closes #85.

The main change associated with this PR is allowing the GPU device index to be specified for the following functions and subroutines:

  • torch_zeros (C++) / torch_tensor_zeros (Fortran)
  • torch_ones (C++) / torch_tensor_ones (Fortran)
  • torch_empty (C++)
  • torch_from_blob (C++) / torch_tensor_from_blob (Fortran)
  • torch_jit_load (C++) / torch_module_load (Fortran)
  • torch_tensor_from_array_${PREC}$_${RANK}$d (Fortran)

To avoid confusion/ambiguity, device is replaced by device_type in several places in the code, as device_type and device_index are consistent with the naming used in CUDA.

The GPU device index is specified using an additional argument, although this is made optional both in C++ and Fortran to ensure that the examples can be run without modification. In the case of torch_jit_load / torch_module_load, the device_type also needed to be added as an optional argument to support the new functionality.

If unset:

  • device_type defaults to torch_kCPU
  • device_index defaults to -1 if device_type is torch_kCPU and 0 if device_type is torch_kGPU.
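
The defaulting rules above can be sketched in a few lines of C++. This is an illustration only, not the code from this PR: `DeviceType`, `Device`, and `resolve_device` are hypothetical names, and treating -1 as the "unset" sentinel for the index is an assumption.

```cpp
// Hypothetical stand-ins for the library's device type constants.
enum DeviceType { kCPU = 0, kGPU = 1 };

// Hypothetical pair of device type and device index.
struct Device {
  DeviceType type;
  int index;
};

// Resolve the optional arguments to concrete values:
//   - an omitted device type falls back to the CPU;
//   - an omitted index (assumed sentinel: -1) stays -1 on the CPU
//     and becomes 0 on the GPU.
// Validation of explicitly passed indices is out of scope here.
Device resolve_device(DeviceType type = kCPU, int index = -1) {
  if (type == kGPU && index == -1) {
    index = 0;
  }
  return Device{type, index};
}
```

With this sketch, `resolve_device()` gives the CPU with index -1, `resolve_device(kGPU)` gives GPU 0, and `resolve_device(kGPU, 3)` keeps the explicit index 3.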

New functions called torch_tensor_get_device_index are introduced so that we can test the new functionality.

@jwallwork23 jwallwork23 added the enhancement New feature or request label Mar 22, 2024
@jwallwork23 jwallwork23 self-assigned this Mar 22, 2024
@jwallwork23
Contributor Author

Here is the test that I used:

program test_device_index

! Import precision info from iso
use, intrinsic :: iso_fortran_env, only : sp => real32

! Import our library for interfacing with PyTorch
use ftorch

! Import MPI
use mpi

implicit none

! Set precision for reals
integer, parameter :: wp = sp

! Set up Fortran data structures
real(wp), dimension(5), target :: in_data
integer :: tensor_layout(1) = [1]

! Set up Torch data structures
type(torch_tensor) :: in_tensor
integer :: device_type
integer :: device_index

! MPI configuration
integer :: rank, ierr

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, rank, ierr)

! Initialise data
in_data = [0.0, 1.0, 2.0, 3.0, 4.0]

! Loop over device type torch_kCPU and torch_kGPU
do device_type = 0, 1
  if (device_type == torch_kCPU) then
    device_index = - 1
  else
    device_index = rank
  end if

  ! Create Torch input tensor from the above arrays
  in_tensor = torch_tensor_from_array(in_data, tensor_layout, device_type, device_index)

  ! Print some information
  if (torch_tensor_get_device_index(in_tensor) == device_index) then
    write(*, *) rank, "PASS"
  else
    write(*, *) rank, "expected index ", device_index, "got ", torch_tensor_get_device_index(in_tensor)
  end if

  ! Cleanup
  call torch_tensor_delete(in_tensor)
end do
call mpi_finalize(ierr)

end program test_device_index

If run on my laptop (CPU-only), I get the output

           0 PASS
           1 PASS
           2 PASS
           3 PASS
[ERROR]: invalid device index 0 for device count [ERROR]: invalid device index 1 for device count 0, using zero instead
[ERROR]: invalid device index 2 for device count 0, using zero instead
[ERROR]: invalid device index 3 for device count 0, using zero instead
0, using zero instead
[ERROR]: PyTorch is not linked with support for cuda devices
[ERROR]: PyTorch is not linked with support for cuda devices
[ERROR]: PyTorch is not linked with support for cuda devices
[ERROR]: PyTorch is not linked with support for cuda devices

which confirms that the CPU case works, but obviously the GPU case isn't going to work.
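
The "invalid device index ... using zero instead" messages suggest a range check with a fallback to index 0. A minimal sketch of such a check follows; it is inferred from the log text alone, not taken from the library. `checked_device_index` is a hypothetical name, and treating negative indices as invalid is an assumption.

```cpp
#include <iostream>

// Hypothetical helper: return the requested index if it names an existing
// device, otherwise report the problem on stderr and fall back to index 0.
// Rejecting negative indices here is an assumption, not shown in the log.
int checked_device_index(int requested, int device_count) {
  if (requested < 0 || requested >= device_count) {
    std::cerr << "[ERROR]: invalid device index " << requested
              << " for device count " << device_count
              << ", using zero instead" << std::endl;
    return 0;
  }
  return requested;
}
```

On a CPU-only machine the device count is 0, so every requested index fails this check. The interleaved lines in the output above are consistent with four MPI processes writing to the same terminal at once.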

If I run on Wilkes3 with four GPUs and four MPI processes, I get the output

           3 PASS
           2 PASS
           0 PASS
           1 PASS
           0 PASS
           1 PASS
           3 PASS
           2 PASS

which confirms that the GPU case works, too.
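
The test assigns one GPU per MPI rank by setting `device_index = rank`. A sketch of that mapping in C++ is given below, with one addition that is not part of the test: wrapping with a modulo so that the mapping stays in range when there are more ranks than GPUs. `device_index_for_rank` and `gpu_count` are hypothetical names.

```cpp
// Hypothetical stand-ins for the library's device type constants.
enum DeviceType { kCPU = 0, kGPU = 1 };

// Choose a device index for a given MPI rank.
// On the CPU the index is -1, as in the test above.
// On the GPU the rank is wrapped by the number of GPUs; when there are at
// least as many GPUs as ranks this reduces to device_index = rank.
// The modulo wrap is an assumption added for illustration.
int device_index_for_rank(DeviceType type, int rank, int gpu_count) {
  if (type == kCPU || gpu_count <= 0) {
    return -1;  // no GPU to select (also avoids dividing by zero)
  }
  return rank % gpu_count;
}
```

For the four-rank, four-GPU run above, this gives indices 0 to 3, matching the ranks that report PASS.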

Member

@jatkinson1000 jatkinson1000 left a comment

Amazing!
At a quick glance this looks good - I'll do a detailed review later when I have some time.
One quick comment before then - can you provide some simple instructions on how I can check/verify this is working on CSD3/elsewhere?

We will probably want an example adding to the examples/ and some info adding to the docs once the code is settled before it goes in.

This was referenced Mar 22, 2024
@jwallwork23
Contributor Author

> One quick comment before then - can you provide some simple instructions on how I can check/verify this is working on CSD3/elsewhere?

Sure. I created a new branch to demonstrate the testing: 85_gpu_device_number_test. Would you like me to include the Slurm scripts, too?

@jwallwork23 jwallwork23 force-pushed the 85_gpu_device_number branch from afa93d8 to 188b305 Compare March 25, 2024 12:31
@jwallwork23 jwallwork23 mentioned this pull request Mar 26, 2024
7 tasks
@jwallwork23
Contributor Author

Okay, this is ready for re-review! I added some docs and managed to get example 3 working on Wilkes3, giving the following output for 2 GPUs:

input on rank0: [  0.0,  1.0,  2.0,  3.0,  4.0]
input on rank1: [  1.0,  2.0,  3.0,  4.0,  5.0]
output on rank1: [  2.0,  4.0,  6.0,  8.0, 10.0]
output on rank0: [  0.0,  2.0,  4.0,  6.0,  8.0]

Will test it for 4 GPUs, too, but don't anticipate any issues.
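
The printed values follow a simple pattern: the input on rank r is [0, 1, 2, 3, 4] shifted by r, and each output is twice its input. The sketch below reproduces that arithmetic so results from other rank counts can be checked by hand. The doubling is inferred from the output above, not from the model source, and `rank_input` and `expected_output` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Input vector for a given rank: [0, 1, 2, 3, 4] with the rank added
// to every element, as seen in the "input on rank" lines.
std::vector<float> rank_input(int rank) {
  std::vector<float> values(5);
  for (int i = 0; i < 5; ++i) {
    values[i] = static_cast<float>(i + rank);
  }
  return values;
}

// Expected output, assuming the network doubles every element
// (inferred from the printed output only).
std::vector<float> expected_output(const std::vector<float>& input) {
  std::vector<float> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    output[i] = 2.0f * input[i];
  }
  return output;
}
```

For rank 1 this gives an input of [1, 2, 3, 4, 5] and an expected output of [2, 4, 6, 8, 10], in agreement with the lines printed for rank1.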

Member

@jatkinson1000 jatkinson1000 left a comment

This is a great addition, @jwallwork23.
The new docs and example read really well.

I added a couple of points that I feel would make things clearer for me as an external reader; feel free to incorporate them or not.

Once we've resolved these I think we're good to go!

Review threads (all resolved):

  • examples/3_MultiGPU/README.md (4 threads)
  • pages/gpu.md
  • src/ctorch.cpp
@jwallwork23
Contributor Author

Thanks @jatkinson1000, this is now ready for re-review.

> Will test it for 4 GPUs, too, but don't anticipate any issues.

I can confirm that this worked (with the updated 85_gpu_device_number_test branch) on Wilkes3, giving output

input on rank1: [  1.0,  2.0,  3.0,  4.0,  5.0]
input on rank2: [  2.0,  3.0,  4.0,  5.0,  6.0]
input on rank3: [  3.0,  4.0,  5.0,  6.0,  7.0]
input on rank0: [  0.0,  1.0,  2.0,  3.0,  4.0]
output on rank1: [  2.0,  4.0,  6.0,  8.0, 10.0]
output on rank2: [  4.0,  6.0,  8.0, 10.0, 12.0]
output on rank3: [  6.0,  8.0, 10.0, 12.0, 14.0]
output on rank0: [  0.0,  2.0,  4.0,  6.0,  8.0]

Member

@jatkinson1000 jatkinson1000 left a comment

Thanks @jwallwork23. This is a great addition!

All looking good to me now so I'll squash and merge shortly.

@jatkinson1000 jatkinson1000 merged commit 0efa2ba into main Mar 28, 2024
4 checks passed
@jatkinson1000 jatkinson1000 deleted the 85_gpu_device_number branch March 28, 2024 15:08
dorchard pushed a commit that referenced this pull request Nov 15, 2024
* Have get_device use torch::Device

* Add device_number arg for get_device

* Throw error if device_number used in CPU-only case

* Disallow negative device number

* Actually use the device number

* Use device number for torch_zeros

* Use device number for torch_ones

* Use device number for torch_empty

* Use device number for torch_from_blob

* Device and device number args for torch_module_load

* Pass device and device number to torch_jit_load by value

* Make device number argument to torch_module_load optional

* Make device number argument to torch_tensor_from_array optional

* Make device number argument to other subroutines optional

* Make device argument to torch_module_load optional

* Add function for determining device_index

* Rename device number as index

* Rename device as device type

* Device index defaults to -1 on CPU and 0 on GPU

* Make device type and index optional on C++ side

* Fix typo in torch_model_load

* Fix typos in example 1

* Initial draft of example 3_MultiGPU

* Differentiate between errors and warnings in C++ code

* Formatting

* Add mpi4py to requirements for example 3

* Use mpi4py to differ inputs in simplenet_infer_python

* Raise ValueError for Python inference with invalid device

* Print rank in Python case; updates to README

* Setup MPI for simplenet_infer_fortran, too

* Write formatting for example 3

* Add note on building with Make

* Print before and after; mpi_finalise; output on CPU; comments

* Docs: device->device_type for consistency

* Add docs on MultiGPU

* Update warning text for defaulting to 0

Co-authored-by: jatkinson1000 <109271713+jatkinson1000@users.noreply.github.com>

* Mention MPI in requirements

* Update outputs for example 3

* Use NP rather than 4 GPUs

* Implement SimpleNet in example 3 but with a twist

* Add code snippets for multi-GPU doc section

* Add note about multiple GPU support to README.md.

---------

Co-authored-by: jatkinson1000 <109271713+jatkinson1000@users.noreply.github.com>
Co-authored-by: Jack Atkinson <jwa34@cam.ac.uk>
Labels: enhancement
Linked issue: User ability to decide GPU device number
2 participants