This repository contains CUDA exercises for CERN Openlab's GPU lecture. There are two methods to work on these exercises:
- Find a computer with a GPU. At CERN, you can e.g. use
  ```sh
  ssh -X lxplus-gpu.cern.ch
  ```
  for access to shared GPUs.
- Clone the following repository:
  ```sh
  git clone https://github.com/hageboeck/OpenlabLecture.git
  cd OpenlabLecture/source
  ```
- Use a terminal-based editor such as vim, nano, or emacs to edit the files, or try a graphical editor such as geany if you have an X client on your computer.
- To compile the executables:
  - Try it manually using
    ```sh
    nvcc -O2 -g -std=c++17 <filename>.cu -o <executable>
    ```
  - Use the Makefile, e.g.
    ```sh
    make helloWorld
    ```
    for only one executable, or
    ```sh
    make
    ```
    to compile all in one go.
- If you don't have a CERNBox account yet, go to cernbox.cern.ch.
- Go to swan.cern.ch, choose JupyterLab, LCG105 CUDA, and start the session.
- On the left, find the button to clone a git repository, and clone:
  ```
  https://github.com/hageboeck/OpenlabLecture.git
  ```
- Now you have two choices:
  - Write CUDA in notebooks:
    - Use the FirstSteps notebook for the first two tasks.
    - Finally, go to the Julia notebook for the third task.
  - Go to `source/` and work directly with the files. Open a terminal to compile and run the programs.
## helloWorld.cu
Here we have a very basic helloWorld program that prints a message from the host.
Your tasks:
- Convert the HelloWorld function to a kernel, and call it from `main()`.
- In the kernel, fill in the variables that print the thread index and block index.
- Try a few launch configurations with more threads / more blocks.
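The converted program could look roughly like the sketch below. It is only an illustration of the ingredients the task asks for (the `__global__` qualifier, the built-in `threadIdx`/`blockIdx` variables, and a `<<<blocks, threads>>>` launch); the actual function name and message in `helloWorld.cu` may differ.

```cuda
#include <cstdio>

// Kernel: runs on the device, one instance per thread.
__global__ void helloWorld() {
  printf("Hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
  helloWorld<<<2, 4>>>();   // launch configuration: 2 blocks of 4 threads each
  cudaDeviceSynchronize();  // wait for the kernel (and its printf output) to finish
  return 0;
}
```

Varying the two numbers in the `<<<...>>>` launch configuration changes how many blocks and threads per block print a message.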
## vectorAdd.cu
In this example, we add two arrays on the host and on the device. We use timers to measure the execution speed.
Your tasks:
- Implement an efficient grid-strided loop. Currently, every thread steps through every item.
- Find an efficient launch configuration to fully use the device.
## julia.cu
We compute the Julia and Fatou sets in the complex plane. This requires evaluating a quadratic complex polynomial for more than a million pixels in the complex plane. This is a perfect job for a GPU.
Your tasks:
- In the `main()` function, allocate memory for all pixels. The writePPM function expects an array of pixels in the form `{y0={x0 x1 x2 x3 ...}, y1={x0 x1 x2 x3 ...}, ... y_n}`, so allocate a one-dimensional array with enough memory for `x*y` pixels. There's already a section that checks for possible CUDA errors, so allocate the memory just before that section. Don't forget to free the memory when you're done.
- Launch the draft kernel from the main function. Check for possible errors.
- Figure out a way to compute the pixel indices `i` and `j` from `threadIdx.x` and `blockIdx.x`. Find a kernel launch configuration that covers the entire image.
- Implement the computation `z = z^2 + c`.
  - We will not use any external complex number classes for this, so square the complex number by hand.
  - Plug the computation into the loop. Check that it runs
    - for a maximum of `maxIter` times
    - or until z starts to diverge (`|z| >= maxMagnitude`).
- Record at which iteration z diverged in the pixel array. There's already a line of code that should take care of this, but ensure that your iteration counter ends up in that array.
- Note: We use 256 colours to colour the resulting image. We scale `k` into the range `[1, 256]` for best contrast, but it's not strictly necessary.
- Check if you can generate a Julia image like this example
You can set the real and imaginary part of `c` as command-line arguments:
```sh
./julia <realPart> <imaginaryPart>
```
Try for example:
```sh
./julia -0.4 0.6
./julia 0.285 -0.01
```
To display the image, we have two options:
- The image gets exported as png if boost GIL and libpng are available. The Makefile is not very smart in detecting those, so you might have to improvise a bit.
- The image is also exported as PPM, a very simple text-based image format. You can use imagemagick's `display` to view it:
  ```sh
  display julia.ppm
  ```
  If `display` doesn't work, check that you have an ssh connection with X forwarding (`ssh -X ...`). You can also use `convert` to convert the PPM to something else.
The draft Julia kernel uses double precision for the complex numbers. Check whether single-precision floating-point numbers yield satisfying results, and check the impact on the kernel execution time. How fast can we go? (Note: Check what type of GPU the kernel is running on. Is it a data-centre-grade GPU?)
There is a naive CPU kernel called `juliaCPU` in `julia.h` that can be used as a drop-in replacement for the GPU kernel. Check its speed. How much speed-up over a single CPU thread can you reach with the GPU?
You might have written a kernel where `i = threadIdx.x`. While this is sufficient for our problem size, the maximum number of threads per block is 1024, so your kernel might not be able to deal with larger images. Remember that you can use the grid-strided loop to process an array of arbitrary size. Try using a grid-strided loop on an image of 2048x2048 pixels. You can use a linearised index from 0 to 2048x2048, and compute `i` and `j` using modulus and integer division.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.