
Additional SYCL USM (device pointer explicit copy) and CUDA tuning for DOT #122

Closed
wants to merge 7 commits

Conversation

@lfmeadow commented Feb 16, 2022

  1. SYCL performance on NVIDIA A100 is currently 2-3% worse than native CUDA. Inspection of the PTX generated by SYCL shows extra parameters and instructions due to accessors and buffers. I added syclusm, which uses malloc_device and explicit queue->memcpy (see the sketch below); this reduces the performance gap to around 1%, with the worst case being Dot at 1.19%.
  2. The CUDA implementation of Dot is not optimal; the number of thread blocks should be a multiple of the number of SMs (note that the SYCL versions already do this). I picked the same number as SYCL (4 * the number of SMs) and CUDA Dot performance improves by 7.93%.

Next stop is AMD :)
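
For reference, here's a minimal sketch of the device-USM pattern from item 1 (SYCL 2020 assumed; the names h_a/d_a and the default-constructed queue are illustrative, not the actual syclusm code):

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
  constexpr size_t array_size = 1 << 20;
  sycl::queue queue;

  // Device-resident allocation: kernels see a raw pointer, so no buffer/accessor
  // machinery (and no extra kernel parameters) is generated.
  double *d_a = sycl::malloc_device<double>(array_size, queue);

  std::vector<double> h_a(array_size, 1.0);
  // Explicit host-to-device copy, analogous to cudaMemcpy.
  queue.memcpy(d_a, h_a.data(), array_size * sizeof(double)).wait();

  queue.parallel_for(sycl::range<1>{array_size}, [=](sycl::id<1> i)
  {
    d_a[i[0]] *= 2.0;
  }).wait();

  // Explicit device-to-host copy to read results back.
  queue.memcpy(h_a.data(), d_a, array_size * sizeof(double)).wait();
  sycl::free(d_a, queue);
  return 0;
}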

@tomdeakin (Contributor)

Thanks for this @lfmeadow - it's really interesting that the buffers/accessors add a small overhead. I'm a bit worried about that!

Did you try the SYCL 2020 reduction API with USM for the dot kernel? I'm asking because I'd like to use that API instead of the 1.2.1-style manual implementation if we can.
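
(For concreteness, what I have in mind is roughly the free-function sketch below; it's untested, and the pointer names and group/size parameters are placeholders, not the BabelStream members.)

#include <sycl/sycl.hpp>
#include <cstddef>
#include <functional>

// Sketch: dot product via the SYCL 2020 reduction API over USM device pointers.
template <class T>
T dot_usm(sycl::queue &q, const T *a, const T *b, size_t N,
          size_t num_groups, size_t wgsize)
{
  // One-element USM allocation to receive the reduced result.
  T *sum = sycl::malloc_shared<T>(1, q);

  q.submit([&](sycl::handler &cgh)
  {
    cgh.parallel_for(
      sycl::nd_range<1>(num_groups * wgsize, wgsize),
      // initialize_to_identity zeroes *sum before the combiner runs.
      sycl::reduction(sum, std::plus<T>(),
                      sycl::property::reduction::initialize_to_identity{}),
      [=](sycl::nd_item<1> item, auto &partial)
      {
        for (size_t i = item.get_global_id(0); i < N; i += item.get_global_range(0))
          partial += a[i] * b[i];
      });
  }).wait();

  T result = *sum;   // shared USM is directly readable on the host
  sycl::free(sum, q);
  return result;
}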

@tomdeakin tomdeakin changed the base branch from main to develop February 16, 2022 14:40
@lfmeadow (Author)

I just eyeballed the PTX. I should look more carefully to see where the extra instructions are coming from. There were definitely a lot of parameters; maybe there's some dead-argument elimination to be done (which, as I recall, is not enabled for NVPTX).

I'll look at the reduction API implementation today.

@lfmeadow (Author)

I ran on AMD MI100, both "spock" at ORNL and one of the nodes at ANL JLSE.
Aside: the JLSE nodes are a tiny bit faster on this and other benchmarks. No idea why.
Initially Dot was lagging the others by quite a lot, even after adjusting the number of blocks to be a multiple of the number of CUs: specifically, 14.7% slower with HIP and 17.2% slower with SYCLUSM.
Then I reread some older AMD training slides https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf. There's a good discussion of occupancy. The GCN GPUs can have up to 40 active wavefronts per CU (10 per SIMD unit). Thus one condition to maximize occupancy is that the number of (64-lane) waves in a workgroup must divide 40. Both HIP and SYCL were using 1024. I tried 512, not much help, and then 256, and voila! Dot BW is 2.34% less than the lowest non-reduction BW for HIP.
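
In code terms the tuning amounts to something like the sketch below (hand-tuned values; the variable names are illustrative, not the actual stream members):

sycl::queue q;   // illustrative; the real code holds a queue pointer
size_t num_cus = q.get_device().get_info<sycl::info::device::max_compute_units>();

// Keep the grid a multiple of the CU count so whole work-groups spread evenly,
size_t dot_num_groups = 4 * num_cus;
// and use 256 work-items (= 4 x 64-lane wavefronts) per work-group, which divides
// the 40-wavefront-per-CU occupancy limit on GCN; 1024 (16 waves) does not.
size_t dot_wgsize = 256;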

The SYCLUSM penalty vs. HIP is similar to that vs. CUDA: under 1%, except for Dot, which is 3.08% worse.

I'll try the SYCL reduction next.

Here are the numbers for HIP and SYCLUSM on Spock. HIP uses a TBSIZE of 256 for everything (it only affects Dot). SYCLUSM uses 256 only for Dot; the others use the default, which turns out to be 1024.
(pretty funny, I cut and pasted from O365 excel, it comes out as HTML)

hip

| function | num_times | n_elements | sizeof | max_mbytes_per_sec | min_runtime | max_runtime | avg_runtime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Copy | 1000 | 33554432 | 8 | 1.01E+06 | 0.000532927 | 0.000634115 | 0.000534976 |
| Mul | 1000 | 33554432 | 8 | 1.01E+06 | 0.000531393 | 0.00058859 | 0.000533475 |
| Add | 1000 | 33554432 | 8 | 986756 | 0.000816115 | 0.000865828 | 0.000819944 |
| Triad | 1000 | 33554432 | 8 | 986670 | 0.000816186 | 0.000867171 | 0.000819913 |
| Dot | 1000 | 33554432 | 8 | 963618 | 0.000557141 | 0.0005667 | 0.000561678 |

syclusm

| function | num_times | n_elements | sizeof | max_mbytes_per_sec | min_runtime | max_runtime | avg_runtime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Copy | 1000 | 33554432 | 8 | 1.00E+06 | 0.000536222 | 0.000595593 | 0.000537737 |
| Mul | 1000 | 33554432 | 8 | 1.00E+06 | 0.0005346 | 0.000582999 | 0.00053611 |
| Add | 1000 | 33554432 | 8 | 982390 | 0.000819742 | 0.000875025 | 0.000826572 |
| Triad | 1000 | 33554432 | 8 | 979027 | 0.000822558 | 0.000877611 | 0.00082778 |
| Dot | 1000 | 33554432 | 8 | 933974 | 0.000574824 | 0.00060448 | 0.000583949 |

@lfmeadow (Author)

I ran SYCL2020 on A100, just the vanilla version with no USM changes.
Dot is only 1198 GB/sec; I was getting 1290 GB/sec with the SYCL version (not USM), and 1340 with my USM version.
I have not converted SYCL2020 to USM.
I can't see how to use the SYCL2020 reduction together with specifying the number of workgroups and the workgroup size. I suppose I could specify an nd_range and then have a loop like the older version, adding to the reduction variable, but I'm not sure that the reduction machinery will honor this.
I'm asking around :)

@lfmeadow (Author) commented Feb 21, 2022

SYCL2020 with a redone dot kernel (but not USM) doesn't do quite as well as the original SYCL version on dot on A100:
1235 GB/sec vs. 1292 GB/sec, and 1339 GB/sec with the tuned SYCLUSM version.
This will require some investigation.
Here's the rewritten SYCL2020 dot function:

template <class T>
T SYCLStream<T>::dot()
{
  queue->submit([&](sycl::handler &cgh)
  {
    sycl::accessor ka {d_a, cgh, sycl::read_only};
    sycl::accessor kb {d_b, cgh, sycl::read_only};

    size_t N = array_size;
    cgh.parallel_for(
      sycl::nd_range<1>(dot_num_groups*dot_wgsize, dot_wgsize),
      // Reduction object, to perform summation - initialises the result to zero
      sycl::reduction(d_sum, cgh, std::plus<T>(), sycl::property::reduction::initialize_to_identity{}),
      [=](sycl::nd_item<1> item, auto& sum)
      {
        size_t i = item.get_global_id(0);
        size_t global_size = item.get_global_range()[0];
        for (; i < N; i += global_size)
          sum += ka[i] * kb[i];
      });
  });

  // Get access on the host, and return a copy of the data (single number).
  // This will block until the result is available, so no need to wait on the queue.
  sycl::host_accessor result {d_sum, sycl::read_only};
  return result[0];
}

@tomdeakin (Contributor)

Thanks for all of this @lfmeadow - this is great stuff.

It's good to see what values for work-group size are working well in general too.

I'm worried that the SYCL 2020 version was slower than the 1.2.1 version. There shouldn't be any major changes beyond syntactic sugar (the CTAD accessors, etc.), so it should not affect performance...

Glad to see you got the sycl::reduction working with nd_range.
Hopefully we'll be able to get the implementation of the sycl::reduction just as fast as a manual implementation of the reduction. The implementation could just use this algorithm - do you know what algorithm it's using?

I'd hope that DPC++ would be able to incorporate these heuristics without requiring programmers to use nd_range when they don't need it - @Pennycook mentioned this earlier: #83 (comment)

Hopefully there will be a feature one day where we can suggest a work-group size for range without switching to nd_range, which has a different model of parallelism.

@lfmeadow (Author) commented Feb 28, 2022

Yes, I need to revisit SYCL-2020 vs. the previous version without the USM and be a little more rigorous.

On the reduction, apparently it uses this:
https://github.com/intel/llvm/blob/8213321ebb90110bf4f3d04fa0dc8e131a464a19/libclc/ptx-nvidiacl/libspirv/group/collectives.cl#L263
I note that the subgroup reduction does not use the binary tree reduction; I suppose that isn't valid for all possible reduction operators (not sure).

On SYCL, SYCLUSM, and CUDA:
I compared the PTX code for Add from SYCL, SYCLUSM, and CUDA.
The SYCL kernel has 6 parameters vs. the 3 in SYCLUSM and CUDA.
CUDA has 14 SASS instructions, SYCLUSM has 19, and SYCL has 31.
The extra SYCLUSM instructions are due to worse codegen and also to converting the global index to 64-bit (arguably semantically necessary in SYCL but not in CUDA).
There are 3 extra SYCL instructions per accessor: ld.param, shift, and add.s64.

If the kernels were fatter, the overhead wouldn't be so bad; picking a smaller number of grid blocks (like we did for Dot) and looping over the elements in the kernel would probably measurably decrease the overhead (see the sketch below).
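
For example, the Add kernel could use a fixed nd_range and a grid-stride loop, roughly like this sketch (the add_num_groups/add_wgsize names are assumptions, and this is not what the current implementations do for Add):

template <class T>
void SYCLStream<T>::add()
{
  queue->submit([&](sycl::handler &cgh)
  {
    sycl::accessor ka {d_a, cgh, sycl::read_only};
    sycl::accessor kb {d_b, cgh, sycl::read_only};
    sycl::accessor kc {d_c, cgh, sycl::write_only};
    const size_t N = array_size;
    cgh.parallel_for(
      // Launch a fixed, modest grid (e.g. a small multiple of the SM/CU count)
      // instead of one work-item per element.
      sycl::nd_range<1>(add_num_groups * add_wgsize, add_wgsize),
      [=](sycl::nd_item<1> item)
      {
        // Grid-stride loop: each work-item handles several elements, so the
        // per-thread indexing/setup overhead is amortised over more work.
        for (size_t i = item.get_global_id(0); i < N; i += item.get_global_range(0))
          kc[i] = ka[i] + kb[i];
      });
  });
  queue->wait();
}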

@Pennycook

> The extra SYCLUSM instructions are due to worse codegen and also to converting the global index to 64-bit (arguably semantically necessary in SYCL but not in CUDA).

If you know that the indices are 32-bit, you can try the -fsycl-id-queries-fit-in-int option. It basically wraps all of the functions in the sycl::id class that would return a 64-bit index with an assumption that the value fits in an int.

@tomdeakin tomdeakin mentioned this pull request Aug 5, 2022
@tomdeakin (Contributor)

The CUDA update here is somewhat fixed by 092ee67, which sets the number of thread blocks to 1024. If we think the approach here of querying the device is better CUDA code, we can use that instead. @jeffhammond - what would you recommend?

The HIP code has been updated by AMD in #139. They took a different approach, but as it came directly from AMD I'm inclined to trust the heuristic for selecting a good size.

The USM SYCL version is still interesting, and we still need to add a version that does that.

@jeffhammond (Contributor)

I'll run on a bunch of devices and see what the sensitivity is.

@tom91136 (Member) commented Sep 5, 2023

I've got a SYCL2020 USM version implemented in 3f7bb63; I'll try to merge it at some point.

tomdeakin added a commit that referenced this pull request Oct 6, 2023
This aligns with the approach implemented in other models (SYCL 1.2.1 and HIP)

Cherry-picks the CUDA updates from lmeadows in #122
@tomdeakin (Contributor)

The CUDA change has now been merged manually into develop as 9954b7d.
AMD have updated the HIP version themselves with similar changes to this.
And we now have a SYCL version with USM in develop.
Therefore, I'll close this as we now have everything included, albeit in a piecemeal way. Thanks all for the contributions!

@tomdeakin tomdeakin closed this Oct 6, 2023