
add nd_range to SYCL #83

Closed · wants to merge 1 commit

Conversation

jeffhammond (Contributor)

Add nd_range to SYCL to optimize for V100

Intel DPC++ uses a local size of 32, which is not optimal for STREAM-like kernels. PRK nstream analysis finds that 256 is appropriate for V100, and this setting works just fine on x86 platforms, among others.
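
A minimal sketch of the shape of the change, for illustration only (identifiers are illustrative, not the exact BabelStream source; assumes array_size is a multiple of the work-group size):

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

constexpr size_t wgsize = 256; // tuned for V100; also fine on x86

// Triad-style kernel launched with an explicit work-group size via nd_range,
// instead of letting the implementation pick one for a plain range.
void triad(sycl::queue &q, sycl::buffer<double> &a, sycl::buffer<double> &b,
           sycl::buffer<double> &c, double scalar, size_t array_size) {
  q.submit([&](sycl::handler &cgh) {
    auto ka = a.get_access<sycl::access::mode::write>(cgh);
    auto kb = b.get_access<sycl::access::mode::read>(cgh);
    auto kc = c.get_access<sycl::access::mode::read>(cgh);
    cgh.parallel_for<class triad_kernel>(
        sycl::nd_range<1>{sycl::range<1>{array_size}, sycl::range<1>{wgsize}},
        [=](sycl::nd_item<1> item) {
          const size_t i = item.get_global_id(0);
          ka[i] = kb[i] + scalar * kc[i];
        });
  });
}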

Hades Canyon NUC verification with Intel DPC++ from oneAPI beta09.

$ dpcpp -DSYCL -O3 main.cpp SYCLStream.cpp

$ ./a.out --device 0
BabelStream
Version: 3.4
Implementation: SYCL
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz
Driver: 2020.11.8.0.27
Reduction kernel config: 8 groups of size 8
Function    MBytes/sec  Min (sec)   Max         Average
Copy        20332.291   0.02640     0.02767     0.02659
Mul         20267.131   0.02649     0.02714     0.02658
Add         22927.175   0.03512     0.03744     0.03546
Triad       22780.968   0.03535     0.03585     0.03545
Dot         31974.622   0.01679     0.04985     0.02091

$ ./a.out --device 1
BabelStream
Version: 3.4
Implementation: SYCL
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device Intel(R) Gen9 HD Graphics NEO
Driver: 20.18.16699
Reduction kernel config: 96 groups of size 256
Function    MBytes/sec  Min (sec)   Max         Average
Copy        32263.941   0.01664     0.02050     0.01698
Mul         32463.600   0.01654     0.01950     0.01690
Add         31267.789   0.02576     0.02953     0.02617
Triad       31988.618   0.02517     0.02876     0.02563
Dot         28149.343   0.01907     0.03302     0.02016

Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>

@tomdeakin (Contributor) commented Oct 26, 2020

Thanks, it’s good to see that we can get good performance with the same work-group size.

In most of the models in BabelStream we don’t tune that parameter, as the implementation should pick a sensible default. I’d be interested to hear an argument for putting this magic number in the code rather than having the implementation supply it as a sensible default. In other (plain) words: why isn’t this a PR against DPC++?

Aside: We do pick a work-group size for reductions, using “sensible” heuristics we’ve found to work well. This is borne of necessity. For SYCL, as soon as I get time I’ll change this code to use the SYCL 2020 provisional reductions instead, which won’t require choosing a work-group size. For the other models without reductions there isn’t an alternative without resorting to external libraries (like CUB for CUDA, etc.), so we have to pick a consistent and hopefully good number.

Obviously these loops don’t need nd_range, so this change means dropping into the lower abstraction levels of SYCL. These kernels don’t use barriers, and using a range rather than an nd_range guarantees that.
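
For illustration, a minimal sketch of the basic range form described above (names illustrative, not the exact BabelStream source):

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

// Basic data-parallel form: no work-group size is specified and no barriers
// are possible, so the implementation is free to choose any grouping.
void copy(sycl::queue &q, sycl::buffer<double> &a, sycl::buffer<double> &c,
          size_t array_size) {
  q.submit([&](sycl::handler &cgh) {
    auto ka = a.get_access<sycl::access::mode::read>(cgh);
    auto kc = c.get_access<sycl::access::mode::write>(cgh);
    cgh.parallel_for<class copy_kernel>(
        sycl::range<1>{array_size},
        [=](sycl::id<1> idx) { kc[idx] = ka[idx]; });
  });
}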

I’d be interested in your take on this too.

@illuhad’s IWOCL talk showed some problems with using nd_range in library-only implementations. The new hipSYCL backend should work better because the parallel scheme is better for CPUs.

I’m not against merging this, @jeffhammond; I just want to make sure we are recommending best practice.

@jeffhammond (Contributor, Author)

@jbrodman @Pennycook do you guys have anything to add here?

@tomdeakin (Contributor)

My P3HPC Forum talk shows a useful set of SYCL numbers for Triad that are relevant. There are some updated results in my P3HPC SC Workshop talk. From memory, the SYCL numbers didn't change much between the two. The main reason I bring this up is that it gives us data to talk through.

All the raw results are available.

An omission is hipSYCL (before the recent big update to the CPU backend) on the CPUs, and DPC++ on the NVIDIA GPUs. We used the oneAPI beta binary so we didn't have to build it on the CPU systems, which have RHEL 7-era glibc. Building LLVM-based things is super delicate and we didn't manage to get them to work.

@Pennycook

I agree with @tomdeakin that you should open an issue against DPC++. It looks to me like the implementation of parallel_for there could use more sophisticated heuristics to determine the work-group size.

That said, I think there could be room in the BabelStream implementation to compare the range and nd_range forms of parallel_for, precisely to expose cases like these. The information is useful to implementers ("please do a better job of optimizing parallel_for") and to developers ("this implementation requires you to manually provide your work-group size").

@illuhad (Contributor) commented Oct 26, 2020

The new hipSYCL backend should work better because the parallel scheme is better for CPUs.

Yes. nd_range performance is now "okay" (still somewhat slower than the basic/hierarchical/scoped models) as long as there are no barriers. If you use barriers (or other forms of collective algorithms) you might feel synchronization overheads, because on a library-only CPU implementation the cost of a barrier, relative to the work most SYCL codes have in a single work item, is higher than on, say, GPUs.
I think we are still learning what the best practices are for performance portability when you can't do without nd_range. I can imagine that you might need to introduce a loop across multiple data elements per work item, in order to control and potentially increase the amount of work per work item for best performance; a sketch follows below. (Of course, this is in the end a poor man's implementation of hipSYCL's scoped parallelism model, which follows that idea :) )
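
A rough sketch of that idea (identifiers hypothetical; assumes the element count divides evenly):

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

// Basic parallel_for where each work item strides through several elements,
// so the work per item can be tuned without dropping down to nd_range.
void scaled_mul(sycl::queue &q, sycl::buffer<double> &b,
                sycl::buffer<double> &c, double scalar, size_t n) {
  constexpr size_t elems_per_item = 4;        // tuning knob (assumed value)
  const size_t n_items = n / elems_per_item;  // assumes n divides evenly
  q.submit([&](sycl::handler &cgh) {
    auto kb = b.get_access<sycl::access::mode::write>(cgh);
    auto kc = c.get_access<sycl::access::mode::read>(cgh);
    cgh.parallel_for<class mul_kernel>(
        sycl::range<1>{n_items}, [=](sycl::id<1> idx) {
          const size_t base = idx[0] * elems_per_item;
          for (size_t j = 0; j < elems_per_item; ++j)
            kb[base + j] = scalar * kc[base + j];
        });
  });
}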

That said, I think there could be room in the BabelStream implementation to compare the range and nd_range forms of parallel_for, precisely to expose cases like these. The information is useful to implementers ("please do a better job of optimizing parallel_for") and to developers ("this implementation requires you to manually provide your work-group size").

Agree 100%. This is already available in SYCL-Bench (https://github.com/bcosenza/sycl-bench/), where we have tried to offer as many benchmarks as possible with multiple kernel invocation variants (basic/nd_range/hierarchical) precisely because of this motivation. But it certainly can't hurt to have it in BabelStream as well :)

@jeffhammond (Contributor, Author)

Yes, DPC++ should be changed, but this is an important benchmark, and it is reasonable to give users a way to tune it that does not involve changing all the kernel code or changing the compiler.

This is the DPC++ data on SKX+V100:

BlockSize Rate (MB/s)     Time (s)
1024      833487          0.00515301
512       834197          0.00514863
256       834491          0.00514681
128       834702          0.00514551
64        835527          0.00514043
32        654573          0.00656147

Unfortunately, DPC++ parallel_for uses BlockSize=32.

Finally, BabelStream has this for CUDA in the form of TBSIZE. If BabelStream is going to be used to benchmark NVIDIA hardware, SYCL should be granted the same parameters that CUDA has.

@illuhad (Contributor) commented Oct 26, 2020

@jeffhammond but it's not really a SYCL issue - it's a DPC++ issue. hipSYCL and presumably ComputeCpp PTX backend will not suffer from that. I think this PR replaces a kernel invocation mechanism that we have precisely for performance portability and that BabelStream uses because it follows, in my opinion, best practices, with something that we know has inherent performance portability issues (and which I would not consider best practice). To me this seems like a workaround for a DPC++ bug at the expense of performance on unrelated SYCL implementations (e.g. hipSYCL CPU).

I think the correct way, as suggested by @Pennycook, is to have both an nd_range and a regular range implementation available. Then the user could select which one should be executed, and in addition, we could learn more about the performance characteristics of both models on various implementations.

@jeffhammond (Contributor, Author)

@illuhad do you know that hipSYCL and ComputeCpp will always select the globally optimal block size for every processor they support? The fact that CUDA has always required users to set this suggests it is not easy.

@illuhad (Contributor) commented Oct 26, 2020

@jeffhammond it's not easy (maybe not even solvable in general, because you can probably construct cases where the best choice depends on the input data), and there might be cases where they don't select the perfect group size. In general I think it would be a good idea to introduce a mechanism in SYCL to give the implementation a hint about the group size for basic parallel_for invocations.
A SYCL implementation is also free to provide mechanisms such as environment variables to alter its group-size selection; a sketch of such a knob is below.
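
A sketch of such a knob (hypothetical; STREAM_WGSIZE is not an existing SYCL or DPC++ variable):

#include <cstdlib>
#include <string>

// Read a work-group size from the environment, falling back to a default;
// the result could feed the local size of an nd_range launch.
size_t group_size_from_env(size_t fallback) {
  if (const char *s = std::getenv("STREAM_WGSIZE"))
    return std::stoul(s);
  return fallback;
}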

At the same time, basic parallel_for lives at a higher abstraction level than nd_range for a good reason: that is what allows it to be more performance portable. The inherent performance portability issues of nd_range have been known for a long time and have massively plagued library-only implementations.

You gain 20% performance for DPC++ CUDA by switching to nd_range, but because of nd_range's inherent issues, hipSYCL's CPU backend would at the same time lose at least 20%. Older hipSYCL versions would lose something like 10000x. This affects not only hipSYCL but any library-only implementation (e.g. triSYCL as well).

I'm not at all saying that nd_range shouldn't be tested; I'm only saying it shouldn't be tested exclusively, because it's known not to be a good model for performance portability. And since basic parallel_for usually works very well across implementations, I think using it should be considered best practice.

If you want performance portability while still being able to specify group sizes, you need a model like this, which is unfortunately not yet available outside of hipSYCL: https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md

@jeffhammond (Contributor, Author)

Yeah, we should definitely fix this in SYCL 2021. And I agree on the need for both implementations, not to mention USM support. I'll redo the PR to support both.

@Ruyk (Contributor) commented Oct 28, 2020

@jeffhammond but it's not really a SYCL issue - it's a DPC++ issue. hipSYCL and presumably ComputeCpp PTX backend will not suffer from that. I think this PR replaces a kernel invocation mechanism that we have precisely for performance portability and that BabelStream uses because it follows, in my opinion, best practices, with something that we know has inherent performance portability issues (and which I would not consider best practice). To me this seems like a workaround for a DPC++ bug at the expense of performance on unrelated SYCL implementations (e.g. hipSYCL CPU).

Just to clarify: ComputeCpp suffers from that on NVIDIA platforms, and to some extent on other platforms available in the CE. We have a simple heuristic that works in many cases, but we don't do anything fancy, especially not for NVIDIA platforms.

I think the correct way, as suggested by @Pennycook, is to have both an nd_range and a regular range implementation available. Then the user could select which one should be executed, and in addition, we could learn more about the performance characteristics of both models on various implementations.

Agree, I think it is best to have both implementations available; the automatic vs. manual selection is one of the aspects we should benchmark across implementations.

If CUDA users can make a manual selection, then SYCL users should be able to do the same, especially on the same NVIDIA hardware. Even if we write a heuristic that works for V100 today, it may be a while before we have one for A100 once support lands there, so manual parametrization is useful even to help tune implementations.

@jeffhammond (Contributor, Author)

Aside: @Ruyk if you give me the test code you'd use to figure out the best config for A100, I can run it. I have access to everything from Kepler to Ampere, including the Volta variants (NVLink, PCIe, ARM SoC).

@Ruyk (Contributor) commented Oct 28, 2020

Thanks but I would probably find it easier to contribute something smarter to the DPC++ CUDA backend than get you some test code used in our implementation! :-)

@jbrodman

@Ruyk Ping me when the PR is ready so I can approve. :)

@Ruyk (Contributor) commented Nov 3, 2020

Just wrote a quick patch (intel/llvm#2724); can someone test it on a decent NVIDIA GPU?

@jeffhammond (Contributor, Author) commented Nov 3, 2020 via email

@illuhad (Contributor) commented Dec 16, 2020

hipSYCL is getting support for a hint regarding the choice of the group size in basic parallel_for kernels:
AdaptiveCpp/AdaptiveCpp#418

@tomdeakin added this to the v5.0 milestone on Jun 22, 2021
@lfmeadow commented Feb 16, 2022

I'm late to the party, but I don't see any significant difference between the SYCL and CUDA scores on A100 once you accept the PR I just submitted :). SYCL suffers from extra overhead in the buffer implementation, so I pushed a new USM implementation that brings SYCL performance to within 1%, after fixing the poor block size choice for Dot in the CUDA version.
CUDA:

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1400479.230 0.00038     0.00040     0.00039     
Mul         1358066.660 0.00040     0.00041     0.00040     
Add         1364405.723 0.00059     0.00060     0.00060     
Triad       1374411.819 0.00059     0.00060     0.00060     
Dot         1357551.551 0.00040     0.00051     0.00041     

SYCLUSM:

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1392184.549 0.00039     0.00041     0.00039     
Mul         1350024.296 0.00040     0.00041     0.00040     
Add         1360041.424 0.00059     0.00077     0.00060     
Triad       1369403.300 0.00059     0.00063     0.00060     
Dot         1341409.323 0.00040     0.00044     0.00041     
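
For reference, a rough sketch of a USM-style triad (illustrative only; the actual PR may differ): device allocations replace buffers, avoiding the accessor and dependency-tracking overhead of the buffer path.

#include <sycl/sycl.hpp>

void usm_triad(sycl::queue &q, size_t n, double scalar) {
  // Device allocations via USM instead of sycl::buffer.
  double *a = sycl::malloc_device<double>(n, q);
  double *b = sycl::malloc_device<double>(n, q);
  double *c = sycl::malloc_device<double>(n, q);
  // ... initialize b and c on the device ...
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    a[i] = b[i] + scalar * c[i];
  }).wait();
  sycl::free(a, q);
  sycl::free(b, q);
  sycl::free(c, q);
}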
