Specifically, reduce the number of wait calls and remove any potential bottlenecks in kernel submission. More importantly, the performance_async_gpu test should pass!
Overall performance should be on par with the other GPU backends, such as OpenCL, Metal, and CUDA.
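
To make the "fewer wait calls" goal concrete, here is a minimal sketch that only counts host-side waits under two submission strategies. It is a toy model, not this backend's code: `FakeQueue`, `submit`, `wait_idle`, `run_blocking`, and `run_batched` are all hypothetical names invented for illustration, and no real GPU API is called.

```python
class FakeQueue:
    """Hypothetical stand-in for a GPU queue (an assumption, not a real API).

    It only records how many submits and host-side waits were issued.
    """

    def __init__(self):
        self.pending = []    # kernels submitted but not yet waited on
        self.completed = []  # kernels known to be finished
        self.submits = 0
        self.waits = 0

    def submit(self, kernels):
        # Queue one or more kernels in a single submission.
        self.pending.extend(kernels)
        self.submits += 1

    def wait_idle(self):
        # Block the host until everything pending has finished.
        self.completed.extend(self.pending)
        self.pending.clear()
        self.waits += 1


def run_blocking(queue, kernels):
    # One submit and one wait per kernel: the host stalls after every launch.
    for kernel in kernels:
        queue.submit([kernel])
        queue.wait_idle()


def run_batched(queue, kernels):
    # Submit everything at once and wait a single time at the end.
    queue.submit(list(kernels))
    queue.wait_idle()


kernels = ["blur_x", "blur_y", "sharpen", "copy_out"]

blocking = FakeQueue()
run_blocking(blocking, kernels)

batched = FakeQueue()
run_batched(batched, kernels)

print("blocking waits:", blocking.waits, "batched waits:", batched.waits)
```

Both strategies complete the same kernels in the same order; the batched one simply synchronizes with the host once instead of once per kernel. Whether batching like this is safe in the real backend depends on dependencies between kernels and on when results are read back, which this sketch does not model.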