[vulkan] Improve overall performance

Specifically, reduce the number of wait calls and remove any potential bottlenecks in kernel submission. More importantly, the performance_async_gpu test should pass!
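To illustrate what "fewer wait calls" could mean in practice, here is a minimal sketch that compares waiting after every kernel with deferring a single wait until results are needed. `MockQueue`, `run_wait_per_kernel`, and `run_single_wait` are hypothetical names invented for this illustration. They are not part of the Vulkan API or of the existing backend, and the mock only counts calls rather than modelling real GPU timing.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a GPU queue. This is NOT the Vulkan API;
// it only counts submissions and host-side waits.
struct MockQueue {
    int submits = 0;
    int waits = 0;
    std::vector<int> pending;    // submitted, not yet known to be finished
    std::vector<int> completed;  // finished, in completion order

    void submit(int kernel_id) {
        pending.push_back(kernel_id);
        ++submits;
    }

    // Conceptually blocks the host until all pending work is done.
    void wait_idle() {
        completed.insert(completed.end(), pending.begin(), pending.end());
        pending.clear();
        ++waits;
    }
};

// Waits after every kernel: n submissions cost n host stalls.
inline void run_wait_per_kernel(MockQueue &q, int n) {
    for (int i = 0; i < n; ++i) {
        q.submit(i);
        q.wait_idle();
    }
}

// Defers the wait until results are needed: n submissions, one host stall.
inline void run_single_wait(MockQueue &q, int n) {
    for (int i = 0; i < n; ++i) {
        q.submit(i);
    }
    q.wait_idle();
}
```

In this sketch, running 8 kernels through `run_wait_per_kernel` records 8 waits, while `run_single_wait` records 1, and both complete the same kernels in the same order. Whether the real backend can defer waits this aggressively depends on where host-visible results are actually required, which this mock does not capture.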

Overall performance should be on par with other GPU backends such as OpenCL, Metal, and CUDA.