CublasXT is a set of routines which accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading work across more than one GPU. By using a streaming design, cublasXT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored in the host's system memory. This provides out-of-core operation: the size of operand data is limited only by system memory size, not by GPU on-board memory size.
Starting with CUDA 6.0, a free version of cublasXT is included in the CUDA Toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and on dual-GPU cards such as the Tesla K10 or GeForce GTX 690.
The premier version of cublasXT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with 4 Tesla K40 GPUs is able to achieve over 4.5 TFLOPS of double precision performance!
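The host-pointer, out-of-core workflow described above can be sketched as follows. This is a minimal illustration of the cublasXt host API (handle creation, device selection, and a Level 3 GEMM on plain host buffers); it assumes a CUDA 6.0+ toolkit and at least one CUDA-capable GPU, and omits error checking for brevity, so it is a sketch rather than production code:

```c
#include <stdlib.h>
#include <cublasXt.h>  /* cublasXt host interface, shipped with cuBLAS */

int main(void) {
    const int n = 4096;
    const double alpha = 1.0, beta = 0.0;

    /* Operands live in ordinary pageable host memory; cublasXt tiles the
       matrices and streams the tiles over PCI-Express automatically, so the
       problem size is bounded by system RAM, not GPU memory. */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    /* ... fill A and B with application data ... */

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    /* Select which GPUs participate; listing more device IDs here is how
       the multi-GPU (Premier) scaling is expressed. */
    int devices[] = {0};
    cublasXtDeviceSelect(handle, 1, devices);

    /* Drop-in Level 3 BLAS call with host pointers — no explicit
       cudaMalloc/cudaMemcpy is needed. */
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Compile with something like `gcc app.c -lcublas -o app` against the CUDA toolkit's include and library paths; the exact link line depends on the local installation.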
NVBLAS
NVBLAS is a CPU BLAS implementation which automatically accelerates eligible BLAS calls via cublasXT, and is included with the CUDA Toolkit. All versions of cublasXT work with NVBLAS.
AVAILABILITY
The free version of cublasXT is included with the CUDA Toolkit in version 6.0 and beyond.
A free evaluation version of cublasXT Premier will be available to members of the CUDA Registered Developer Program.
According to "New Features in CUDA 6 Make GPU Acceleration Easier", this can be done simply by re-linking against NVBLAS (introduced in CUDA 6) or by changing the library load order, e.g. `gcc myapp.c -lnvblas -lmkl_rt -o myapp` or `env LD_PRELOAD=libnvblas.so myapp`.
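For the interception route to work, NVBLAS also needs a small configuration file telling it which CPU BLAS to fall back on. A hedged sketch of a typical setup (the `nvblas.conf` keys are the documented ones; the MKL library path and `myapp` binary are placeholders for the local environment):

```shell
# nvblas.conf — read from the working directory, or from the path in
# the NVBLAS_CONFIG_FILE environment variable.
#
#   NVBLAS_CPU_BLAS_LIB  libmkl_rt.so   # CPU BLAS used for calls NVBLAS does not offload
#   NVBLAS_GPU_LIST      ALL            # which GPUs participate (e.g. ALL, or "0 1")

# Option 1: re-link, putting -lnvblas ahead of the CPU BLAS so its
# symbols win at link time.
gcc myapp.c -lnvblas -lmkl_rt -o myapp

# Option 2: no rebuild — preload NVBLAS so it intercepts the BLAS
# symbols of an existing dynamically linked binary.
env LD_PRELOAD=libnvblas.so ./myapp
```

Only Level 3 BLAS routines are candidates for offload; everything else is forwarded unchanged to the CPU BLAS named in the config file.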
https://developer.nvidia.com/cublasxt