CublasXT is a set of routines which accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading work across more than one GPU. By using a streaming design, cublasXT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored in the host's system memory. This provides out-of-core operation: the size of operand data is limited only by system memory size, not by GPU on-board memory size.
Starting with CUDA 6.0, a free version of cublasXT is included in the CUDA Toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and on dual-GPU cards such as the Tesla K10 or GeForce GTX 690.
The premier version of cublasXT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with 4 Tesla K40 GPUs is able to achieve over 4.5 TFLOPS of double precision performance!
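The host-pointer, out-of-core workflow described above can be sketched as follows. This is a minimal illustration of the cublasXt host API (handle creation, device selection, and a Level 3 GEMM on plain host buffers); it assumes a CUDA 6.0+ toolkit and at least one CUDA-capable GPU, and omits error checking for brevity, so it is a sketch rather than production code:

```c
#include <stdlib.h>
#include <cublasXt.h>  /* cublasXt host interface, shipped with cuBLAS */

int main(void) {
    const int n = 4096;
    const double alpha = 1.0, beta = 0.0;

    /* Operands live in ordinary pageable host memory; cublasXt tiles the
       matrices and streams the tiles over PCI-Express automatically, so the
       problem size is bounded by system RAM, not GPU memory. */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    /* ... fill A and B with application data ... */

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    /* Select which GPUs participate; listing more device IDs here is how
       the multi-GPU (Premier) scaling is expressed. */
    int devices[] = {0};
    cublasXtDeviceSelect(handle, 1, devices);

    /* Drop-in Level 3 BLAS call with host pointers — no explicit
       cudaMalloc/cudaMemcpy is needed. */
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Compile with something like `gcc app.c -lcublas -o app` against the CUDA toolkit's include and library paths; the exact link line depends on the local installation.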
NVBLAS
NVBLAS is a CPU BLAS implementation which automatically accelerates eligible BLAS calls via cublasXT, and is included with the CUDA Toolkit. All versions of cublasXT work with NVBLAS.
AVAILABILITY
The free version of cublasXT is included with the CUDA Toolkit in version 6.0 and beyond.
A free evaluation version of cublasXT Premier will be available to members of the CUDA Registered Developer Program.
According to "New Features in CUDA 6 Make GPU Acceleration Easier", this can be done simply by re-linking against NVBLAS (introduced in CUDA 6) or by changing the library load order, e.g. `gcc myapp.c -lnvblas -lmkl_rt -o myapp` or `env LD_PRELOAD=libnvblas.so myapp`.
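For the interception route to work, NVBLAS also needs a small configuration file telling it which CPU BLAS to fall back on. A hedged sketch of a typical setup (the `nvblas.conf` keys are the documented ones; the MKL library path and `myapp` binary are placeholders for the local environment):

```shell
# nvblas.conf — read from the working directory, or from the path in
# the NVBLAS_CONFIG_FILE environment variable.
#
#   NVBLAS_CPU_BLAS_LIB  libmkl_rt.so   # CPU BLAS used for calls NVBLAS does not offload
#   NVBLAS_GPU_LIST      ALL            # which GPUs participate (e.g. ALL, or "0 1")

# Option 1: re-link, putting -lnvblas ahead of the CPU BLAS so its
# symbols win at link time.
gcc myapp.c -lnvblas -lmkl_rt -o myapp

# Option 2: no rebuild — preload NVBLAS so it intercepts the BLAS
# symbols of an existing dynamically linked binary.
env LD_PRELOAD=libnvblas.so ./myapp
```

Only Level 3 BLAS routines are candidates for offload; everything else is forwarded unchanged to the CPU BLAS named in the config file.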
https://developer.nvidia.com/cublasxt