Batched Tridiagonal Systems Solver Library for Xilinx and Intel FPGAs

The Tridsolver-FPGA library provides high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. The library is based on the inexpensive Thomas algorithm, with batching of multiple systems, for solving small and medium-sized systems, and on hybrid Thomas-PCR and Thomas-Thomas algorithms for solving larger systems. The HLS techniques used to implement the library and the data path for 3D ADI applications can be found here. The library currently supports Xilinx and Intel FPGA devices and has been tested on Xilinx Alveo U280 and Alveo U50 cards and on the Intel PAC D5005. The library and performance results are currently under review for publication.

Representative applications

The library has been used to implement 2D and 3D heat diffusion applications using FP32 and FP64 arithmetic. The implementations support batched computation of systems. The /FPGA/Xilinx directory contains the following variants of these applications targeting Xilinx FPGAs. The library and applications are implemented in C++ for Vivado.

ADI2D_F32 2D ADI application using FP32
ADI2D_F64 2D ADI application using FP64
ADI3D_F32 3D ADI application using FP32
ADI3D_F64 3D ADI application using FP64
ADI2D_TH_TH_F32 2D ADI application with Tiled Thomas-Thomas solver using FP32
ADI2D_THPCR_F32 2D ADI application with Tiled Thomas-PCR solver using FP32

The /FPGA/Intel directory contains the batched Thomas solver library, the data path library, and a 2D ADI application using FP32 arithmetic, all targeting Intel FPGAs. DPC++ is used to implement the library and application.

Application Implementations

A Makefile-based flow is supported for building the FPGA applications. Optionally, users can build an application with the Vitis GUI to target Xilinx FPGAs; in that case, the user needs to point to the config file and set the number of kernels. Note that separate config files are provided for the U50 and U280 devices.

The steps for the Makefile-based flow for Xilinx FPGAs are as follows:

cd <application directory>

Set the target config file (_u50.cfg or u280.cfg) in the Makefile

make build TARGET=<sw_emu/hw_emu/hw> PLATFORM=<FPGA platform>

make run TARGET=<sw_emu/hw_emu/hw> PLATFORM=<FPGA platform>

Make sure the XRT setup.sh and Vitis settings64.sh scripts are sourced before using the Makefile commands, e.g.:

source /disk1/Xilinx/Vitis/2019.2/settings64.sh

source /opt/xilinx/xrt/setup.sh

Applications targeting Intel FPGAs can be compiled using the following make command. The target board is set to the Intel PAC D5005.

make report/run_emu/hw

This requires the Intel oneAPI toolkit as well as the FPGA add-on.

Performance comparison of Xilinx Acceleration Cards with Nvidia V100 GPU

The performance of the Tridsolver-FPGA library on Xilinx FPGAs has been compared to the performance of the same applications on Nvidia V100 GPUs (using the Tridsolver GPU library by László et al. and NVIDIA's cuSPARSE). The following results are for the 2D and 3D heat diffusion applications implemented with the ADI technique and a Stochastic Local Volatility (SLV) model application implemented with a Hundsdorfer-Verwer (HV) method for time integration.

Xilinx Alveo U50 Vs Nvidia V100

2D ADI Heat Diffusion Application Performance, 120 iterations
Figure panels: FP32, v = 8, fCU = 3, NCU = 2; FP64, v = 8, fCU = 3, NCU = 2

3D ADI Heat Diffusion Application Performance, 100 iterations
Figure panels: FP32, v = 8, NCU = 4; FP64, v = 8, NCU = 2

2D ADI Heat Diffusion Application on Larger Meshes, 100 iterations
Figure panels: FP32, Thomas-Thomas solver, NCU = 4; FP32, Thomas-PCR solver, NCU = 4

SLV Application Performance
Figure panels: 40x20 mesh, v = 1, NCU = 2, FP64; 100x50 mesh, v = 1, NCU = 2, FP64

Xilinx Alveo U280 Vs Nvidia V100

2D ADI Heat Diffusion Application Performance, 120 iterations
Figure panels: FP32, v = 8, fCU = 3, NCU = 3; FP64, v = 8, fCU = 3, NCU = 3

3D ADI Heat Diffusion Application Performance, 100 iterations
Figure panels: FP32, v = 8, NCU = 6; FP64, v = 8, NCU = 3

2D ADI Heat Diffusion Application on Larger Meshes, 100 iterations
Figure panels: FP32, Thomas-Thomas solver, NCU = 4; FP32, Thomas-PCR solver, NCU = 4

SLV Application Performance
Figure panels: 40x20 mesh, v = 1, NCU = 3, FP64; 100x50 mesh, v = 1, NCU = 3, FP64