Release Notes for QUDA v0.2.5                              24 June 2010
-----------------------------

Overview:

QUDA is a library for performing calculations in lattice QCD on
graphics processing units (GPUs) using NVIDIA's "C for CUDA" API.
This release includes optimized kernels for applying the Wilson Dirac
operator and clover-improved Wilson Dirac operator, kernels for
performing various BLAS-like operations, and full inverters built on
these kernels. Mixed-precision implementations of both CG and
BiCGstab are provided, with support for double, single, and half
(16-bit fixed-point) precision.

Software compatibility:

The library has been tested under Linux (CentOS 5.4 and Ubuntu 8.04)
using releases 2.3 and 3.0 of the CUDA toolkit. There are known
issues with CUDA 2.1 and 2.2, but 2.0 should work if one is forced to
use an older version (for compatibility with an old driver, for
example).

Under Mac OS X, the library fails to compile with CUDA 2.3 due to bugs
in the toolkit. It might work with CUDA 3.0, 2.2, or 2.0, but these
haven't been tested.

See also "Known issues" below.

Hardware compatibility:

For a list of supported devices, see

   http://www.nvidia.com/object/cuda_learn_products.html

Before building the library, you should determine the "compute
capability" of your card, either from NVIDIA's documentation or by
running the deviceQuery example in the CUDA SDK, and set GPU_ARCH in
make.inc appropriately. Setting GPU_ARCH to 'sm_13' or 'sm_20' will
enable double precision support.

Installation:

In the source directory, copy 'make.inc.example' to 'make.inc', and
edit the first few lines to specify the CUDA install path, the
platform (x86 or x86_64), and the GPU architecture (see "Hardware
compatibility" above). Then type 'make' to build the library.
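
For a 64-bit Linux machine with a compute capability 1.3 card, the
edited portion of make.inc might look roughly like the lines below.
This is only a sketch: GPU_ARCH and DEVICE are the names used
elsewhere in this README, but the other variable names shown are
illustrative, so defer to make.inc.example for the actual ones.

   # path to the CUDA toolkit installation (illustrative variable name)
   CUDA_INSTALL_PATH = /usr/local/cuda

   # CPU platform (x86 or x86_64; illustrative variable name)
   CPU_ARCH = x86_64

   # GPU architecture; sm_13 or sm_20 enables double precision
   GPU_ARCH = sm_13

   # CUDA device used by 'make tune' (see below)
   DEVICE = 0
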
As an optional step, 'make tune' will invoke tests/blas_test to
perform autotuning of the various BLAS-like functions needed by the
inverters. This involves testing many combinations of launch
parameters (corresponding to different numbers of CUDA threads per
block and blocks per grid for each kernel) and writing the optimal
values to lib/blas_param.h. The new values will take effect the next
time the library is built. Ideally, the autotuning should be
performed on the machine where the library is to be used, since the
optimal parameters will depend on the CUDA device and host hardware.
They will also depend slightly on the lattice volume; if desired, the
volume used in the autotuning can be changed by editing
tests/blas_test.cu.

In summary, for an optimized install, run

   make && make tune && make

(after optionally editing blas_test.cu). By default, the autotuning
is performed using CUDA device 0. To select a different device
number, set DEVICE in make.inc appropriately.

Using the library:

Include the header file include/quda.h in your application, link
against lib/libquda.a, and study tests/invert_test.c for an example of
the interface. The various inverter options are enumerated in
include/enum_quda.h.
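
In skeleton form, the calling sequence looks something like the
following. This is only a sketch: the parameter structures contain
many required members (lattice dimensions, precisions, inverter
settings, and so on) that are omitted here, and the host field layout
shown is an assumption; tests/invert_test.c shows a complete, working
setup.

   #include <quda.h>

   int main(void)
   {
     QudaGaugeParam gauge_param;   /* describes the host gauge field       */
     QudaInvertParam inv_param;    /* inverter type, precisions, tolerance */

     void *gauge[4];               /* host gauge field, one pointer per
                                      spacetime direction (assumed layout) */
     void *spinor_in;              /* host source spinor                   */
     void *spinor_out;             /* host solution spinor                 */

     /* ... set the members of gauge_param and inv_param, and allocate
        and fill the host fields, following tests/invert_test.c ... */

     initQuda(0);                                    /* use CUDA device 0       */
     loadGaugeQuda((void *)gauge, &gauge_param);     /* copy gauge field to GPU */
     invertQuda(spinor_out, spinor_in, &inv_param);  /* run the solver          */
     endQuda();                                      /* free device resources   */

     return 0;
   }

Since libquda.a is built with the CUDA compiler, the final link of
such a program will typically also need the CUDA runtime library
(libcudart) and the C++ runtime in addition to -lquda.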

Known issues:

* When the library is compiled with version 3.0 of the CUDA toolkit
  and run on Fermi (GPU architecture sm_20), RECONSTRUCT_8 gives wrong
  results in double precision. This appears to be a bug in CUDA 3.0,
  since CUDA 3.1 beta has no such issue. Note that this problem isn't
  likely to matter in practice, since RECONSTRUCT_8 generally performs
  worse in double precision than RECONSTRUCT_12.

* Compiling in emulation mode with CUDA 3.0 does not work when the
  GPU architecture is set to sm_10, sm_11, or sm_12. As an
  alternative, one can either compile for sm_13 (recommended) or use
  CUDA 2.3. Note that NVIDIA is eliminating emulation mode completely
  from CUDA 3.1 and later.

* When building for the 'sm_13' or 'sm_20' GPU architectures (which
  enable double precision support), one of the stages in the build
  process requires over 5 GB of memory. If too little memory is
  available, the compilation will either take a very long time (given
  enough swap space) or fail completely. In addition, the CUDA C
  compiler requires over 1 GB of disk space in /tmp for the creation
  of temporary files.

* For compatibility with CUDA, on 32-bit platforms the library is
  compiled with the GCC option -malign-double. This differs from the
  GCC default and may affect the alignment of various structures,
  notably those of type QudaGaugeParam and QudaInvertParam, defined in
  quda.h. Therefore, any code to be linked against QUDA should also
  be compiled with this option.
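
  As a rough illustration (the file and path names here are only
  placeholders), a 32-bit application might then be compiled with

     gcc -malign-double -I/path/to/quda/include -c my_app.c

  before being linked against lib/libquda.a.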

Contact information:

For help or to report a bug, please contact Mike Clark
(mikec@seas.harvard.edu) or Ron Babich (rbabich@bu.edu).

If you find this code useful in your work, please cite:

M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, "Solving
Lattice QCD systems of equations using mixed precision solvers on
GPUs" (2009), arXiv:0911.3191 [hep-lat].

Please also drop us a note so that we may inform you of updates and
bug-fixes. The most recent public release will always be available
online at http://lattice.bu.edu/quda/