forked from lgpang/clvisc
-
Notifications
You must be signed in to change notification settings - Fork 0
License
nikhilhatwar/clvisc
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
= CLVisc: a (3+1)D viscous hydrodynamic program parallelized on GPU using OpenCL = The program is used to simulate the evolution of strongly coupled quark gluon plasma produced in relativistic heavy ion collisions. Please cite the following paper if you used CLVisc for publications or reused part of its code, @article{Pang:2018zzo, author = "Pang, Long-Gang and Petersen, Hannah and Wang, Xin-Nian", title = "{Pseudorapidity distribution and decorrelation of anisotropic flow within CLVisc hydrodynamics}", year = "2018", eprint = "1802.04449", archivePrefix = "arXiv", primaryClass = "nucl-th", SLACcitation = "%%CITATION = ARXIV:1802.04449;%%" } Copyright (C) 2018, Long-Gang Pang This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>. ==Installation== First of all, get CLVisc using: git clone https://gitlab.com/snowhitiger/PyVisc.git 1. Install OpenCL (1) For MacBook Pro, OpenCL is supported by default, skip this step. (2) For Linux using Nvidia GPU, install CUDA -- Shipped with OpenCL. url: https://developer.nvidia.com/cuda-downloads (3) For Linux using AMD GPU, install AMD APP SDK from http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/ (4) For super cluster with GPUs, ask the IT-help people for the OpenCL/Cuda support. 2. Download and install the latest Anaconda from https://www.continuum.io/downloads Important: please choose Python2.7 (although most of the code work well with Python3.*) Notice: in case you use Python2.7 from other sources, please also install matplotlib, h5py, pandas, sympy. These 4 packages are delivered with Anaconda by default. 3. Install PyOpenCL `conda install -c conda-forge pyopencl` Till now you can run ideal.py and viscous.py in pyvisc/ directory to run one ideal and one viscous hydro event, the hydrodynamic evolution will produce and print the evolution history and freeze out hyper-surface in result/ directory. In order to calculate smooth particle spectra or sample hadrons from hyper-surface, one needs to additionally install *cmake* and *gsl*. 4. Install cmake (1) For MacBook, Run in Terminal app: `ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null` and press *enter/return* key. Wait for the command to finish. Run: `brew install cmake` Done! You can now use cmake. (2) For Linux, search on google 5. Install gsl library (1) For MacBook, `brew install gsl` (2) For Linux, search on google 6. Event-by-event hydro using trento initial condition (1) Install trento cd 3rdparty/trento_with_participant_plane/ mkdir build cd build cmake .. make make install (2) Compile the MC sampling spectra calculation subroutines cd sampler/ mkdir build cd build cmake .. make (3) Compile the Smooth spectra calcualation subroutines cd CLSmoothSpec/ mkdir build cd build cmake .. make Notice: this step will fail with MacOS version > 10.8 because apple depreciated some OpenCL functions. Please use this subroutine on a GPU cluster/Linux machine. This program will be updated in the future to consider the MacOS updates. (4) In PyVisc/pyvisc/, modify the output path in ebe_trento.py and run #python ebe_trento.py collision_sys centrality gpu_id num_of_events python ebe_trento.py auau200 0_5 0 100 python ebe_trento.py pbpb2760 20_30 0 100 python ebe_trento.py pbpb5020 30_40 0 100 7. If there is error : No module named 'mako', one can install Mako using pip install --user Mako 8. Modify cache_dir in cache.py if the cluster does not have /tmp directory anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/cache.py 322 def _create_built_program_from_source_cached(ctx, src, options_bytes, 323 devices, cache_dir, include_path): 324 from os.path import join 325 326 if cache_dir is None: 327 import appdirs 328 #cache_dir = join(appdirs.user_cache_dir("pyopencl", "pyopencl"), 329 # "pyopencl-compiler-cache-v2-py%s" % ( 330 # ".".join(str(i) for i in sys.version_info),)) 331 cache_dir = '/lustre/nyx/hyihp/lpang/tmp/' ==Examples== 1. cd pyvisc python ideal.py 2. cd pyvisc python visc.py Notice: the visc.py has huge GPU memory demands and can only be run on GPUs whose memory > 5GB. 3. cd python modify ebe_trento.py to run event-by-event hydrodynamics with Trento initial condition ==The BSZ dependence== For ideal hydrodynamics, with lattice 385*385*115, per step running time is: BSZ 8 16 32 64 128 Ideal(s) 0.37 0.218 0.178 0.155 0.157 Visc(s)-GPU 3.12 1.65 1.17 1.01 1.17 Visc(s)-CPU 6.64 6.45 6.63 7.0 7.58 ==The importance of concurrent reading from Global memory.== Here I used NX=NY=NZ=201 for a test, in principle the time cost for visc_src_alongx, visc_src_alongy, visc_src_alongz should have no difference. However, from line profiler by using: {{{kernpro -l -v visc.py }}} One gets 41.9 vs 38.4 vs 6.9 for x, y and z direction. Why there is so big difference? It can be explained by the order of the data in global memory, where we use: {{{ for (int i = 0; i < NX; i++ ) for (int j = 0; j < NY; j++ ) for (int k = 0; k < NZ; k++ ) { pimn[i*NY*NZ + j*NZ + K] = some number; } }}} The data is continues along z direction, which makes it much faster to read from global memory to local memory in z direction than x and y due to concurrent. Total time: 19.2475 s File: visc.py Function: IS_stepUpdate at line 165 Line # Hits Time Per Hit % Time Line Contents ============================================================== 165 @profile 166 def IS_stepUpdate(self, step): 167 #print "ideal update finished" 168 52 152 2.9 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ 169 170 52 143954 2768.3 0.7 self.kernel_IS.visc_src_christoffel(self.queue, (NX*NY*NZ,), None, 171 52 134 2.6 0.0 self.d_IS_src, self.d_pi[step], self.ideal.d_ev[step], 172 52 581359 11180.0 3.0 self.ideal.tau, np.int32(step)).wait() 173 174 52 159298 3063.4 0.8 self.kernel_IS.visc_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1), 175 52 143 2.8 0.0 self.d_IS_src, self.d_udx, self.d_pi[step], self.ideal.d_ev[step], 176 52 8055724 154917.8 41.9 self.eos_table, self.ideal.tau).wait() 177 178 #print "udx along x" 179 180 51 156991 3078.3 0.8 self.kernel_IS.visc_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1), 181 51 151 3.0 0.0 self.d_IS_src, self.d_udy, self.d_pi[step], self.ideal.d_ev[step], 182 51 7381515 144735.6 38.4 self.eos_table, self.ideal.tau).wait() 183 184 #print "udy along y" 185 51 157382 3085.9 0.8 self.kernel_IS.visc_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ), 186 51 137 2.7 0.0 self.d_IS_src, self.d_udz, self.d_pi[step], self.ideal.d_ev[step], 187 51 1329880 26076.1 6.9 self.eos_table, self.ideal.tau).wait() 188 189 #print "udz along z" 190 51 302246 5926.4 1.6 self.kernel_IS.update_pimn(self.queue, (NX*NY*NZ,), None, 191 51 141 2.8 0.0 self.d_pi[3-step], self.d_goodcell, self.d_pi[1], self.d_pi[step], 192 51 82 1.6 0.0 self.ideal.d_ev[1], self.ideal.d_ev[2], self.d_udiff, 193 51 94 1.8 0.0 self.d_udx, self.d_udy, self.d_udz, self.d_IS_src, 194 51 978116 19178.7 5.1 self.eos_table, self.ideal.tau, np.int32(step) 195 ).wait() ==Usage of vloadn to speed up global data access== In kernel_visc.cl, one needs to load (pitt, pitx, pity, pitz) and (pixt, pixx, pixy, pixz) in src_alongx, needs to load (pitt, pitx, pity, pitz) and (piyt, piyx, piyy, piyz) in src_alongy, needs to load (pitt, pitx, pity, pitz) and (pizt, pizx, pizy, pizz) in src_alongz; Since the data are stored in for(i, j, k) order, so loading data along z is faster than along y. However, self.kernel_visc.kt_src_alongx is much faster than loading data along y. This may be caused by continues address for pixx, pixy, pizx. I tried to use vload4 but it does not speed up the code, which means the compiler already did the optimization. Total time: 8.60907 s File: visc.py Function: visc_stepUpdate at line 124 Line # Hits Time Per Hit % Time Line Contents ============================================================== 124 @profile 125 def visc_stepUpdate(self, step): 126 ''' Do step update in kernel with KT algorithm for visc evolution 127 Args: 128 gpu_ev_old: self.d_ev[1] for the 1st step, 129 self.d_ev[2] for the 2nd step 130 step: the 1st or the 2nd step in runge-kutta 131 ''' 132 # upadte d_Src by KT time splitting, along=1,2,3 for 'x','y','z' 133 # input: gpu_ev_old, tau, size, along_axis 134 # output: self.d_Src 135 108 324 3.0 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ 136 108 334097 3093.5 3.9 self.kernel_visc.kt_src_christoffel(self.queue, (NX*NY*NZ, ), None, 137 108 278 2.6 0.0 self.ideal.d_Src, self.ideal.d_ev[step], 138 108 162 1.5 0.0 self.d_pi[step], self.eos_table, 139 108 854615 7913.1 9.9 self.ideal.tau, np.int32(step) 140 ).wait() 141 142 108 296185 2742.5 3.4 self.kernel_visc.kt_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1), 143 108 275 2.5 0.0 self.ideal.d_Src, self.ideal.d_ev[step], 144 108 166 1.5 0.0 self.d_pi[step], self.eos_table, 145 108 1313623 12163.2 15.3 self.ideal.tau).wait() 146 147 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1), 148 108 251 2.3 0.0 self.ideal.d_Src, self.ideal.d_ev[step], 149 108 167 1.5 0.0 self.d_pi[step], self.eos_table, 150 108 2435409 22550.1 28.3 self.ideal.tau).wait() 151 152 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ), 153 108 261 2.4 0.0 self.ideal.d_Src, self.ideal.d_ev[step], 154 108 180 1.7 0.0 self.d_pi[step], self.eos_table, 155 108 1093978 10129.4 12.7 self.ideal.tau).wait() 156 157 # if step=1, T0m' = T0m + d_Src*dt, update d_ev[2] 158 # if step=2, T0m = T0m + 0.5*dt*d_Src, update d_ev[1] 159 # Notice that d_Src=f(t,x) at step1 and 160 # d_Src=(f(t,x)+f(t+dt, x(t+dt))) at step2 161 # output: d_ev[] where need_update=2 for step 1 and 1 for step 2 162 108 409861 3795.0 4.8 self.kernel_visc.update_ev(self.queue, (NX*NY*NZ, ), None, 163 108 277 2.6 0.0 self.ideal.d_ev[3-step], self.ideal.d_ev[1], 164 108 169 1.6 0.0 self.d_pi[0], self.d_pi[3-step], 165 108 147 1.4 0.0 self.ideal.d_Src, 166 108 1274723 11803.0 14.8 self.eos_table, self.ideal.tau, np.int32(step)).wait()
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Python 44.0%
- C++ 20.9%
- Jupyter Notebook 18.7%
- C 11.4%
- CMake 3.2%
- TeX 1.0%
- Other 0.8%