ChaNGa User Guide
ChaNGa is a parallel N-body and Smoothed Particle Hydrodynamics (SPH) solver that was initially designed for cosmological simulations of galaxy formation. While the gravity and SPH algorithms are closely based on previously developed codes, ChaNGa's unique feature is its implementation in the parallel language Charm++. The motivation for creating the code was to address the challenges of efficiently performing large cosmological simulations, which in turn must model gas and gravitational dynamics over the large range of scales relevant to galaxy formation. While cosmology on large machines is the primary motivation, the code is suitable for addressing a wide range of gravitational or hydrodynamical problems, from isolated galaxies to protoplanetary disks. Furthermore, since even laptops are now multicore machines, the parallel machinery of Charm++ allows ChaNGa to use individual computers effectively as well as to scale up to tens of thousands of cores.
The purpose of this guide is to introduce the user to the practical aspects of performing a simulation using ChaNGa. A description of the algorithms used, the code design, and performance tests can be found in @CharmApps. Here we give a quick overview, and then focus on obtaining, compiling, and running the code. We finish with an overview of software that can be used to analyze the results.
The central feature of ChaNGa is a tree-based gravity solver using a variant of the Barnes-Hut @barnes86 algorithm. In order to efficiently perform cosmological gravity simulations it also includes:
- Periodic boundary conditions: Ewald summation is used to model a finite volume within a periodic Universe.
- Multistepping: particle trajectories are calculated on block, power-of-two timestep intervals, where each particle has an individual timestep.
Gas dynamics and star formation are also important processes in galaxy formation. ChaNGa includes the following processes:
- SPH using an exact $k^{th}$ nearest neighbor search.
- Isothermal or adiabatic gas, or cooling in primordial gas.
- Star formation based on a density criterion.
- Energy, metal, and mass feedback from supernovae and stellar winds.
The collaboration between the Parallel Programming Laboratory at UIUC and the N-body shop at UW was initially funded by an NSF ITR grant. The NASA AISR program also funded some of the development of ChaNGa.
While the gravity algorithm can be simply described as a variant of the Barnes-Hut algorithm, Joachim Stadel's implementation in PKDGRAV and PKDGRAV2 made the algorithm more efficient with an optimized walk, and more accurate by extending the multipole forces to fourth order (hexadecapole). The SPH algorithms are taken from James Wadsley's implementation in GASOLINE @wadsley04. The star formation and feedback is based on Greg Stinson's work @Stinson06, which, in turn, is based on the work of Neal Katz. Hence, ChaNGa is essentially the SPH and gravity algorithms taken from GASOLINE, with some gravity enhancements from PKDGRAV2.
The public release version of ChaNGa is currently distributed under the GNU General Public License (GPL) version 2.
First you need to install Charm. Version 7.0.0 or later is required for the version 3.5 release of ChaNGa. The latest development version of Charm is needed for the latest development version of ChaNGa.
ChaNGa itself can be downloaded from the N-body shop site or from github. The most recent version of ChaNGa can be obtained from the git archive. If you install the development version, you will generally also need to install the development version of Charm.
When downloading from the git repository, you will need both the changa and the utility modules:
git clone https://github.com/N-BodyShop/changa.git
git clone https://github.com/N-BodyShop/utility.git
Download the latest Charm with:
git clone https://github.com/UIUC-PPL/charm.git
It is best to make sure you compile the most up-to-date stable release. So before you compile, from within the charm directory do:
git checkout v7.0.0
However, if you are looking for the latest/greatest performance enhancements, or you are using the development version of ChaNGa, you may need to use the development branch. In this case do:
git checkout master
Charm needs to be built first. This can be done with the build script in the top level of the charm distribution. ChaNGa uses a couple of Charm libraries that may not be built by default. To be sure that these are built, use ChaNGa as the first argument to the build command. A typical build command will be:
./build ChaNGa netlrts-linux-x86_64 --with-production
This example is for a 64-bit Linux cluster that is connected only by ethernet, or an isolated multi-core machine. Usually one wants to use the high performance interconnect. A convenient way to use that interconnect is to build with the provided MPI libraries:
./build ChaNGa mpi-linux-x86_64 --with-production
Charm has native drivers for many high performance networks. A native InfiniBand build can be obtained with:
./build ChaNGa verbs-linux-x86_64 --with-production
There are also drivers for most flavors of Crays: mpi-crayxt for the XT4 and XT5, gni-crayxe for the XE6/XK7, and gni-crayxc for the XC30.
For clusters that are composed of SMP nodes with many cores, Charm can take advantage of the shared memory among cores to reduce messages and memory use. Adding smp as an argument to build will enable this feature. E.g., for an InfiniBand cluster:
./build ChaNGa verbs-linux-x86_64 smp --with-production
ChaNGa itself is compiled by first using the configure script to generate a Makefile. Running ./configure --help in the changa directory will give you the available options. Features of note include --enable-cooling: adiabatic gas is the default, so a cooling module can be selected with this option (e.g., --enable-cooling=cosmo). For simulations with large dynamic range, greater than 1 million in spatial scales, --enable-bigkeys will be needed to handle the very deep trees.
configure
needs to be able to find the charm distribution. If charm
and changa
are both subdirectories of a common directory, this will happen
automatically. Otherwise set the environment variable CHARM_DIR
to point
at the charm source directory.
Once configure runs successfully, make will compile the code, usually producing two executables: charmrun and ChaNGa. It is generally a good idea to do a make clean before running make after rerunning configure or updating from the git repository.
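For example, a typical configure-and-build sequence might look like the following (a sketch; the directory layout and the choice of cooling module are illustrative):

```
# Assumes charm and changa are subdirectories of the same parent;
# otherwise point CHARM_DIR at the charm source directory.
export CHARM_DIR=$HOME/charm
cd changa
./configure --enable-cooling=cosmo   # omit for the default adiabatic gas
make clean                           # good practice after rerunning configure
make                                 # produces charmrun and ChaNGa
```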
Some small test problems are included with the ChaNGa distribution to allow one to quickly determine that the program is operating properly.
In the subdirectory teststep there is a King model problem that checks
whether energy is conserved while multistepping a King model realized
with 36000 particles. Typing the command make test
in this directory
will run ChaNGa on this problem and compare the energy conservation with
a benchmark integration. The energies should agree to three significant
figures. By default, the test is run on two cores. Edit the file
runtest.sh
to run this test on a different number of cores.
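For instance, assuming the build above succeeded and you are in the changa directory, the King model check is just (a minimal sketch):

```
cd teststep
make test    # runs ChaNGa on 2 cores and compares the energy conservation
```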
In the subdirectory testcosmo is a small cosmological simulation volume. The script run.sh will run this simulation, and mf.sh will run a halo finder on the result.
In the subdirectory testshock is a simple test of the gas dynamics using a shock tube. make test will run the shock tube simulation and create a file, pres.out, that has pressure and density as a function of distance along the shock tube. This can be compared with a standard calculation in pres_gasoline.out.
In the subdirectory testcollapse is a combined test of gravity and SPH
via modelling the collapse of a self-gravitating sphere of adiabatic
gas. make test
will run the simulation, and calls tipsy
to create a
radial profile of the results. A python script is used to compare the
resulting profile, in particular comparing the inner entropy profile and
the outer velocity profile with a standard calculation.
### testonestar
Test energy and metal production from stellar evolution.
A non-trivial aspect of running on multi-node parallel machines is getting the executable running on all the allocated nodes. For the MPI builds, this is done using the installed MPI infrastructure (usually mpirun or mpiexec). See the documentation specific to your cluster for details.
For net builds, the charmrun executable performs this task by using ssh to start ChaNGa on all the allocated nodes. charmrun needs to know what hosts have been allocated; this is specified in a nodelist file which has the format:
group main ++shell /usr/bin/ssh
host node1
host node2
...
where node1
etc. are the available nodes. On a cluster with a batch
queuing system, these nodes need to be obtained from the batch environment.
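On a cluster running SLURM, for example, a suitable nodelist can be generated from the job's allocation (a sketch; the ssh path and the SLURM specifics are assumptions about your site):

```
# Build a nodelist from the hosts SLURM has allocated to this job.
echo "group main ++shell /usr/bin/ssh" > nodelist
scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/^/host /' >> nodelist
```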
Once the nodelist
is constructed, ChaNGa is started with (e.g.):
charmrun +p 128 ++nodelist nodelist ChaNGa test.param
For clusters with an MPI installation, charmrun
can use the MPI
infrastructure instead of using ssh. This is done using the ++mpiexec
flag:
charmrun ++mpiexec ChaNGa test.param
Part of the power of the charm run-time system is its ability to automatically
balance the load among processors by moving parts of the computation around.
In order for this to happen a load balancing strategy needs to be specified.
There are several load balancers that work well for ChaNGa. For single stepping (i.e. all particles on the largest timestep) Orb3dLB_notopo or OrbLB can work well. For multistepping runs, MultistepLB_notopo is an extension of Orb3dLB_notopo that handles multistepping. Load balancers are specified with the +balancer option, e.g.:
charmrun ++mpiexec ChaNGa +balancer MultistepLB_notopo test.param
ChaNGa uses the object migration capabilities of Charm for checkpointing: all the objects in the program are migrated to disk. They are stored in a large number of files (several per core) in a directory ending in either chk0 or chk1. The directories are alternated so that if there is a system crash in the middle of writing one directory, the other is still available. The last checkpoint written is indicated by a 0 or 1 written to the file lastcheckpoint. Restarts are done by specifying +restart and the restart directory, e.g.:
charmrun ++mpiexec ChaNGa +restart test.chk0
Note that the param file is largely ignored on restart; only a select group of parameters can be changed on restart.
As well as providing diagnostics on how the simulation is progressing, the log file gives some provenance information. In particular, the first few lines of the log file give the version of ChaNGa, the command line that was used, the number of processors used, the date it was compiled, and the compile options. Following that, the values of all the run parameters are printed, one per line. After all the parameters are printed, one line is printed every iLogInterval timesteps. The line contains the simulation time, the redshift (if cosmological), various measures of energy, the total angular momentum, and the wallclock time of the last timestep. Note that the first line will contain the wallclock time of the first force calculation, while subsequent lines have the wallclock time of a big step.
The screen output starts with information from the Charm++ runtime system about the system topology. This can be used to verify that you are running on the expected number and size of compute nodes. Following the ChaNGa version stamp is a report of how ChaNGa has configured the nodes. This can differ from the Charm++ report if, for example, you run multiple processes on SMP nodes rather than one process per node.
Following the startup information, timing information is printed for each phase of the calculation. For each smallest timestep there is Domain Decomposition (assigning particles to TreePiece objects), Load Balancing (assigning those pieces to processors to balance the processor load), tree building, and the calculation of gravity and SPH. Before each large step, the distribution of timestep rungs is also printed.
One of the features of the Charm++ runtime system is the automatic load balancing framework, but there are circumstances where the framework does not work, and some tuning is needed to get good performance. The load balancer developed for ChaNGa outputs some diagnostics to indicate how the load balancer is performing. Consider the following diagnostics:
Orb3dLB_notopo stats: maxObjLoad 6.218762
Orb3dLB_notopo stats: minWall 24.518440 maxWall 24.580847 avgWall 24.551503 maxWall/avgWall 1.001195
Orb3dLB_notopo stats: minIdle 9.332150 maxIdle 15.885612 avgIdle 13.328586 minIdle/avgIdle 0.700161
Orb3dLB_notopo stats: minPred 7.976786 maxPred 16.010665 avgPred 10.970380 maxPred/avgPred 1.459445
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 92.000000 avgPiece 33.032258 maxPiece/avgPiece 2.785156
Orb3dLB_notopo stats: minBg 0.065539 maxBg 1.300723 avgBg 0.224308 maxBg/avgBg 5.798818
Orb3dLB_notopo stats: orb migrated 2435 refine migrated 0 objects
First, all values are in seconds. The maxObjLoad line says that the most computationally intensive piece of the tree is taking 6.2 seconds of CPU time. Comparing this with the maxWall time on the next line, we see that this is a significant fraction of one processor's time. The minPred and maxPred numbers say that the next step is predicted to be badly load balanced: some processors have twice as much work to do as others. The minPiece and maxPiece numbers say that the load balancer has moved work around by quite a bit: one processor has nearly 3 times the average number of pieces, while another has less than a tenth of the average. This indicates that perhaps using more pieces may help with the load balancing. Also, the fact that maxPred is significantly smaller than maxWall indicates that there is significant unused idle time, perhaps due to message latency.
- TreePiece.cpp:(.text+0x62f9): undefined reference to hilbert3d_double: you have changed the BIGKEYS configuration (``configure --enable-bigkeys'') without doing a make clean.
- My simulation runs slowly! Find out why: there is now a script "timings_sph.awk" or "timings_grav.awk" that will summarize where the time is being spent based on the diagnostic output. ``awk -f timings_sph.awk < DIAG'' will give you a summary report.
- My simulation is running out of memory! (The OOM killer!) There are a number of options that can reduce memory consumption. Note, however, that these are all usually memory/compute speed trade-offs. Use -maxbal X to keep the load balancer from going crazy, and -d 3 to lower cache sizes. For SMP runs, fewer processes per node (more threads per process, i.e., increase ++ppn) can help. For example, say you are running on 128 BW nodes; then your aprun command looks like:
aprun -N 4 -n 512 -d 8 ChaNGa ++ppn 7 .... *.param
That's 4 processes per node, which is probably optimal since that puts one process on each 8-core socket.
aprun -N 2 -n 256 -d 16 ChaNGa ++ppn 15 .... *.param
will put 2 processes on each node, each using 15 (+1 for communication) cores. Or
aprun -N 1 -n 128 -d 32 ChaNGa ++ppn 31 .... *.param
will put 1 process on each node, using 31 + 1 cores. This last option
uses the least memory, and is probably the slowest.
The key constraint for units is that ChaNGa requires G = 1. Hence, if you choose two out of the three possible dimensions, the third is always chosen for you. The two dimensions that are specifiable in the parameter file are mass, in "dMSolUnit" (solar masses), and length, in "dKpcUnit" (kiloparsecs). The time dimension is determined from these two. A simple example is from the solar system: if the mass unit is 1 solar mass ("dMSolUnit = 1") and the length unit is 1 AU ("dKpcUnit = 4.8481368e-09"), then the time unit is (1 year/(2 pi)), since the Earth moves one radian in one time unit. Another example is a convenient unit system for galactic dynamics: with a time unit of 1 gigayear and a length unit of 1 kiloparsec ("dKpcUnit = 1"), the mass unit is required to be 2.2222858e5 solar masses ("dMSolUnit = 2.2222858e5"). Conveniently, 1 kpc/gigayear = .9778 km/s.
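The derived time unit follows from G = 1: t_unit = sqrt(L_unit^3/(G M_unit)). A minimal sketch of this arithmetic is below (a stand-alone helper, not part of ChaNGa; only dMSolUnit and dKpcUnit are real parameter names):

```
#!/bin/sh
# Derive the time and velocity units implied by dMSolUnit and dKpcUnit, given G = 1.
DMSOLUNIT=2.2222858e5   # mass unit in solar masses (galactic example above)
DKPCUNIT=1              # length unit in kiloparsecs
awk -v m="$DMSOLUNIT" -v l="$DKPCUNIT" 'BEGIN {
    G    = 6.674e-8     # gravitational constant, cgs
    MSOL = 1.989e33     # grams per solar mass
    KPC  = 3.0857e21    # centimeters per kiloparsec
    YR   = 3.156e7      # seconds per year
    tunit = sqrt((l*KPC)^3 / (G * m*MSOL))   # time unit in seconds
    vunit = (l*KPC) / tunit / 1.0e5          # velocity unit in km/s
    printf "time unit     = %.4g yr\n", tunit/YR
    printf "velocity unit = %.4g km/s\n", vunit
}'
```

With the galactic values shown, this reproduces a time unit of about 1 gigayear and a velocity unit of about 0.978 km/s; with dMSolUnit = 1 and dKpcUnit = 4.8481368e-09 it gives the 1/(2 pi) year time unit of the solar system example.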
For cosmology, the Hubble parameter enters as a unit, so it needs to be consistent with the other unit choices, and with G = 1. Many choices are possible, but our convention is the following:
- Box size = 1 = length unit.
- Lengths and velocities are comoving.
- Velocities are peculiar, i.e. measured relative to the Hubble flow.
- The critical density of the Universe in simulation units is 1.
- The total mass in the box is rho/rho_crit = Omega_{matter} at z = 0.
- From the Friedmann equation, H_0^2 = 8 pi rho_crit / 3; hence, H_0 = sqrt(8 pi/3) = 2.894405 in simulation units.
- The velocity unit is obtained from (Hubble velocity across the box)/2.894405.
- The time unit is 2.894405/(Hubble parameter).
Omega0 = 1, box size = 100 Mpc at z = 0, h = 0.5 (just in case you are wondering: sigma8 and P(k) are not important):
- mass unit = 6.938e16 solar masses
- total mass in the box = 1 code mass unit
- box size = 100 Mpc/(1+z)
- velocity unit = (50*100/2.894405)/(1+z) = 1727.47/(1+z) km/s

Omega0 = 0.3 (with or without Lambda), box size = 100 Mpc at z = 0, h = 0.7:
- with H0 = 70 km/s/Mpc, the mass unit is (70/50)^2 * 6.938e16 = 13.6e16 Msol
- total mass in the box = Omega0 * 13.6e16 Msol = 4.08e16 Msol
- velocity unit = 70*100/2.894405 = 2418.46/(1+z) km/s

Omega0 = 0.3 (with or without Lambda), box size = 22.222 Mpc at z = 0, h = 0.7:
- with H0 = 70 km/s/Mpc, the mass unit is (22.222/100)^3 * (70/50)^2 * 6.938e16 Msol = 1.492e15 Msol
- total mass in the box = Omega0 * 1.492e15 Msol = 4.476e14 Msol
- velocity unit = 70*22.222/2.894405 = 537.43/(1+z) km/s
Note: once you have the units right for a given box and cosmology, the mass unit simply scales as boxsize^3 and the velocity unit is proportional to the boxsize for any other box size, as the sketch below illustrates.
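A minimal sketch of that scaling, using the 100 Mpc, h = 0.5 case above as the reference point (the script and its variable names are illustrative, not part of ChaNGa):

```
#!/bin/sh
# Scale the cosmological units above to an arbitrary box size and h.
BOX_MPC=100   # comoving box size in Mpc at z = 0
H=0.7         # Hubble parameter in units of 100 km/s/Mpc
OMEGA0=0.3    # matter density parameter
awk -v box="$BOX_MPC" -v h="$H" -v om="$OMEGA0" 'BEGIN {
    # Reference: a 100 Mpc box with h = 0.5 has a mass unit of 6.938e16 Msol.
    massunit = 6.938e16 * (h/0.5)^2 * (box/100.0)^3
    # Velocity unit = (Hubble velocity across the box)/2.894405, comoving.
    velunit = 100.0*h*box / 2.894405
    printf "mass unit         = %.4g Msol\n", massunit
    printf "total mass in box = %.4g Msol\n", om*massunit
    printf "velocity unit     = %.4g/(1+z) km/s\n", velunit
}'
```

Running it with the values shown reproduces the second example above.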
Here be Dragons.
Going from comoving distance to physical distance is straightforward: r_phys = a r_cm. Physical velocity requires including the Hubble flow: v_phys = a (v_cm + H r_cm), where H is the Hubble parameter at the given redshift. Beware that v_circular is calculated differently, and therefore scales differently: v_c_phys = v_c_cm/sqrt(a). Finally, sometimes you want to know the ratio of the density to the critical density (e.g. to calculate the virial radius). This depends on cosmology. For flat (critical) Lambda + matter Universes, we have rho/rho_c = rho_cm/(a^3 Omega_Lambda + Omega_matter), where the Omegas are present-day values and rho_c at the present is 1, as described above.
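A minimal sketch of these conversions (assuming a flat Lambda + matter cosmology for H(a); the script and its variable names are illustrative, not part of ChaNGa):

```
#!/bin/sh
# Convert comoving code quantities to physical ones at expansion factor a.
A=0.5          # expansion factor, a = 1/(1+z)
OMEGA_M=0.3    # present-day matter density parameter
OMEGA_L=0.7    # present-day cosmological constant density parameter
awk -v a="$A" -v om="$OMEGA_M" -v ol="$OMEGA_L" 'BEGIN {
    H0 = 2.894405                           # Hubble parameter at z = 0 in code units
    H  = H0 * sqrt(om/a^3 + ol)             # H(a) for a flat Lambda + matter Universe
    r_cm = 0.25; v_cm = 0.1; rho_cm = 200.0 # example comoving values in code units
    r_phys  = a * r_cm                      # physical distance
    v_phys  = a * (v_cm + H * r_cm)         # physical velocity, per the formula above
    overden = rho_cm / (a^3 * ol + om)      # rho/rho_crit at this redshift
    printf "r_phys = %g, v_phys = %g, rho/rho_crit = %g\n", r_phys, v_phys, overden
}'
```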