---
layout: post
title: Bonded forces in GROMACS and NB-LIB
subtitle: Rewiring the dataflow in GROMACS' implementation of bonded forces
date: 2020-06-11
tags: [PRACE, NB-LIB]
comments: true
---

The main purpose of NB-LIB was always to expose non-bonded force calculation capabilities. The "NB", after all, stands for "non-bonded". That didn't mean, however, that we, the NB-LIB team, got to ignore the bonded part. We still needed a way to obtain bonded forces for the systems set up with our newly developed topology functionality. One way or another.

## Making bonded forces available to NB-LIB

At the heart of it, calculating forces for different types of bonds is quite simple: given an array of input 3D coordinates, each bond needs a pair of indices to refer to the two participating particles, plus a handful of parameters like spring constants and equilibrium distances to compute the bond forces. This works just the same for forces due to angles and dihedrals. What these types of forces have in common is that they act on predefined sets of coordinates. That's why in GROMACS they are called "listed" interactions: there's a list which contains the indices along with the parameters for each bond, angle or dihedral. This list is normally constant during a simulation run and is thus considered part of the system topology.

NB-LIB implements a system topology and listed interactions are naturally a part of it. The data format looks like this:

```cpp
template <class Interaction>
struct InteractionData
{
    std::vector<Index<Interaction>> indices;
    std::vector<Interaction>        parameters;
};
```

For each type of interaction, we store the interaction indices plus the interaction parameters. An example for `Interaction` would be `HarmonicBond`, which looks like this:

```cpp
struct HarmonicBond : public std::tuple<real, real>
{
    real& forceConstant() { return std::get<0>(*this); }
    real& equilDistance() { return std::get<1>(*this); }
};
```

The `Index` traits class deduces to `std::array<int, 3>`, because for each harmonic bond, we need two `int`s for the coordinate indices and a third `int` to look up the bond parameters in the `parameters` vector. For angles and dihedrals, the `Index` trait adds an additional one or two `int`s to hold the additional coordinate indices. Finally, each type of interaction is stored in a `std::tuple`, such that the complete definition of all listed forces in the NB-LIB topology looks like this:

```cpp
using ListedInteractions = std::tuple<InteractionData<HarmonicBond>, ..., InteractionData<HarmonicAngle>, ...>;
```
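One possible way to realize such an `Index` trait is to derive the array size from the number of interaction centers. This is only a sketch: the `NCenter` helper and the `equilAngle` accessor below are illustrative names, not necessarily NB-LIB's actual implementation.

```cpp
#include <array>
#include <tuple>
#include <type_traits>

using real = float; // stand-in for the GROMACS real type

// the harmonic bond type from the text
struct HarmonicBond : public std::tuple<real, real>
{
    real& forceConstant() { return std::get<0>(*this); }
    real& equilDistance() { return std::get<1>(*this); }
};

// an illustrative 3-center type; the accessor names are hypothetical
struct HarmonicAngle : public std::tuple<real, real>
{
    real& forceConstant() { return std::get<0>(*this); }
    real& equilAngle()    { return std::get<1>(*this); }
};

// hypothetical helper: number of interaction centers per type
template<class Interaction> struct NCenter {};
template<> struct NCenter<HarmonicBond>  : std::integral_constant<std::size_t, 2> {};
template<> struct NCenter<HarmonicAngle> : std::integral_constant<std::size_t, 3> {};

// N coordinate indices plus one index into the parameter vector
template<class Interaction>
using Index = std::array<int, NCenter<Interaction>::value + 1>;

static_assert(std::is_same<Index<HarmonicBond>, std::array<int, 3>>::value,
              "a harmonic bond needs i, j and a parameter index");
static_assert(std::is_same<Index<HarmonicAngle>, std::array<int, 4>>::value,
              "an angle needs one extra coordinate index");
```

With this setup, adding a new interaction type only requires a parameter struct and an `NCenter` specialization; the index format follows automatically.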

 

## Listed interactions in GROMACS

How is the same data stored in GROMACS? The format looks like this:

```cpp
struct InteractionDefinitions
{
    std::vector<t_iparams> iparams;
    std::array<std::vector<int>, F_NRE> il;
};
```

This includes all interaction types, i.e. `t_iparams` is a union type which can hold the parameters of any interaction type. Thus `iparams` corresponds to the concatenation of the NB-LIB parameter vectors for all interaction types. The other member, called `il`, contains the indices for each interaction type, where `F_NRE` is the number of interaction types that GROMACS supports. Just like NB-LIB, GROMACS stores a vector of indices for each type. The only difference is that the `value_type` is always `int` for GROMACS, while for NB-LIB, it's `std::array<int, N+1>` for an N-center interaction type. In summary, this means that in GROMACS too, a harmonic bond is represented as three integers: two for the coordinates and a third to look up the bond parameters in a look-up table. But the fundamental difference is this: NB-LIB has a different (C++) type for each interaction type, while GROMACS uses a single (C++) type for all kinds of interactions.

## To convert or not to convert?

It seemed like a straightforward thing to implement. We were going to write a translation function to convert from `ListedInteractions` to GROMACS' `InteractionDefinitions`, call `calc_listed(const InteractionDefinitions& idef)` and voilà, there's our listed forces.

Pretty soon, however, we began to have some doubts. One concern was that such a conversion would rely on many implementation details, such as the specific position of each interaction type in the `il` array. By design, such a layout translation would be inherently fragile with respect to future changes in the underlying code and would create a maintenance burden.

Another issue was that `calc_listed`, the entry point to listed forces in GROMACS, had accumulated a lot of "baggage" over the years. For example, the argument signature of `calc_listed` contains a translation table to convert from (node-)local to global atom indices. This table gets passed all the way down to the kernel level where forces for individual bonds are calculated. Its only purpose is to translate the local index to the global one in the error message printed when the distance of a tabulated listed 1-4 interaction exceeds the table limit, and only if energies, virials or free energies are computed. Yet the lookup table has to be part of the function signature for all the other interaction types as well, because the function pointer table (see below) requires a uniform signature. Another example is the use of multi-purpose data structures, such as `t_forcerec`. Yes, `calc_listed` computes the listed forces, but where do these forces actually end up? It turns out there's one of those `t_forcerec`s in the argument list which holds a sub-object called `bonded_threading` that (among other things) has the force buffer we're looking for. The `t_forcerec` object also holds data and parameters related to non-bonded force computations, as well as various hardware-related information such as resources allocated for calculations on GPUs. In short, `t_forcerec` is used everywhere forces are computed, which, in an MD code, is basically everywhere.

This means that if we wanted to obtain listed forces in NB-LIB through `calc_listed`, the only realistic option would have been to create a `t_forcerec` object and fill it with dummy values, except for the parts actually relevant for bonded forces. At this point, we were asking ourselves:

> What if we rewired the dataflow to feed the NB-LIB interaction data directly to the force kernels in GROMACS?

In fact, we quickly realized that we'd be doing ourselves a big disservice if we just threw away the interaction types by which we'd already neatly grouped all the listed interactions. C++ compilers have become powerful code generators, but they operate on types. If we don't have different types, we forego the possibility of leveraging the compiler to generate separate code paths optimized and tailored to each listed interaction type.

## A type-aware approach to listed forces

In essence, the implementation of `calc_listed(const InteractionDefinitions& idef, ...)` in GROMACS looks like this:

```cpp
void calc_listed(const InteractionDefinitions& idef, ...)
{
    // manage timing and multi-threading

    for (int ftype = 0; ftype < F_NRE; ++ftype)
    {
        // branch out and descend the stack through two intermediate functions
        // based on the type of interaction that ftype corresponds to,
        // then call a function from a pointer table

        bondFunction* bonded = bondedInteractionFunctions[ftype];

        // compute all forces for ftype
        bonded(idef.iparams, idef.il[ftype], ...);
    }

    // reduce thread output
}
```

Since `ftype` is a runtime value, the whole interaction type dispatch that it controls has to be managed with branches and function pointer tables, i.e. the implementation above contains a lot of control flow logic of the form `if (someProperty(ftype)) {...}`. If, on the other hand, each interaction type is its own C++ type, we additionally have overloads available to steer control flow, and template instantiations to create separate code paths that don't need any runtime flow control anymore.

However, we can't simply loop over different C++ types. This is illustrated by the fact that we can't, for instance, write a plain for loop over the `ListedInteractions` tuple from above:

```cpp
for (int i = 0; i < interactions.size(); ++i)
{
    // error: i is not a compile time constant
    calc_one_type(std::get<i>(interactions));
}
```

About 20 years ago, when GROMACS was first written, the only practicable option would have been to manually unroll the tuple loop. Given that there are several dozen different types of interactions implemented in GROMACS, this would have been quite cumbersome, so the union data type and function pointer table approach of `InteractionDefinitions` was the only feasible solution.

Luckily, things have changed. Consider the following code.

```cpp
template<class Buffer>
auto reduceListedForces(const ListedInteractions& interactions,
                        const std::vector<gmx::RVec>& x,
                        Buffer* forces)
{
    std::array<real, std::tuple_size<ListedInteractions>::value> energies{};

    // calculate one bond type
    auto computeForceType = [forces, &x, &energies](const auto& ielem) {
        real energy = computeForces(ielem.indices, ielem.parameters, x, forces);
        energies[FindIndex<std::decay_t<decltype(ielem)>, ListedInteractions>{}] = energy;
    };

    // calculate all bond types and collect the energy for each type
    for_each_tuple(computeForceType, interactions);

    return energies;
}
```

With the help of a generic lambda and C++17's `std::apply` in the one-liner `for_each_tuple`, we can generate the loop over the different types in the tuple quite effortlessly. While `reduceListedForces` implements a loop over the interaction types, the next layer, `computeForces`, implements a loop over all interactions of a given type:

```cpp
template <class Index, class InteractionType, class Buffer>
real computeForces(const std::vector<Index>& indices,
                   const std::vector<InteractionType>& iParams,
                   const std::vector<gmx::RVec>& x,
                   Buffer* forces)
{
    real Epot = 0.0;

    for (const auto& index : indices)
    {
        Epot += dispatchInteraction(index, iParams, x, forces);
    }

    return Epot;
}
```

We're now down to the level of individual bonds, angles and dihedrals. At this point, the next steps depend on the actual type of the interaction. But instead of dispatching each harmonic bond, cubic bond, harmonic angle and so on to their separate paths just yet, we only differentiate based on the number of interaction centers for now. Through overload resolution, the appropriate version of `dispatchInteraction` gets called, such as this one for the case of 2-center interactions:

```cpp
template <class Buffer, class TwoCenterType>
std::enable_if_t<IsTwoCenter<TwoCenterType>::value, real>
dispatchInteraction(const InteractionIndex<TwoCenterType>& index,
                    const std::vector<TwoCenterType>& bondInstances,
                    const std::vector<gmx::RVec>& x,
                    Buffer* forces,
                    PbcHolder pbc)
{
    int i = std::get<0>(index);
    int j = std::get<1>(index);
    const gmx::RVec& x1 = x[i];
    const gmx::RVec& x2 = x[j];
    const TwoCenterType& bond = bondInstances[std::get<2>(index)];

    gmx::RVec dx;
    // calculate x1 - x2 modulo pbc
    pbc.dxAiuc(x1, x2, dx);
    real dr2 = dot(dx, dx);
    real dr  = std::sqrt(dr2);

    real force = 0.0, energy = 0.0;
    std::tie(force, energy) = bondKernel(dr, bond);

    // avoid division by 0
    if (dr2 != 0.0)
    {
        force /= dr;
        detail::spreadTwoCenterForces(force, dx, &(*forces)[i], &(*forces)[j]);
    }

    return energy;
}
```

We can see that coordinate retrieval, computation of the scalar distance and the spreading of the scalar part of the force to the two centers is actually shared between all the different types of 2-center interactions. The only remaining thing to do now is to call the actual kernel to compute the force. Since bond has a distinct type, we can again use overload resolution:

```cpp
template <class T>
auto bondKernel(T dr, const HarmonicBond& bond)
{
    return harmonicScalarForce(bond.forceConstant(), bond.equilDistance(), dr);
}
```

and call the actual kernel, which in its simplest form for a harmonic bond looks like this:

```cpp
template <class T>
std::tuple<T, T> harmonicScalarForce(T k, T x0, T x)
{
    T dx  = x - x0;
    T dx2 = dx * dx;

    T force = -k * dx;
    T epot  = 0.5 * k * dx2;

    return std::make_tuple(force, epot);

    /* That was 6 flops */
}
```

That's it! Not a single `if (ftype)` required. From here on, we just had to add a separate dispatch for the 3- to 5-center interactions and add the type-aware wrappers for all the different kernels implemented in GROMACS.

What have we achieved? We've

- avoided future maintenance headaches by not writing a fragile translation layer
- routed the listed force data flow through type-aware channels with
- maximum code reuse and
- control logic already baked into the type dispatch

And all of the above while leveraging the code from GROMACS that actually matters: the force kernels that implement all the physics!

## Extra toppings

We could grab a cup of coffee now and focus again on the non-bonded part, which is NB-LIB's main objective, and we should! But what if, like at the time of writing this, it's a rainy holiday and we know of nothing better to do than to continue experimenting a bit? Should we vectorize the code? And then there's also an idea that Berk Hess had, but never got around to trying out: instead of processing all types of interactions separately, when an interaction involves coordinates i and j, one should also evaluate all the other types of interactions that involve these two coordinates. It seems we now have just the right tools to give that a shot.

### A quick roofline analysis

To address the first question, consider a simple harmonic bond. In the GROMACS source, we can see that each one costs 59 flops, which drops to 46 if we use the version above that doesn't do free energy perturbation. How many bytes of data do we have to access for that? Assuming single precision:

- 6 floats = 24 bytes to load the two coordinates
- 6 floats = 24 bytes to store the two forces
- 3 ints = 12 bytes for the i, j, and parameter index
- 2 floats = 8 bytes for the two bond parameters

Total: 68 bytes. A Skylake core, for example, has two AVX-512 units. Thus, it can perform up to 4 flops/cycle for non-vectorized code and up to 64 flops/cycle if both units perform 512-bit FMA instructions in single precision; assuming a clock of 2.5 GHz, that's 10 Gflop/s in scalar mode and 160 Gflop/s vectorized. How much bandwidth would we need to sustain these processing speeds? According to the roofline model, performance P = βB, where β is the operational intensity in flops/byte and B denotes the bandwidth. For our example, β = 46/68 ≈ 0.68, which means we need a bandwidth of about 1.5 GB/s for every Gflop/s: 15 GB/s for the scalar case and 240 GB/s for AVX-512.

The bottom line of these metrics is that if the number of atoms per CPU core is so small that the coordinate and force buffers fit into the L2 cache, it makes sense to vectorize the code. If the bond force computations are served from main memory, memory bandwidth is the bottleneck, and if the data fits into L2, bandwidth is again the bottleneck if the code is vectorized (L2 data bandwidth on Skylake is 64 B/cycle).

### Synthetic aggregate interaction types

Let's consider an angle between particles at positions i, j and k. Then there are almost certainly two bonds i-j and j-k as well. After computing the angle forces, when the three coordinates have already been fetched from memory, we could also calculate those two bond forces, reusing the input coordinates and storing back the sum of all the forces. If we managed to do that, the two bonds would essentially get a free ride in a bandwidth-limited scenario.

One way of implementing this would be through an additional, synthetic aggregate interaction type:

```cpp
template<class Angle, class Bond>
struct ThreeCenterAggregate
{
    Angle angle;
    Bond  bonds[2];
};
```

We can just append this type to the `ListedInteractions` tuple and send it down the pipeline like all the other interaction types. At the `dispatchInteraction` level, we can then plug together the angle and bond components. The routines for that are already lying around, and we can select the right kernels based on the `Angle` and `Bond` types.

At setup time, we will have to go through all bonds and check whether we can find a matching angle, at a cost of N log M, where N is the number of bonds and M the number of angles. Matched interactions are then subtracted from the `Angle` and `Bond` lists and added to the list for `ThreeCenterAggregate<Angle, Bond>`. The best place to do this is in a `ListedCalculator` object. Its constructor receives the `ListedInteractions` from the topology and then performs the aggregate transformations. That way, aggregates remain an implementation detail of `ListedCalculator`: both the topology above and the whole `reduceListedForces` pipeline below `ListedCalculator` are oblivious to whether it chose to populate any aggregates or not.

## Some remainders

There are a couple of additional aspects to implementing listed forces, such as multi-threading, which we've already implemented, and calculating virials and free energy perturbation (FEP).

### Free energies

The calculation of free energies, or more precisely, of the derivative of the potential energy with respect to λ, which interpolates between two system states A and B, deserves a mention here. In GROMACS, each interaction type that supports FEP actually has two sets of parameters: one for the A state and another for the B state. The additional memory for the B state is always there in GROMACS, whether the user wants to do FEP or not.

Instead of adding a second B state to each interaction, one could also envisage having two separate interactions for the A and B states. The advantage of that approach would be that the B state could simply be absent if the user doesn't want to do FEP. We'd also be saving a lot of code, because we could just build two separate topologies for A and B. They would then each supply a `ListedInteractions` tuple, and we could add an overload to `reduceListedForces` that accepts a second B tuple plus the λ parameters. The only additional ingredients needed are force kernels that take the second parameter set and λ as inputs, so that the signature looks like this:

```cpp
template <class T>
std::tuple<T, T, T> harmonicScalarForce(T kA, T kB, T xA, T xB, T x, T lambda);
// return is <F, V, dV/dl>
```

Of course, these kernels already exist in GROMACS.

With that, I'd like to conclude and thank the whole NB-LIB team for the productive last two weeks.