
Optimize data structures for performance without losing readability/flexibility. #716

@pcarruscag

Having spent some time on implementation details of linear algebra topics, and a bit of time on small algorithmic improvements to a few numerics classes, I would now like to focus on another performance bottleneck we have. One that has always seemed (to me) quite daunting to tackle due to the number of changes it would require, but recently I had a silly idea to reduce the invasiveness of the "procedure".

Discontiguous data structures are absolute performance killers for low-order FVM.
(skip this paragraph if this is obvious)
What data structures? All the small ones the solver needs to "visit" to compute residuals.
Why? Mostly because they destroy locality: the useful data stored by consecutive nodes is very far apart. Modern CPUs try to hide the latency of accessing main memory (RAM) by caching frequently used areas of that memory, and this effort is very easily undermined by accessing memory at random, or predictably but at very large strides.
Let me give you some numbers: for CNSVariable in 2D the average distance between the primitives of consecutive nodes is 200 doubles, or 1600 bytes. "What?! Is the compiler dumb?" No, virtual methods and the pointer members of a deep class hierarchy take up quite a bit of space: sizeof(CNSVariable) is 424 bytes before it allocates any data.
And that can only get worse as more functionality is added to the code.
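As a toy illustration of where those 424 bytes can come from (the numbers below are made up to match, this is not CNSVariable's actual layout):

#include <cstdio>

struct Toy
{
    virtual ~Toy() {}     // virtual methods add a vtable pointer (8 bytes on x86-64)
    double* members[52];  // the many pointer members a deep hierarchy accumulates
};

int main()
{
    std::printf("%zu\n", sizeof(Toy)); // prints 424 on a typical 64-bit build
    return 0;
}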

This is one of those "well known" things, at least in a qualitative way, but let us try to quantify the cost using an approximation of our typical use case scenario.
Spoiler alert: it is up to one order of magnitude; if this does not surprise you, feel free to skip straight to "Proposed solution".

We will have a variable class that stores some data with some overhead (the 200 doubles):

class VariableBase
{
  protected:
    double* primary;
  public:
    VariableBase(int nVar, int overhead);
    virtual ~VariableBase() {delete [] primary;}
    inline double* GetPrimary() {return primary;}
    inline virtual double* GetSecondary() {return nullptr;}
};

VariableBase::VariableBase(int nVar, int overhead)
{
    // the "overhead" doubles are allocated but never touched, they only
    // put distance between the data of consecutive nodes
    primary = new double [nVar+overhead];
    for(int i=0; i<nVar; ++i) primary[i] = 1.0;
}

All this does is give access to some primary variables held by the base class and allow derived classes to store their own secondary variables, e.g.:

class VariableDerived : public VariableBase
{
  private:
    double* secondary;
  public:
    VariableDerived(int nVar, int overhead) : VariableBase(nVar, overhead)
    {
        // e.g. give the secondary data the same layout as the primary
        secondary = new double [nVar+overhead];
        for(int i=0; i<nVar; ++i) secondary[i] = 1.0;
    }
    ~VariableDerived() {delete [] secondary;}
    inline double* GetSecondary() override {return secondary;}
};

Then let's say we have some dummy numerics classes.

class NumericsBase
{
  public:
    virtual ~NumericsBase() {}
    inline virtual double ComputeResidual(const double* a, const double* b) {return 0.0;}
};

class NumericsDerived : public NumericsBase
{
  private:
    int n;
  public:
    NumericsDerived(int nVar) : n(nVar) {}
    double ComputeResidual(const double* a, const double* b) override;
};

NumericsDerived simply computes the dot product of a and b.
The use case is to loop over edges that connect nodes a and b passing the data of those nodes to ComputeResidual and using the result to update an array.
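The post only states what it computes; a minimal definition consistent with "dot product of a and b":

double NumericsDerived::ComputeResidual(const double* a, const double* b)
{
    double r = 0.0;
    for(int i=0; i<n; ++i) r += a[i]*b[i];
    return r;
}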

VariableBase** node = new VariableBase* [nNode];
for(int i=0; i<nNode; ++i)
    node[i] = new VariableDerived(nVar,overhead);
...
for(int i=0; i<nEdge; ++i)
{
    int p0 = connectivity[i].first;
    int p1 = connectivity[i].second;

    double r = numerics->ComputeResidual(node[p0]->GetPrimary(), node[p1]->GetPrimary());

    residual[p0] += r;
    residual[p1] -= r;
}

I hope that looks familiar.
connectivity is a vector<pair<int,int> >, so nicely contiguous (in SU2 this information comes from an array of CEdge*). Note that we are never talking about sequential access here; this level of indirection will always be present.
To create a somewhat realistic access pattern we create 2 edges per node, connecting it to nodes ± some distance away; this distance is small for the end nodes and largest for the middle node (I assumed the maximum distance to be sqrt(nNode)). This should vaguely resemble an RCM-ordered 2D connectivity.
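One way to build such a pattern (my reconstruction of the description above, not the exact benchmark code):

#include <vector>
#include <utility>
#include <cmath>
#include <algorithm>

std::vector<std::pair<int,int> > BuildConnectivity(int nNode)
{
    std::vector<std::pair<int,int> > connectivity;
    connectivity.reserve(2*nNode);
    const int maxDist = int(std::sqrt(double(nNode)));
    for(int i=0; i<nNode; ++i)
    {
        // stride grows from 1 at the end nodes to ~sqrt(nNode) at the middle
        double frac = 1.0 - std::abs(2.0*double(i)/nNode - 1.0);
        int dist = std::max(1, int(frac*maxDist));
        connectivity.push_back(std::make_pair(i, std::min(i+dist, nNode-1)));
        connectivity.push_back(std::make_pair(i, std::max(i-dist, 0)));
    }
    return connectivity;
}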
We will measure the cost of this edge loop relative to using a row-major storage of our variables.

double* solution = new double [nNode*nVar];
for(int i=0; i<nNode*nVar; ++i) solution[i] = 1.0;
...
    double r = numerics->ComputeResidual(&solution[p0*nVar], &solution[p1*nVar]);
...

OK, so we run this for 1000000 nodes, with 4 variables per node and an overhead (or "space between variables") of 200, and we see that the CVariable-like layout is 8.4 times slower.
But we can actually confuse the CPU further by accessing the variables through the virtual method, i.e.

...
    double r = numerics->ComputeResidual(node[p0]->GetSecondary(), node[p1]->GetSecondary());
...

and now we measure our loop to be 10 times slower.
If we remove all the overhead the loop is only 1.5 and 1.9 times slower, respectively; it is really the enormous distance between the data of consecutive nodes that kills performance.
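The slowdowns above are ratios of wall times; a minimal way to take such timings (my sketch, not the actual benchmark code, assuming the loops are wrapped in callables):

#include <chrono>

template<class Loop>
double TimeMs(Loop&& loop)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    loop();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1-t0).count();
}

// slowdown = TimeMs(edgeLoopOverNodeObjects) / TimeMs(edgeLoopOverRowMajorArray)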

Proposed solution
Now, I would like to get better performance, but I do not want to re-write the entire code, and I recognize the usefulness of polymorphism and that encapsulating data keeps the code more manageable and readable.
To that end I propose that CVariable should allocate the data for all nodes, so CSolver would have one CVariable and not one per node, e.g.:

class DenseVariableBase
{
  protected:
    double* primary;
    int n, i;
  public:
    DenseVariableBase(int nVar, int nNode) : n(nVar), i(0)
    {
        primary = new double [nVar*nNode];
        ...
    }
    virtual ~DenseVariableBase() {delete [] primary;}
    inline DenseVariableBase* operator[] (int iNode)
    {
        i = iNode;   // remember which node is being accessed
        return this; // so that other methods can be chained after []
    }
    inline double* GetPrimary() {return &primary[i*n];}
    inline virtual double* GetSecondary() {return nullptr;}
};

To keep the same semantics of accessing variables, node[iNode]->Get() (and thus avoid changing the entire code), the class now keeps track of which node we are accessing (member variable i) and all its methods take this into account, see for example GetPrimary above.
We tell the class what node we want via operator [], which returns a pointer to the object itself so that we can call other methods on top of [].
This does not exactly recover the same interface, and it can create some bugs because one would be able to call methods without knowing what index they are at (it is also not thread-safe, but that is not a problem for now).
So the final piece to recover the interface, and to shield against direct calls to methods without specifying the index, is a lightweight wrapper.

class DenseWrapper
{
  private:
    DenseVariableBase* nodes;
  public:
    DenseWrapper(int nVar, int nNodes) : nodes(new DenseVariableBase(nVar, nNodes)) {}
    ~DenseWrapper() {delete nodes;}
    inline DenseVariableBase* operator[] (int iNode)
    {
        return (*nodes)[iNode];
    }
};

With this there is no way of getting to the variables without going through [] first.
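To illustrate how little the calling code changes, the edge loop from before works as-is once node is a DenseWrapper (sketch):

DenseWrapper node(nVar, nNode);
...
for(int i=0; i<nEdge; ++i)
{
    int p0 = connectivity[i].first;
    int p1 = connectivity[i].second;

    double r = numerics->ComputeResidual(node[p0]->GetPrimary(), node[p1]->GetPrimary());

    residual[p0] += r;
    residual[p1] -= r;
}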
If we run this code we see that it is only 1.08 and 1.57 times slower for the non-virtual and virtual accessors, respectively.
This could be an intermediate solution while the CVariable methods are migrated to allow passing the index as an argument, e.g. field->Get(iNode), and the code is updated gradually.
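For instance, the migrated interface could look something like this (hypothetical end state, names illustrative):

class DenseField
{
  private:
    double* data;
    int nVar;
  public:
    DenseField(int nVar_, int nNode) : data(new double [nVar_*nNode]), nVar(nVar_) {}
    ~DenseField() {delete [] data;}
    inline double* Get(int iNode) {return &data[iNode*nVar];}
};

// the edge loop would then read:
// double r = numerics->ComputeResidual(field->Get(p0), field->Get(p1));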

However, I think the wrapper could provide a way of dealing with the loss of versatility that contiguous storage creates...
There are no free lunches, and this performance improvement would come at the cost of losing the ability to have different types of variables in different parts of the domain. I think currently only the elasticity solver uses this, to keep "boundary" and "internal" variables.
If this versatility is really required, the wrapper can keep the special variables separately; it would know which nodes are special via a boolean vector, and know how to access them via a hash table (for example), as sketched below.
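A rough sketch of that idea (hypothetical design, reusing the classes above; ownership of the special objects is left out for brevity):

#include <vector>
#include <unordered_map>

class DenseWrapperSpecial
{
  private:
    DenseVariableBase* nodes;                             // contiguous bulk storage
    std::vector<bool> isSpecial;                          // flags the exceptional nodes
    std::unordered_map<int, DenseVariableBase*> special;  // one-node objects, possibly of another type
  public:
    DenseWrapperSpecial(int nVar, int nNodes) :
        nodes(new DenseVariableBase(nVar, nNodes)), isSpecial(nNodes, false) {}

    // mark iNode as special and give it its own (possibly derived) variable object
    void SetSpecial(int iNode, DenseVariableBase* var)
    {
        isSpecial[iNode] = true;
        special[iNode] = var;
    }

    inline DenseVariableBase* operator[] (int iNode)
    {
        if(isSpecial[iNode]) return (*special[iNode])[0]; // single-node object, index 0
        return (*nodes)[iNode];
    }
};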

Alternatives
We could make the storage contiguous by using a memory pool, i.e. allocate outside the class and give each object a pointer to its memory location. This is still not truly contiguous, as the pointers the objects hold are variables themselves and so they would still be separated (albeit by a smaller, though still huge, distance); thus the amount of overhead would still influence performance.
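A sketch of what that would look like (assuming a hypothetical VariableDerived constructor that takes its slice of the pool instead of allocating):

double* pool = new double [nNode*(nVar+overhead)];
VariableBase** node = new VariableBase* [nNode];
for(int i=0; i<nNode; ++i)
    node[i] = new VariableDerived(nVar, &pool[i*(nVar+overhead)]);
// the data is now contiguous, but the objects themselves, with their
// vtable and member pointers, still live wherever operator new put them,
// which is why this does not fully solve the locality problem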

Right, that is enough writing... please let me know what you think.
