
Unnecessary modulo operations in gf<cyclic_lattice> bracket accessor #725

Closed

Stefan-Dienst opened this issue Jul 14, 2019 · 4 comments

Stefan-Dienst commented Jul 14, 2019

Dear all,

I noticed a speed-up while refactoring which I don't understand. Specifically, I measured this code block:

for (const auto [t, r] : triqs::utility::product(tmesh, rmesh)) {
  for (auto [A, a, B, b] : Gamma_pp_dyn_tr.target_indices())
    delta_tr_out[t, r](a, b) += -Gamma_pp_dyn_tr[t, r](A, a, B, b) * F_tr[t, r](A, B);
}

Here all objects are one- or two-particle GFs in imaginary time and real space.
I split the loop in two and created temporary GFs in imaginary time.

auto _ = all_t{};
for (const auto r : rmesh) {

  auto delta_t = make_gf<imtime>(tmesh, delta_tr_out.target());
  auto Gamma_pp_dyn_t = make_gf<imtime>(tmesh_gamma, Gamma_pp_dyn_tr.target());
  auto F_t = make_gf<imtime>(tmesh, F_tr.target());

  Gamma_pp_dyn_t = Gamma_pp_dyn_tr[_, r];
  F_t = F_tr[_, r];

  for (const auto t : tmesh) {
    for (auto [A, a, B, b] : Gamma_pp_dyn_tr.target_indices())
      delta_t[t](a, b) += -Gamma_pp_dyn_t[t](A, a, B, b) * F_t[t](A, B);
  }
  delta_tr_out[_, r] = delta_t;
}

This version runs approximately 8 times as fast as the previous one, and I tested this for different numbers of r-points.

Why is there such a big difference in computation time between those two versions, simply from using temporaries?

Btw, I am using TRIQS at hash 331dd46.

Stefan-Dienst (Author) commented:

As a follow-up, here are four isolated minimal functions that I tested.

  • Looping over time, space and orbitals, without temporaries
g_tr_t minimal_example_1(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  for (const auto t : tmesh) {
    for (const auto r : rmesh) {
      for (auto [A, a, B, b] : gamma_tr.target_indices()) {
        delta_tr[t, r](a, b) += -gamma_tr[t, r](A, a, b, B) * F_tr[t, r](A, B);
      }
    }
  }
  return delta_tr;
}
  • Looping over time, space BUT NOT orbitals, without temporaries
g_tr_t minimal_example_2(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  for (const auto t : tmesh) {
    for (const auto r : rmesh) {
      delta_tr[t, r](0, 0) += -gamma_tr[t, r](0, 0, 0, 0) * F_tr[t, r](0, 0);
    }
  }
  return delta_tr;
}
  • Looping over time, space and orbitals, WITH temporaries
g_tr_t minimal_example_3(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  auto tmesh_gamma = std::get<0>(gamma_tr.mesh());
  auto _ = all_t{};
  for (const auto r : rmesh) {

    auto delta_t = make_gf<imtime>(tmesh, delta_tr.target());
    auto gamma_t = make_gf<imtime>(tmesh_gamma, gamma_tr.target());
    auto F_t = make_gf<imtime>(tmesh, F_tr.target());

    for (const auto t : tmesh) {
      for (auto [A, a, B, b] : gamma_tr.target_indices()) {
        delta_t[t](a, b) += -gamma_t[t](A, a, b, B) * F_t[t](A, B);
      }
    }
    delta_tr[_, r] = delta_t;
  }
  return delta_tr;
}
  • Looping over time, space BUT NOT orbitals, WITH temporaries
g_tr_t minimal_example_4(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  auto tmesh_gamma = std::get<0>(gamma_tr.mesh());
  auto _ = all_t{};
  for (const auto r : rmesh) {

    auto delta_t = make_gf<imtime>(tmesh, delta_tr.target());
    auto gamma_t = make_gf<imtime>(tmesh_gamma, gamma_tr.target());
    auto F_t = make_gf<imtime>(tmesh, F_tr.target());

    for (const auto t : tmesh) {
        delta_t[t](0, 0) += -gamma_t[t](0, 0, 0, 0) * F_t[t](0, 0);
    }
    delta_tr[_, r] = delta_t;
  }
  return delta_tr;
}

Calling those functions in a small Python script:

from pytriqs.gf import *
from triqs_tprf.lattice import minimal_example_1, minimal_example_2, minimal_example_3, minimal_example_4

BETA = 25
NR = 24
NT = 200
NORB = 3

tmesh = MeshImTime(beta=BETA, S='Fermion', n_max=NT)
tmesh_gamma = MeshImTime(beta=BETA, S='Boson', n_max=NT)

rmesh = MeshCyclicLattice(NR, NR, NR)

F_tr = Gf(mesh=MeshProduct(tmesh, rmesh), target_shape=(NORB,)*2)
gamma_tr = Gf(mesh=MeshProduct(tmesh_gamma, rmesh), target_shape=(NORB,)*4)

functions = [minimal_example_1, minimal_example_2, minimal_example_3, minimal_example_4]

for function in functions:
    function(gamma_tr, F_tr)

and profiling it using cProfile, I get the following output.

tottime    filename:lineno(function)
17.943   {triqs_tprf.lattice.minimal_example_1}
2.037    {triqs_tprf.lattice.minimal_example_3}
1.042    {triqs_tprf.lattice.minimal_example_4}
0.798    {triqs_tprf.lattice.minimal_example_2}
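(For reference, the exact profiling command is not stated here; output in this form can be produced with an invocation along the lines of python -m cProfile -s tottime <script>.py, where <script>.py stands for the script above.)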

Versions 2 and 4 should obviously be faster, since they have no loop over the orbitals. But here you can see that the difference between versions 1 and 3 is immense.

Using NR=40 and NORB=1 gives:

tottime    filename:lineno(function)
2.132    {triqs_tprf.lattice.minimal_example_1}
2.018    {triqs_tprf.lattice.minimal_example_2}
1.926    {triqs_tprf.lattice.minimal_example_3}
1.783    {triqs_tprf.lattice.minimal_example_4}

where the computation times of all versions are similar.
So this speed-up from creating temporaries only appears when there are multiple orbitals.

Stefan-Dienst (Author) commented:

After profiling those minimal examples together with @Wentzell, we found that a lot of time in minimal_example_1 was spent in modulo operations.

Changing the line in cluster_mesh.hpp

linear_index_t index_to_linear(index_t const &i) const { return _modulo(i[0], 0) * s2 + _modulo(i[1], 1) * s1 + _modulo(i[2], 2); }

to

linear_index_t index_to_linear(index_t const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }

i.e. removing the unnecessary modulo operations, and then re-running the test from above, this time calling every function 20 times, yields

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20   47.785    2.389   47.787    2.389 {triqs_tprf.lattice.minimal_example_1}
       20   38.317    1.916   38.319    1.916 {triqs_tprf.lattice.minimal_example_3}
       20   14.858    0.743   14.860    0.743 {triqs_tprf.lattice.minimal_example_4}
       20    8.697    0.435    8.698    0.435 {triqs_tprf.lattice.minimal_example_2}

giving run times that are at least roughly in line with expectations.
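To illustrate why these modulo operations are so costly in the hot loop, here is a minimal standalone sketch (not TRIQS code; all names such as cyclic_linearizer and positive_modulo are made up for this example) that compares the two linearization variants shown above. The key point is that i % n with a runtime divisor involves an integer division, which is far more expensive than the plain multiply-add:

// Standalone sketch (not TRIQS code): compares index linearization with and
// without per-component modulo, mimicking the two versions of index_to_linear.
#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

using index_t = std::array<long, 3>;

struct cyclic_linearizer {
  long dims[3]; // extent in each direction
  long s1, s2;  // strides: s1 = dims[2], s2 = dims[1] * dims[2]

  // Wrap an index component into [0, dims[d]); also handles negative values.
  long positive_modulo(long i, int d) const {
    long r = i % dims[d];
    return r < 0 ? r + dims[d] : r;
  }

  // Old behaviour: modulo on every component, even if it is already in range.
  long to_linear_mod(index_t const &i) const {
    return positive_modulo(i[0], 0) * s2 + positive_modulo(i[1], 1) * s1 + positive_modulo(i[2], 2);
  }

  // Proposed behaviour: assume the index is already inside the cluster.
  long to_linear_plain(index_t const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }
};

int main(int argc, char **argv) {
  // Take N from the command line so the divisor is not a compile-time constant.
  long const N = argc > 1 ? std::atol(argv[1]) : 24;
  cyclic_linearizer lin{{N, N, N}, N, N * N};
  std::vector<double> data(N * N * N, 1.0);

  auto bench = [&](char const *label, auto linearize) {
    double acc = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 200; ++rep)
      for (long x = 0; x < N; ++x)
        for (long y = 0; y < N; ++y)
          for (long z = 0; z < N; ++z) acc += data[linearize(index_t{x, y, z})];
    auto t1 = std::chrono::steady_clock::now();
    std::cout << label << ": " << std::chrono::duration<double>(t1 - t0).count() << " s (acc = " << acc << ")\n";
  };

  bench("with modulo   ", [&](index_t const &i) { return lin.to_linear_mod(i); });
  bench("without modulo", [&](index_t const &i) { return lin.to_linear_plain(i); });
}

Compiled with optimizations, the modulo variant is expected to be noticeably slower, since integer division by a runtime divisor costs many more cycles than a multiply-add; this is consistent with the timings reported in this issue.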

Wentzell (Member) commented Jul 18, 2019

Hey Stefan,

Thank you for checking the timings without the modulo operations. It looks like the timings are now
much more on par. Unfortunately, the proper fix will have to go further than the change in cluster_mesh.hpp, as other codes seem to rely on this improper behavior.

We should keep this issue open until this has been addressed.
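One conceivable direction, sketched below purely for illustration (a hypothetical interface, not the actual TRIQS API and not necessarily what the eventual fix does), is to keep the bracket accessor's linearization modulo-free and expose the wrapping as an explicit, separate operation that callers relying on out-of-range indices must request:

// Sketch only -- hypothetical split of the cluster-mesh interface.
#include <array>

struct cluster_mesh_sketch {
  long dims[3];
  long s1, s2; // strides as in index_to_linear above

  // Fast path: assumes the index is already inside the cluster, no modulo.
  long index_to_linear(std::array<long, 3> const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }

  // Explicit wrapping for callers that pass out-of-range (e.g. negative) indices.
  std::array<long, 3> index_modulo(std::array<long, 3> const &i) const {
    auto wrap = [](long v, long n) { long r = v % n; return r < 0 ? r + n : r; };
    return {wrap(i[0], dims[0]), wrap(i[1], dims[1]), wrap(i[2], dims[2])};
  }
};

// Code relying on the old wrapping behaviour would then call
//   m.index_to_linear(m.index_modulo(i));
// while mesh iteration (always in range) pays only the multiply-add.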

Wentzell changed the title from "8x speed up by using temporary GFs in nested loop" to "Unnecessary modulo operations in gf<cyclic_lattice> bracket accessor" on Jul 18, 2019
Wentzell added a commit to Wentzell/triqs that referenced this issue on Aug 27, 2019
Wentzell added a commit that referenced this issue on Aug 30, 2019
Wentzell (Member) commented:

This is fixed by #744

Stefan-Dienst added a commit to TRIQS/tprf that referenced this issue on Aug 19, 2020