
Unnecessary modulo operations in gf<cyclic_lattice> bracket accessor #725

Closed

Stefan-Dienst opened this issue Jul 14, 2019 · 4 comments

Stefan-Dienst commented Jul 14, 2019

Dear all,

I noticed a speed-up while refactoring which I don't understand. Specifically, I measured this code block:

for (const auto [t, r] : triqs::utility::product(tmesh, rmesh)) {
  for (auto [A, a, B, b] : Gamma_pp_dyn_tr.target_indices())
    delta_tr_out[t, r](a, b) += -Gamma_pp_dyn_tr[t, r](A, a, B, b) * F_tr[t, r](A, B);
}

Here all objects are one- or two-particle GFs in imaginary time and real space.
I split the loop in two and created temporary GFs in imaginary time.

auto _ = all_t{};
for (const auto r : rmesh) {

  auto delta_t = make_gf<imtime>(tmesh, delta_tr_out.target());
  auto Gamma_pp_dyn_t = make_gf<imtime>(tmesh_gamma, Gamma_pp_dyn_tr.target());
  auto F_t = make_gf<imtime>(tmesh, F_tr.target());

  Gamma_pp_dyn_t = Gamma_pp_dyn_tr[_, r];
  F_t = F_tr[_, r];

  for (const auto t : tmesh) {
    for (auto [A, a, B, b] : Gamma_pp_dyn_tr.target_indices())
      delta_t[t](a, b) += -Gamma_pp_dyn_t[t](A, a, B, b) * F_t[t](A, B);
  }
  delta_tr_out[_, r] = delta_t;
}

This version runs approximately 8 times as fast as the previous one, and I tested this for different numbers of r-points.

Why is there such a big difference in computation time between those two versions, simply from using temporaries?

Btw, I am using TRIQS at hash 331dd46.

Stefan-Dienst (Author) commented:

As a follow-up, here are four isolated minimal functions that I tested.

  • Looping over time, space and orbitals, without temporaries
g_tr_t minimal_example_1(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  for (const auto t : tmesh) {
    for (const auto r : rmesh) {
      for (auto [A, a, B, b] : gamma_tr.target_indices()) {
        delta_tr[t, r](a, b) += -gamma_tr[t, r](A, a, b, B) * F_tr[t, r](A, B);
      }
    }
  }
  return delta_tr;
}
  • Looping over time, space BUT NOT orbitals, without temporaries
g_tr_t minimal_example_2(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  for (const auto t : tmesh) {
    for (const auto r : rmesh) {
      delta_tr[t, r](0, 0) += -gamma_tr[t, r](0, 0, 0, 0) * F_tr[t, r](0, 0);
    }
  }
  return delta_tr;
}
  • Looping over time, space and orbitals, WITH temporaries
g_tr_t minimal_example_3(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  auto tmesh_gamma = std::get<0>(gamma_tr.mesh());
  auto _ = all_t{};
  for (const auto r : rmesh) {

    auto delta_t = make_gf<imtime>(tmesh, delta_tr.target());
    auto gamma_t = make_gf<imtime>(tmesh_gamma, gamma_tr.target());
    auto F_t = make_gf<imtime>(tmesh, F_tr.target());

    for (const auto t : tmesh) {
      for (auto [A, a, B, b] : gamma_tr.target_indices()) {
        delta_t[t](a, b) += -gamma_t[t](A, a, b, B) * F_t[t](A, B);
      }
    }
    delta_tr[_, r] = delta_t;
  }
  return delta_tr;
}
  • Looping over time, space BUT NOT orbitals, WITH temporaries
g_tr_t minimal_example_4(chi_tr_vt gamma_tr, g_tr_vt F_tr) {

  auto delta_tr = make_gf(F_tr); 
  delta_tr *= 0;

  auto [tmesh, rmesh] = F_tr.mesh();
  auto tmesh_gamma = std::get<0>(gamma_tr.mesh());
  auto _ = all_t{};
  for (const auto r : rmesh) {

    auto delta_t = make_gf<imtime>(tmesh, delta_tr.target());
    auto gamma_t = make_gf<imtime>(tmesh_gamma, gamma_tr.target());
    auto F_t = make_gf<imtime>(tmesh, F_tr.target());

    for (const auto t : tmesh) {
        delta_t[t](0, 0) += -gamma_t[t](0, 0, 0, 0) * F_t[t](0, 0);
    }
    delta_tr[_, r] = delta_t;
  }
  return delta_tr;
}

Calling those functions in a small Python script:

from pytriqs.gf import *
from triqs_tprf.lattice import minimal_example_1, minimal_example_2, minimal_example_3, minimal_example_4

BETA = 25
NR = 24
NT = 200
NORB = 3

tmesh = MeshImTime(beta=BETA, S='Fermion', n_max=NT)
tmesh_gamma = MeshImTime(beta=BETA, S='Boson', n_max=NT)

rmesh = MeshCyclicLattice(NR, NR, NR)

F_tr = Gf(mesh=MeshProduct(tmesh, rmesh), target_shape=(NORB,)*2)
gamma_tr = Gf(mesh=MeshProduct(tmesh_gamma, rmesh), target_shape=(NORB,)*4)

functions = [minimal_example_1, minimal_example_2, minimal_example_3, minimal_example_4]

for function in functions:
    function(gamma_tr, F_tr)

and profiling it using cProfile, I get the following output.

tottime    filename:lineno(function)
17.943   {triqs_tprf.lattice.minimal_example_1}
2.037    {triqs_tprf.lattice.minimal_example_3}
1.042    {triqs_tprf.lattice.minimal_example_4}
0.798    {triqs_tprf.lattice.minimal_example_2}
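(For reference, the exact profiling command is not stated here; output in this form can be produced with an invocation along the lines of python -m cProfile -s tottime <script>.py, where <script>.py stands for the script above.)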

Versions 2 and 4 should obviously be faster, since they have no loop over the orbitals. But here you can see that the difference between versions 1 and 3 is immense.

Using NR=40 and NORB=1 gives:

tottime    filename:lineno(function)
2.132    {triqs_tprf.lattice.minimal_example_1}
2.018    {triqs_tprf.lattice.minimal_example_2}
1.926    {triqs_tprf.lattice.minimal_example_3}
1.783    {triqs_tprf.lattice.minimal_example_4}

where the computation times of all versions are similar.
So this speed-up from creating temporaries only appears when there are multiple orbitals.

Stefan-Dienst (Author) commented:

After profiling those minimal examples together with @Wentzell, we found that a lot of time in minimal_example_1 was spent in modulo operations.

Changing the line in cluster_mesh.hpp

linear_index_t index_to_linear(index_t const &i) const { return _modulo(i[0], 0) * s2 + _modulo(i[1], 1) * s1 + _modulo(i[2], 2); }

to

linear_index_t index_to_linear(index_t const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }

i.e. removing the unnecessary modulo operations, and then re-running the test from above, this time calling every function 20 times, yields

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20   47.785    2.389   47.787    2.389 {triqs_tprf.lattice.minimal_example_1}
       20   38.317    1.916   38.319    1.916 {triqs_tprf.lattice.minimal_example_3}
       20   14.858    0.743   14.860    0.743 {triqs_tprf.lattice.minimal_example_4}
       20    8.697    0.435    8.698    0.435 {triqs_tprf.lattice.minimal_example_2}

giving run times that are at least roughly in line with expectations.
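To illustrate why these modulo operations are so costly in the hot loop, here is a minimal standalone sketch (not TRIQS code; all names such as cyclic_linearizer and positive_modulo are made up for this example) that compares the two linearization variants shown above. The key point is that i % n with a runtime divisor involves an integer division, which is far more expensive than the plain multiply-add:

// Standalone sketch (not TRIQS code): compares index linearization with and
// without per-component modulo, mimicking the two versions of index_to_linear.
#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

using index_t = std::array<long, 3>;

struct cyclic_linearizer {
  long dims[3]; // extent in each direction
  long s1, s2;  // strides: s1 = dims[2], s2 = dims[1] * dims[2]

  // Wrap an index component into [0, dims[d]); also handles negative values.
  long positive_modulo(long i, int d) const {
    long r = i % dims[d];
    return r < 0 ? r + dims[d] : r;
  }

  // Old behaviour: modulo on every component, even if it is already in range.
  long to_linear_mod(index_t const &i) const {
    return positive_modulo(i[0], 0) * s2 + positive_modulo(i[1], 1) * s1 + positive_modulo(i[2], 2);
  }

  // Proposed behaviour: assume the index is already inside the cluster.
  long to_linear_plain(index_t const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }
};

int main(int argc, char **argv) {
  // Take N from the command line so the divisor is not a compile-time constant.
  long const N = argc > 1 ? std::atol(argv[1]) : 24;
  cyclic_linearizer lin{{N, N, N}, N, N * N};
  std::vector<double> data(N * N * N, 1.0);

  auto bench = [&](char const *label, auto linearize) {
    double acc = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 200; ++rep)
      for (long x = 0; x < N; ++x)
        for (long y = 0; y < N; ++y)
          for (long z = 0; z < N; ++z) acc += data[linearize(index_t{x, y, z})];
    auto t1 = std::chrono::steady_clock::now();
    std::cout << label << ": " << std::chrono::duration<double>(t1 - t0).count() << " s (acc = " << acc << ")\n";
  };

  bench("with modulo   ", [&](index_t const &i) { return lin.to_linear_mod(i); });
  bench("without modulo", [&](index_t const &i) { return lin.to_linear_plain(i); });
}

Compiled with optimizations, the modulo variant is expected to be noticeably slower, since integer division by a runtime divisor costs many more cycles than a multiply-add; this is consistent with the timings reported in this issue.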

Wentzell (Member) commented Jul 18, 2019

Hey Stefan,

Thank you for checking the timings without the modulo operations. It looks like the timings are now
much more on par. Unfortunately, the proper fix will have to go further than the change in cluster_mesh.hpp, as other codes seem to rely on this improper behavior.

We should keep this issue open until this has been addressed.
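One conceivable direction, sketched below purely for illustration (a hypothetical interface, not the actual TRIQS API and not necessarily what the eventual fix does), is to keep the bracket accessor's linearization modulo-free and expose the wrapping as an explicit, separate operation that callers relying on out-of-range indices must request:

// Sketch only -- hypothetical split of the cluster-mesh interface.
#include <array>

struct cluster_mesh_sketch {
  long dims[3];
  long s1, s2; // strides as in index_to_linear above

  // Fast path: assumes the index is already inside the cluster, no modulo.
  long index_to_linear(std::array<long, 3> const &i) const { return i[0] * s2 + i[1] * s1 + i[2]; }

  // Explicit wrapping for callers that pass out-of-range (e.g. negative) indices.
  std::array<long, 3> index_modulo(std::array<long, 3> const &i) const {
    auto wrap = [](long v, long n) { long r = v % n; return r < 0 ? r + n : r; };
    return {wrap(i[0], dims[0]), wrap(i[1], dims[1]), wrap(i[2], dims[2])};
  }
};

// Code relying on the old wrapping behaviour would then call
//   m.index_to_linear(m.index_modulo(i));
// while mesh iteration (always in range) pays only the multiply-add.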

Wentzell changed the title from "8x speed up by using temporary GFs in nested loop" to "Unnecessary modulo operations in gf<cyclic_lattice> bracket accessor" on Jul 18, 2019
Wentzell added a commit to Wentzell/triqs that referenced this issue on Aug 27, 2019
Wentzell added a commit that referenced this issue on Aug 30, 2019
Wentzell (Member) commented:

This is fixed by #744

Stefan-Dienst added a commit to TRIQS/tprf that referenced this issue on Aug 19, 2020