Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thermostat noise for different seeds is correlated #3585

Closed
fweik opened this issue Mar 17, 2020 · 35 comments · Fixed by #3888 · May be fixed by #3884
Closed

Thermostat noise for different seeds is correlated #3585

fweik opened this issue Mar 17, 2020 · 35 comments · Fixed by #3888 · May be fixed by #3884

Comments

@fweik
Copy link
Contributor

fweik commented Mar 17, 2020

Currently the thermostat noise for different seeds is correlated, the seed just shifts the sequence.
Currently the seed is used to offset the counter, but it should be used in the key to get independent
sequences. Also there should only be one counter...

@jngrad
Copy link
Member

jngrad commented Mar 17, 2020

We could use the user-provided seed value as the offset of a Philox counter and take the random number at that index as the seed for the thermostats.

@KaiSzuttor
Copy link
Member

we should start with writing a failing test

@fweik
Copy link
Contributor Author

fweik commented Mar 17, 2020

The obvious and simple solution is to use the seed as part of the key. This would also allow to just use one counter for all the RNGs (as a global propagation parameter).

@fweik
Copy link
Contributor Author

fweik commented Mar 17, 2020

Also the salt should be part of the key, not the counter....

@fweik
Copy link
Contributor Author

fweik commented Mar 17, 2020

I'll propose a fix.

@fweik fweik self-assigned this Mar 17, 2020
@fweik
Copy link
Contributor Author

fweik commented Mar 17, 2020

There is a sketch in https://github.com/fweik/espresso/tree/counter, which this the thermostats can be turned into proper classes, because seed just a regular parameter, which allows for removing all of the coursed constructions around the thermostats. I'm not going to work anymore on this in my free time...

@fweik
Copy link
Contributor Author

fweik commented Mar 17, 2020

The branch lacks the python interface, which can also be considerably simplified. The counter needs to be added to the state (this is why the checkpointing tests fail), and the cython stuff needs to be adjusted.

@fweik fweik removed their assignment Mar 17, 2020
@jngrad
Copy link
Member

jngrad commented Mar 18, 2020

You've also added an extra seed argument to all functions in src/core/random.hpp, but that seed isn't used.

@fweik
Copy link
Contributor Author

fweik commented Mar 20, 2020

Ah right. I'll finish this next week.

@fweik fweik self-assigned this Mar 23, 2020
@jngrad
Copy link
Member

jngrad commented Jun 23, 2020

Any news on this? The proposed fix simplifies the core a lot and would be a nice addition to 4.1.3.

@fweik
Copy link
Contributor Author

fweik commented Jun 23, 2020

I forgot about this, sorry. I'll take a look and asses how much work this still is. Then we can decide on a course of action.

@fweik
Copy link
Contributor Author

fweik commented Jun 23, 2020

The plan here was to use a single monotonic counter as parametric time, which is propagated once per integration step? Does anybody see a problem with such a design? @RudolfWeeber?

@jngrad
Copy link
Member

jngrad commented Jun 23, 2020

Looking at your WIP I don't see an issue on 4.2.0, but for 4.3.0 with walberla I'm not sure.

The only side effect I can think of in 4.2.0, is that disabling a thermostat and re-enabling it later with the same seed will produce a different trajectory, if I understand your code correctly. This will make reproducing results more difficult, unless we provide a mechanism to reset the global counter.

@fweik
Copy link
Contributor Author

fweik commented Jun 23, 2020

This will make reproducing results more difficult, unless we provide a mechanism to reset the global counter.

Apparently this is the desired behavior, so this is by design.

@fweik
Copy link
Contributor Author

fweik commented Jun 23, 2020

Looking at the thermostat code, I found more bugs.

@fweik
Copy link
Contributor Author

fweik commented Jun 28, 2020

I won't have time to work on this.

@fweik fweik removed their assignment Jun 28, 2020
@jngrad
Copy link
Member

jngrad commented Jun 28, 2020

Could you tell me more about the other thermostat bug you found? I'd be interested in having a look at it.

@fweik
Copy link
Contributor Author

fweik commented Jun 28, 2020

No unfortunately I don't remember. I should have taken notes.

@fweik
Copy link
Contributor Author

fweik commented Jun 28, 2020

Let's talk on Monday, I have looked into this, and in principle it is solvable. They old WIP is of no use anymore, because too much of the relevant code changed, I have started a new one, providing the infrastructure, e.g. exposing all bits of the key in the philox wrapper and adding the evolution parameter as a new global. This could now be used for the thermostats, but I looked at design
and I'm not sure if this should be touched before the design isn't improved further. E.g. it is not clear to me if and how the RNG seeds are checkpointed.

@jngrad jngrad added this to the Espresso 4.2 milestone Jun 29, 2020
@jngrad
Copy link
Member

jngrad commented Aug 31, 2020

we should start with writing a failing test

not a unit test, but illustrates the issue:

diff --git a/src/core/unit_tests/thermostats_test.cpp b/src/core/unit_tests/thermostats_test.cpp
index 45eecdce6..dc30b7142 100644
--- a/src/core/unit_tests/thermostats_test.cpp
+++ b/src/core/unit_tests/thermostats_test.cpp
@@ -330,3 +330,18 @@ BOOST_AUTO_TEST_CASE(test_npt_iso_randomness) {
   }
 }
 #endif // NPT
+
+BOOST_AUTO_TEST_CASE(test_autocorrelation) {
+  auto thermostat1 = thermostat_factory<LangevinThermostat>();
+  auto thermostat2 = thermostat_factory<LangevinThermostat>();
+  thermostat1.rng_initialize(42);
+  thermostat2.rng_initialize(43);
+  auto p = particle_factory();
+  for (int i = 0 ; i < 10; ++i) {
+    thermostat1.rng_increment();
+    thermostat2.rng_increment();
+    auto p1 = friction_thermo_langevin(thermostat1, p);
+    auto p2 = friction_thermo_langevin(thermostat2, p);
+    printf("t = %d p1[0] = %+6.3f p2[0] = %+6.3f\n", i, p1[0], p2[0]);
+  }
+}

output:

69: t = 0 p1[0] = +5.946 p2[0] = -5.307
69: t = 1 p1[0] = -5.307 p2[0] = -4.884
69: t = 2 p1[0] = -4.884 p2[0] = -0.353
69: t = 3 p1[0] = -0.353 p2[0] = -5.667
69: t = 4 p1[0] = -5.667 p2[0] = -4.504
69: t = 5 p1[0] = -4.504 p2[0] = +2.516
69: t = 6 p1[0] = +2.516 p2[0] = +0.286
69: t = 7 p1[0] = +0.286 p2[0] = -4.657
69: t = 8 p1[0] = -4.657 p2[0] = -5.201
69: t = 9 p1[0] = -5.201 p2[0] = -1.721

The only way to test this is to run an autocorrelation function, which isn't very convenient in a unit test. We should also check autocorrelation on the ordered random sequences to reveal bugs such as the NPT one that took me, Flo and Rudolf quite a while to figure out when refactoring the NPT code where we call the same random function twice in a single time step (this bug never made it into a PR, fortunately), but this would need some special treatment since ordered sequence are naturally autocorrelated... Or, we just check the random sequences match the Philox sequences as suggested by Flo.

@jngrad
Copy link
Member

jngrad commented Aug 31, 2020

The bonded interactions checkpointing mechanism in set up in such a way that exceptions are silently ignored. This makes it really hard to track down a seed checkpointing bug I just ran into (maybe this is what @fweik mentioned above). The seed checkpointing mechanism is really cryptic.

--- a/src/python/espressomd/interactions.pyx
+++ b/src/python/espressomd/interactions.pyx
@@ -1720,6 +1720,10 @@ cdef class BondedInteraction:
 
     property params:
         def __get__(self):
+            print(">>> ", self._params)
+            if 'seed' in self._params and self._params['seed'] == 0:
+                print('The exception on the next line will be silently ignored')
+                raise TypeError()
             return self._params
 
         def __set__(self, p):
@@ -3358,10 +3362,14 @@ class BondedInteractions:
         params = {}
         for i, bonded_instance in enumerate(self):
             if hasattr(bonded_instance, 'params'):
+                print('Read parameters...')
                 params[i] = bonded_instance.params
+                print('Read successful')
                 params[i]['bond_type'] = bonded_instance.type_number()
             else:
+                print('Object has no parameters!')
                 params[i] = None
+        print('::: ', params)
         return params
 
     def __setstate__(self, params):

output when running into the seed checkpointing bug that resets the seed to 0:

Read parameters...
>>>  {'k': 1.0, 'r_0': 0.0, 'r_cut': 0.0}
Read successful
>>>  {'temp_com': 0.0, 'gamma_com': 0.0, 'temp_distance': 0.2, 'gamma_distance': 0.5, 'r_cut': 2.0, 'seed': 0}
The exception on the next line will be silently ignored
Object has no parameters!
:::  {0: {'k': 1.0, 'r_0': 0.0, 'r_cut': 0.0, 'bond_type': 1}, 1: None}
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

@KaiSzuttor
Copy link
Member

why do you think that the __get__ method is called for checkpointing?

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Sep 1, 2020 via email

@KaiSzuttor
Copy link
Member

KaiSzuttor commented Sep 1, 2020

wouldn't it be better to have a single implementation of the generation of random numbers which can be unit tested?

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Sep 1, 2020 via email

@jngrad
Copy link
Member

jngrad commented Sep 1, 2020

why do you think that the __get__ method is called for checkpointing?

I looks like it's the mechanism currently used for checkpointing bonded interactions objects, but I could be wrong. From what I understood, the __get__ method of the params member is called on all objects to be checkpointed in the __getstate__ method of the BondedInteractions class that handles them. The error probably happens in the hasattr

def __getstate__(self):
params = {}
for i, bonded_instance in enumerate(self):
if hasattr(bonded_instance, 'params'):
params[i] = bonded_instance.params
params[i]['bond_type'] = bonded_instance.type_number()
else:
params[i] = None
return params

It is clear that that approach is better, but it is also much more work.

I'll have a look once the refactor is ready. Right now we have 3 different thermostat implementations in the core, so it's not clear to me yet which path is feasible.

@RudolfWeeber
Copy link
Contributor

But if we decide to pull out the RNG and key generation, then THAT is the refactoring that should be done now.
Or maybe I don't understand what you are attempting with your current refactor.

@KaiSzuttor
Copy link
Member

it is not the first bug we encounter since we switched to the counter-based RNG and we are still not sure if it is worth to pull out the logic and write unit tests?

@jngrad
Copy link
Member

jngrad commented Sep 1, 2020

it is not the first bug we encounter since we switched to the counter-based RNG and we are still not sure if it is worth to pull out the logic and write unit tests?

All functions generating uniform and gaussian noise are checked for correlation in src/core/unit_tests/random_test.cpp. During this refactor I've added an extra check for autocorrelation at lag time n for seeds offset by n to capture the bug in this ticket. It's not as good as checking all lag times with a full autocorrelation function, but I don't think we need to check that case (Philox itself should have tests regarding autocorrelation) and we could only do that in a python test. There are already tests for thermostats in src/core/unit_tests/thermostats_test.cpp that check the correlation in every thermostat noise functions. The thermostats are created locally in the test using a template factory. We can pull out the random code into a library outside of src/core, but this will only pull out random.hpp and the BaseThermostat class of thermostat.hpp. The rest of thermostat code is too coupled to the core because of the physics behind it, e.g. it provides methods to calculate prefactors derived from the gamma friction coefficients.

wouldn't it be better to have a single implementation of the generation of random numbers which can be unit tested?

The RNG implementation is currently made of free functions in random.hpp that take a counter, a seed and two keys as arguments. Once I'm done with this ticket, there will be only one counter and each thermostat class will only store a seed and provide methods to calculate prefactors from gammas.

Separate out the particle->key mapping

I'm not sure what you mean here. My understanding is that we need the key mechanism to provide the same random numbers to the same particle id pair.

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Sep 1, 2020 via email

@jngrad
Copy link
Member

jngrad commented Sep 1, 2020

At the moment we use the same function for the single particles and particle pairs. In the particle pair case, either the particle ids order matters (thermalized bond), or it doesn't matter and the ids are sorted (DPD). We could refactor the free functions in random.hpp to take a key as argument instead of particle ids, and write a method that derives that key from the particle ids in the thermostat structs. That would help with abstraction in random.hpp and hide implementation details of the key derivation in dpd.cpp and thermalized_bond.hpp.

@jngrad
Copy link
Member

jngrad commented Sep 1, 2020

Alright the WIP is in jngrad:thermostats-global-counter. All thermostats now derive from BaseThermostat and depend on a global, monotonically-increasing counter (even Stokesian Dynamics) that cannot be altered by the user, but gets checkpointed. All sources of correlation we know of should be gone. All thermostats share the same interface, except for LB GPU.

We can now decide whether we want to abstract the key generation for single particles/particle pairs, replace thermostat unit tests by statistical python tests, pull the free functions for random number generation out of the core, refactor the python class for thermostats (e.g. like in integrate.pyx), split the core classes for thermostats in separates files so that they don't know about each other, etc.

@jngrad
Copy link
Member

jngrad commented Sep 2, 2020

By the way, the boost::optional guard around the RNG seed seems unnecessary to me. It looks like it only helps in the python interface, to spare the user from providing the same seed again when updating the thermostat temperature. This could be handled in the python class directly by storing the seed. Better yet, we could make it mandatory to provide the same seed again. Users should no longer use warmup, but instead rely on the steepest descent algorithm. If warmup is really needed, we could look into creating a method to update the temperature global. This would reduce code duplication in the python Thermostat class. We could even think about having a single global temperature, instead of two (temperature and sd_kT).

This boost::optional forces us to write two implementations of the BaseThermostat class for LB GPU because of bad compiler support with Clang 9 and CUDA 10, and we can't use std::optional (C++17 feature). Furthermore, keeping the seed in memory even after we "clear" the thermostat list with system.thermostat.turn_off() means all thermostat objects have the lifetime of the simulation. Without this optional, we could engineer a better thermostat infrastructure.

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Sep 2, 2020 via email

@jngrad
Copy link
Member

jngrad commented Sep 3, 2020

As Flo pointed out, the system doesn’t have a temperature. Algorithms which use one as parameter, should store it themselves.

Good point. Let's discuss the thermostat infrastructure in the next meeting.


Going back to the original issue, here is a MWE to reproduce the bug in a simulation:

import espressomd
system = espressomd.System(box_l=[1.0, 1.0, 1.0])
system.cell_system.skin = 0
system.time_step = 0.01
system.part.add(pos=[0, 0, 0], ext_force=[0, 0, -1])
kT = 1.1
gamma = 3.5

print("seed = 42")
system.thermostat.set_langevin(kT=kT, gamma=gamma, seed=42)
for i in range(4):
    system.integrator.run(1)
    print(system.part[0].pos if i else '...skip...')
    system.part[0].pos = [0, 0, 0]
    system.part[0].v = [0, 0, 0]

print("seed = 43")
system.thermostat.set_langevin(kT=kT, gamma=gamma, seed=43)
for i in range(3):
    system.integrator.run(1)
    print(system.part[0].pos)
    system.part[0].pos = [0, 0, 0]
    system.part[0].v = [0, 0, 0]

output:

seed = 42
...skip...
[ 0.00238167  0.00132846 -0.0004863 ]
[-0.00212575  0.00104275 -0.00153655]
[-0.00195621 -0.00193643 -0.00231657]
seed = 43
[ 0.00238167  0.00132846 -0.0004863 ]
[-0.00212575  0.00104275 -0.00153655]
[-0.00195621 -0.00193643 -0.00231657]

Although the random sequences are shifted between two simulations, the initial conditions (particle positions and velocities) are necessarily different, thus if at least two particles in the system are within interaction range, the initial forces will be different in Langevin Dynamics. Maybe this RNG correlation could be an issue in Stokesian Dynamics and Brownian Dynamics simulations?

@kodiakhq kodiakhq bot closed this as completed in #3888 Sep 18, 2020
kodiakhq bot added a commit that referenced this issue Sep 18, 2020
Description of changes:
- Remove RNG correlation stemming from seed offsets (fixes #3585)
    - seeds are now used as keys
    - a monotonically increasing counter is used in each thermostat
    - the only way to reset these counters is to create a new `System`
- Remove RNG correlation stemming from resetting `sim_time` or `time_step` during simulations with SD (fixes #3840)
    - the SD thermostat now uses the same RNG interface as other thermostats
- Accelerate RNG unit tests (fixes #3573)
    - they now take 2 seconds to run in coverage and sanitizer builds in CI
- Separate thermostats from integrators
    - better separation of concerns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants