Thermostat noise for different seeds is correlated #3585

fweik · 2020-03-17T09:21:13Z

Currently the thermostat noise for different seeds is correlated, the seed just shifts the sequence.
Currently the seed is used to offset the counter, but it should be used in the key to get independent
sequences. Also there should only be one counter...

jngrad · 2020-03-17T13:00:38Z

We could use the user-provided seed value as the offset of a Philox counter and take the random number at that index as the seed for the thermostats.

KaiSzuttor · 2020-03-17T13:12:43Z

we should start with writing a failing test

fweik · 2020-03-17T13:17:37Z

The obvious and simple solution is to use the seed as part of the key. This would also allow to just use one counter for all the RNGs (as a global propagation parameter).

fweik · 2020-03-17T13:30:40Z

Also the salt should be part of the key, not the counter....

fweik · 2020-03-17T13:44:04Z

I'll propose a fix.

fweik · 2020-03-17T16:21:31Z

There is a sketch in https://github.com/fweik/espresso/tree/counter, which this the thermostats can be turned into proper classes, because seed just a regular parameter, which allows for removing all of the coursed constructions around the thermostats. I'm not going to work anymore on this in my free time...

fweik · 2020-03-17T16:22:45Z

The branch lacks the python interface, which can also be considerably simplified. The counter needs to be added to the state (this is why the checkpointing tests fail), and the cython stuff needs to be adjusted.

jngrad · 2020-03-18T12:12:23Z

You've also added an extra seed argument to all functions in src/core/random.hpp, but that seed isn't used.

fweik · 2020-03-20T11:38:16Z

Ah right. I'll finish this next week.

jngrad · 2020-06-23T07:21:11Z

Any news on this? The proposed fix simplifies the core a lot and would be a nice addition to 4.1.3.

fweik · 2020-06-23T07:22:37Z

I forgot about this, sorry. I'll take a look and asses how much work this still is. Then we can decide on a course of action.

fweik · 2020-06-23T07:27:52Z

The plan here was to use a single monotonic counter as parametric time, which is propagated once per integration step? Does anybody see a problem with such a design? @RudolfWeeber?

jngrad · 2020-06-23T07:33:27Z

Looking at your WIP I don't see an issue on 4.2.0, but for 4.3.0 with walberla I'm not sure.

The only side effect I can think of in 4.2.0, is that disabling a thermostat and re-enabling it later with the same seed will produce a different trajectory, if I understand your code correctly. This will make reproducing results more difficult, unless we provide a mechanism to reset the global counter.

fweik · 2020-06-23T07:35:40Z

This will make reproducing results more difficult, unless we provide a mechanism to reset the global counter.

Apparently this is the desired behavior, so this is by design.

fweik · 2020-06-23T07:56:03Z

Looking at the thermostat code, I found more bugs.

fweik · 2020-06-28T15:16:11Z

I won't have time to work on this.

jngrad · 2020-06-28T18:37:14Z

Could you tell me more about the other thermostat bug you found? I'd be interested in having a look at it.

fweik · 2020-06-28T18:43:34Z

No unfortunately I don't remember. I should have taken notes.

fweik · 2020-06-28T18:58:19Z

Let's talk on Monday, I have looked into this, and in principle it is solvable. They old WIP is of no use anymore, because too much of the relevant code changed, I have started a new one, providing the infrastructure, e.g. exposing all bits of the key in the philox wrapper and adding the evolution parameter as a new global. This could now be used for the thermostats, but I looked at design
and I'm not sure if this should be touched before the design isn't improved further. E.g. it is not clear to me if and how the RNG seeds are checkpointed.

jngrad · 2020-08-31T17:19:26Z

we should start with writing a failing test

not a unit test, but illustrates the issue:

diff --git a/src/core/unit_tests/thermostats_test.cpp b/src/core/unit_tests/thermostats_test.cpp
index 45eecdce6..dc30b7142 100644
--- a/src/core/unit_tests/thermostats_test.cpp
+++ b/src/core/unit_tests/thermostats_test.cpp
@@ -330,3 +330,18 @@ BOOST_AUTO_TEST_CASE(test_npt_iso_randomness) {
   }
 }
 #endif // NPT
+
+BOOST_AUTO_TEST_CASE(test_autocorrelation) {
+  auto thermostat1 = thermostat_factory<LangevinThermostat>();
+  auto thermostat2 = thermostat_factory<LangevinThermostat>();
+  thermostat1.rng_initialize(42);
+  thermostat2.rng_initialize(43);
+  auto p = particle_factory();
+  for (int i = 0 ; i < 10; ++i) {
+    thermostat1.rng_increment();
+    thermostat2.rng_increment();
+    auto p1 = friction_thermo_langevin(thermostat1, p);
+    auto p2 = friction_thermo_langevin(thermostat2, p);
+    printf("t = %d p1[0] = %+6.3f p2[0] = %+6.3f\n", i, p1[0], p2[0]);
+  }
+}

output:

69: t = 0 p1[0] = +5.946 p2[0] = -5.307
69: t = 1 p1[0] = -5.307 p2[0] = -4.884
69: t = 2 p1[0] = -4.884 p2[0] = -0.353
69: t = 3 p1[0] = -0.353 p2[0] = -5.667
69: t = 4 p1[0] = -5.667 p2[0] = -4.504
69: t = 5 p1[0] = -4.504 p2[0] = +2.516
69: t = 6 p1[0] = +2.516 p2[0] = +0.286
69: t = 7 p1[0] = +0.286 p2[0] = -4.657
69: t = 8 p1[0] = -4.657 p2[0] = -5.201
69: t = 9 p1[0] = -5.201 p2[0] = -1.721

The only way to test this is to run an autocorrelation function, which isn't very convenient in a unit test. We should also check autocorrelation on the ordered random sequences to reveal bugs such as the NPT one that took me, Flo and Rudolf quite a while to figure out when refactoring the NPT code where we call the same random function twice in a single time step (this bug never made it into a PR, fortunately), but this would need some special treatment since ordered sequence are naturally autocorrelated... Or, we just check the random sequences match the Philox sequences as suggested by Flo.

jngrad · 2020-08-31T21:32:52Z

The bonded interactions checkpointing mechanism in set up in such a way that exceptions are silently ignored. This makes it really hard to track down a seed checkpointing bug I just ran into (maybe this is what @fweik mentioned above). The seed checkpointing mechanism is really cryptic.

--- a/src/python/espressomd/interactions.pyx
+++ b/src/python/espressomd/interactions.pyx
@@ -1720,6 +1720,10 @@ cdef class BondedInteraction:
 
     property params:
         def __get__(self):
+            print(">>> ", self._params)
+            if 'seed' in self._params and self._params['seed'] == 0:
+                print('The exception on the next line will be silently ignored')
+                raise TypeError()
             return self._params
 
         def __set__(self, p):
@@ -3358,10 +3362,14 @@ class BondedInteractions:
         params = {}
         for i, bonded_instance in enumerate(self):
             if hasattr(bonded_instance, 'params'):
+                print('Read parameters...')
                 params[i] = bonded_instance.params
+                print('Read successful')
                 params[i]['bond_type'] = bonded_instance.type_number()
             else:
+                print('Object has no parameters!')
                 params[i] = None
+        print('::: ', params)
         return params
 
     def __setstate__(self, params):

output when running into the seed checkpointing bug that resets the seed to 0:

Read parameters...
>>>  {'k': 1.0, 'r_0': 0.0, 'r_cut': 0.0}
Read successful
>>>  {'temp_com': 0.0, 'gamma_com': 0.0, 'temp_distance': 0.2, 'gamma_distance': 0.5, 'r_cut': 2.0, 'seed': 0}
The exception on the next line will be silently ignored
Object has no parameters!
:::  {0: {'k': 1.0, 'r_0': 0.0, 'r_cut': 0.0, 'bond_type': 1}, 1: None}
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

KaiSzuttor · 2020-09-01T06:49:46Z

why do you think that the __get__ method is called for checkpointing?

RudolfWeeber · 2020-09-01T07:46:45Z

I think w need the following checks. Uncorrelated results for: * 2 particles, same counter value * 1 particle, consecutive time steps * 2 particles with the same id with 2 rngs with consecutive seeds All of these are single value correlations. In addition, we need to check the mean and variance of the outcome for a single particle. That should be feasible, both in terms of runtime and complexity. The alternative is to * Separate out the particle->key mapping * Make the RNG and key generation injectable via templates into the thermostats * Make a fake rng with deterministic outcomes * Test with that

KaiSzuttor · 2020-09-01T07:55:27Z

wouldn't it be better to have a single implementation of the generation of random numbers which can be unit tested?

RudolfWeeber · 2020-09-01T08:17:32Z

wouldn't it be better to have a single implementation of the generation of random numbers which can be unit tested?

That would be the outcome of the steps I proposed in the second list. However, there is two parts: key generation and random number generation. Both would need to be tested. It is clear that that approach is better, but it is also much more work. Since the thermalization is quite a central part, it might be worth it, though. @jngrad, what would you prefer?

jngrad · 2020-09-01T09:32:44Z

why do you think that the __get__ method is called for checkpointing?

I looks like it's the mechanism currently used for checkpointing bonded interactions objects, but I could be wrong. From what I understood, the __get__ method of the params member is called on all objects to be checkpointed in the __getstate__ method of the BondedInteractions class that handles them. The error probably happens in the hasattr

espresso/src/python/espressomd/interactions.pyx

Lines 3357 to 3365 in 748e9d0

    
           def __getstate__(self): 
        
               params = {} 
        
               for i, bonded_instance in enumerate(self): 
        
                   if hasattr(bonded_instance, 'params'): 
        
                       params[i] = bonded_instance.params 
        
                       params[i]['bond_type'] = bonded_instance.type_number() 
        
                   else: 
        
                       params[i] = None 
        
               return params

It is clear that that approach is better, but it is also much more work.

I'll have a look once the refactor is ready. Right now we have 3 different thermostat implementations in the core, so it's not clear to me yet which path is feasible.

RudolfWeeber · 2020-09-01T09:52:06Z

But if we decide to pull out the RNG and key generation, then THAT is the refactoring that should be done now.
Or maybe I don't understand what you are attempting with your current refactor.

KaiSzuttor · 2020-09-01T09:59:18Z

it is not the first bug we encounter since we switched to the counter-based RNG and we are still not sure if it is worth to pull out the logic and write unit tests?

jngrad · 2020-09-01T13:41:34Z

it is not the first bug we encounter since we switched to the counter-based RNG and we are still not sure if it is worth to pull out the logic and write unit tests?

All functions generating uniform and gaussian noise are checked for correlation in src/core/unit_tests/random_test.cpp. During this refactor I've added an extra check for autocorrelation at lag time n for seeds offset by n to capture the bug in this ticket. It's not as good as checking all lag times with a full autocorrelation function, but I don't think we need to check that case (Philox itself should have tests regarding autocorrelation) and we could only do that in a python test. There are already tests for thermostats in src/core/unit_tests/thermostats_test.cpp that check the correlation in every thermostat noise functions. The thermostats are created locally in the test using a template factory. We can pull out the random code into a library outside of src/core, but this will only pull out random.hpp and the BaseThermostat class of thermostat.hpp. The rest of thermostat code is too coupled to the core because of the physics behind it, e.g. it provides methods to calculate prefactors derived from the gamma friction coefficients.

wouldn't it be better to have a single implementation of the generation of random numbers which can be unit tested?

The RNG implementation is currently made of free functions in random.hpp that take a counter, a seed and two keys as arguments. Once I'm done with this ticket, there will be only one counter and each thermostat class will only store a seed and provide methods to calculate prefactors from gammas.

Separate out the particle->key mapping

I'm not sure what you mean here. My understanding is that we need the key mechanism to provide the same random numbers to the same particle id pair.

RudolfWeeber · 2020-09-01T13:53:49Z

By “key generation”, I mean: how a particle or a pair of particles maps to a key for the RNG. I think, there are two kinds: * Single particle (Langevin, SD, Brownian) * Pair of particle (DPD, thermalized bond)

jngrad · 2020-09-01T14:06:48Z

At the moment we use the same function for the single particles and particle pairs. In the particle pair case, either the particle ids order matters (thermalized bond), or it doesn't matter and the ids are sorted (DPD). We could refactor the free functions in random.hpp to take a key as argument instead of particle ids, and write a method that derives that key from the particle ids in the thermostat structs. That would help with abstraction in random.hpp and hide implementation details of the key derivation in dpd.cpp and thermalized_bond.hpp.

jngrad · 2020-09-01T19:29:50Z

Alright the WIP is in jngrad:thermostats-global-counter. All thermostats now derive from BaseThermostat and depend on a global, monotonically-increasing counter (even Stokesian Dynamics) that cannot be altered by the user, but gets checkpointed. All sources of correlation we know of should be gone. All thermostats share the same interface, except for LB GPU.

We can now decide whether we want to abstract the key generation for single particles/particle pairs, replace thermostat unit tests by statistical python tests, pull the free functions for random number generation out of the core, refactor the python class for thermostats (e.g. like in integrate.pyx), split the core classes for thermostats in separates files so that they don't know about each other, etc.

jngrad · 2020-09-02T09:06:24Z

By the way, the boost::optional guard around the RNG seed seems unnecessary to me. It looks like it only helps in the python interface, to spare the user from providing the same seed again when updating the thermostat temperature. This could be handled in the python class directly by storing the seed. Better yet, we could make it mandatory to provide the same seed again. Users should no longer use warmup, but instead rely on the steepest descent algorithm. If warmup is really needed, we could look into creating a method to update the temperature global. This would reduce code duplication in the python Thermostat class. We could even think about having a single global temperature, instead of two (temperature and sd_kT).

This boost::optional forces us to write two implementations of the BaseThermostat class for LB GPU because of bad compiler support with Clang 9 and CUDA 10, and we can't use std::optional (C++17 feature). Furthermore, keeping the seed in memory even after we "clear" the thermostat list with system.thermostat.turn_off() means all thermostat objects have the lifetime of the simulation. Without this optional, we could engineer a better thermostat infrastructure.

RudolfWeeber · 2020-09-02T09:46:32Z

Not sure about the optional, probably, we need to look at it in a meeting. ``temperature` is one of the globals which should go away. As Flo pointed out, the system doesn’t have a temperature. Algorithms which use one as parameter, should store it themselves.

jngrad · 2020-09-03T19:52:15Z

As Flo pointed out, the system doesn’t have a temperature. Algorithms which use one as parameter, should store it themselves.

Good point. Let's discuss the thermostat infrastructure in the next meeting.

Going back to the original issue, here is a MWE to reproduce the bug in a simulation:

import espressomd
system = espressomd.System(box_l=[1.0, 1.0, 1.0])
system.cell_system.skin = 0
system.time_step = 0.01
system.part.add(pos=[0, 0, 0], ext_force=[0, 0, -1])
kT = 1.1
gamma = 3.5

print("seed = 42")
system.thermostat.set_langevin(kT=kT, gamma=gamma, seed=42)
for i in range(4):
    system.integrator.run(1)
    print(system.part[0].pos if i else '...skip...')
    system.part[0].pos = [0, 0, 0]
    system.part[0].v = [0, 0, 0]

print("seed = 43")
system.thermostat.set_langevin(kT=kT, gamma=gamma, seed=43)
for i in range(3):
    system.integrator.run(1)
    print(system.part[0].pos)
    system.part[0].pos = [0, 0, 0]
    system.part[0].v = [0, 0, 0]

output:

seed = 42
...skip...
[ 0.00238167  0.00132846 -0.0004863 ]
[-0.00212575  0.00104275 -0.00153655]
[-0.00195621 -0.00193643 -0.00231657]
seed = 43
[ 0.00238167  0.00132846 -0.0004863 ]
[-0.00212575  0.00104275 -0.00153655]
[-0.00195621 -0.00193643 -0.00231657]

Although the random sequences are shifted between two simulations, the initial conditions (particle positions and velocities) are necessarily different, thus if at least two particles in the system are within interaction range, the initial forces will be different in Langevin Dynamics. Maybe this RNG correlation could be an issue in Stokesian Dynamics and Brownian Dynamics simulations?

Description of changes: - Remove RNG correlation stemming from seed offsets (fixes #3585) - seeds are now used as keys - a monotonically increasing counter is used in each thermostat - the only way to reset these counters is to create a new `System` - Remove RNG correlation stemming from resetting `sim_time` or `time_step` during simulations with SD (fixes #3840) - the SD thermostat now uses the same RNG interface as other thermostats - Accelerate RNG unit tests (fixes #3573) - they now take 2 seconds to run in coverage and sanitizer builds in CI - Separate thermostats from integrators - better separation of concerns

fweik added Bug Core labels Mar 17, 2020

fweik self-assigned this Mar 17, 2020

fweik removed their assignment Mar 17, 2020

fweik self-assigned this Mar 23, 2020

fweik removed their assignment Jun 28, 2020

jngrad added this to the Espresso 4.2 milestone Jun 29, 2020

This was referenced Sep 3, 2020

Use global counter in thermostats #3884

Draft

Refactor thermostat infrastructure #3888

Merged

kodiakhq bot closed this as completed in #3888 Sep 18, 2020

jngrad mentioned this issue Jan 12, 2024

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

Closed

Thermostat noise for different seeds is correlated #3585

Thermostat noise for different seeds is correlated #3585

Comments

fweik commented Mar 17, 2020

jngrad commented Mar 17, 2020

KaiSzuttor commented Mar 17, 2020

fweik commented Mar 17, 2020

fweik commented Mar 17, 2020

fweik commented Mar 17, 2020

fweik commented Mar 17, 2020

fweik commented Mar 17, 2020

jngrad commented Mar 18, 2020

fweik commented Mar 20, 2020

jngrad commented Jun 23, 2020

fweik commented Jun 23, 2020

fweik commented Jun 23, 2020

jngrad commented Jun 23, 2020

fweik commented Jun 23, 2020

fweik commented Jun 23, 2020

fweik commented Jun 28, 2020

jngrad commented Jun 28, 2020

fweik commented Jun 28, 2020

fweik commented Jun 28, 2020

jngrad commented Aug 31, 2020

jngrad commented Aug 31, 2020

KaiSzuttor commented Sep 1, 2020

RudolfWeeber commented Sep 1, 2020 via email

KaiSzuttor commented Sep 1, 2020 • edited Loading

RudolfWeeber commented Sep 1, 2020 via email

jngrad commented Sep 1, 2020

RudolfWeeber commented Sep 1, 2020

KaiSzuttor commented Sep 1, 2020

jngrad commented Sep 1, 2020

RudolfWeeber commented Sep 1, 2020 via email

jngrad commented Sep 1, 2020

jngrad commented Sep 1, 2020

jngrad commented Sep 2, 2020

RudolfWeeber commented Sep 2, 2020 via email

jngrad commented Sep 3, 2020

KaiSzuttor commented Sep 1, 2020 •

edited

Loading