Intermittent failures in `test_continuous_and_interrupted_simulations_equal` #1560

matt-graham · 2024-12-20T09:48:18Z

As noticed and discussed in #1527 and previously in #1507 (comment) the test for consistency of simulations run continuously and with suspend/resume in tests/test_simulation.py::test_continuous_and_interrupted_simulations_equal is intermittently and non-reproducibly failing

Failing tests workflow run on #1507

tests.test_simulation.test_continuous_and_interrupted_simulations_equal[83563095832589325021]

tests/test_simulation.py:295: in test_continuous_and_interrupted_simulations_equal
    _check_parsed_logs_are_equal(
tests/test_simulation.py:263: in _check_parsed_logs_are_equal
    assert module_logs_1[key].equals(module_logs_2[key])
E   assert False
E    +  where False = equals(         date  ALRI_Enterobacteriaceae  ...  stroke  year\n0  2010-06-30                      0.0  ...  0.0000  2010\n1  2010-06-30                      0.0  ...  0.0000  2010\n2  2010-06-30                      0.0  ...  0.0000  2010\n3  2010-06-30                      0.0  ...  0.0000  2010\n4  2010-06-30                      0.0  ...  0.0000  2010\n5  2010-06-30                      0.0  ...  0.0000  2010\n6  2010-06-30                      0.0  ...  0.0000  2010\n7  2010-06-30                      0.0  ...  0.0000  2010\n8  2010-06-30                      0.0  ...  0.0000  2010\n9  2010-06-30                      0.0  ...  0.0000  2010\n10 2010-06-30                      0.0  ...  0.0000  2010\n11 2010-06-30                      0.0  ...  0.0000  2010\n12 2010-06-30                      0.0  ...  0.0000  2010\n13 2010-06-30                      0.0  ...  0.0000  2010\n14 2010-06-30                      0.0  ...  0.0000  2010\n15 2010-06-30                      0.0  ...  0.0000  2010\n16 2010-06-30                      0.0  ...  0.0000  2010\n17 2010-06-30                      0.0  ...  0.0000  2010\n18 2010-06-30                      0.0  ...  0.0000  2010\n19 2010-06-30                      0.0  ...  0.0000  2010\n20 2010-06-30                      0.0  ...  0.0000  2010\n21 2010-06-30                      0.0  ...  0.0000  2010\n22 2010-06-30                      0.0  ...  0.0000  2010\n23 2010-06-30                      0.0  ...  0.0000  2010\n24 2010-06-30                      0.0  ...  0.0000  2010\n25 2010-06-30                      0.0  ...  0.0000  2010\n26 2010-06-30                      0.0  ...  0.0000  2010\n27 2010-06-30                      0.0  ...  0.0000  2010\n28 2010-06-30                      0.0  ...  0.0000  2010\n29 2010-06-30                      0.0  ...  0.0525  2010\n30 2010-06-30                      0.0  ...  0.0000  2010\n31 2010-06-30                      0.0  ...  0.0000  2010\n32 2010-06-30                      0.0  ...  0.0000  2010\n33 2010-06-30                      0.0  ...  0.0000  2010\n34 2010-06-30                      0.0  ...  0.0000  2010\n35 2010-06-30                      0.0  ...  0.0000  2010\n36 2010-06-30                      0.0  ...  0.0000  2010\n37 2010-06-30                      0.0  ...  0.0000  2010\n38 2010-06-30                      0.0  ...  0.0000  2010\n39 2010-06-30                      0.0  ...  0.0000  2010\n40 2010-06-30                      0.0  ...  0.0000  2010\n41 2010-06-30                      0.0  ...  0.0000  2010\n\n[42 rows x 47 columns])
E    +    where equals =          date  ALRI_Enterobacteriaceae  ...  stroke  year\n0  2010-06-30                      0.0  ...  0.0000  2010\n1  2010-06-30                      0.0  ...  0.0000  2010\n2  2010-06-30                      0.0  ...  0.0000  2010\n3  2010-06-30                      0.0  ...  0.0000  2010\n4  2010-06-30                      0.0  ...  0.0000  2010\n5  2010-06-30                      0.0  ...  0.0000  2010\n6  2010-06-30                      0.0  ...  0.0000  2010\n7  2010-06-30                      0.0  ...  0.0000  2010\n8  2010-06-30                      0.0  ...  0.0000  2010\n9  2010-06-30                      0.0  ...  0.0000  2010\n10 2010-06-30                      0.0  ...  0.0000  2010\n11 2010-06-30                      0.0  ...  0.0000  2010\n12 2010-06-30                      0.0  ...  0.0000  2010\n13 2010-06-30                      0.0  ...  0.0000  2010\n14 2010-06-30                      0.0  ...  0.0000  2010\n15 2010-06-30                      0.0  ...  0.0000  2010\n16 2010-06-30                      0.0  ...  0.0000  2010\n17 2010-06-30                      0.0  ...  0.0000  2010\n18 2010-06-30                      0.0  ...  0.0000  2010\n19 2010-06-30                      0.0  ...  0.0000  2010\n20 2010-06-30                      0.0  ...  0.0000  2010\n21 2010-06-30                      0.0  ...  0.0000  2010\n22 2010-06-30                      0.0  ...  0.0000  2010\n23 2010-06-30                      0.0  ...  0.0000  2010\n24 2010-06-30                      0.0  ...  0.0000  2010\n25 2010-06-30                      0.0  ...  0.0000  2010\n26 2010-06-30                      0.0  ...  0.0000  2010\n27 2010-06-30                      0.0  ...  0.0000  2010\n28 2010-06-30                      0.0  ...  0.0000  2010\n29 2010-06-30                      0.0  ...  0.0525  2010\n30 2010-06-30                      0.0  ...  0.0000  2010\n31 2010-06-30                      0.0  ...  0.0000  2010\n32 2010-06-30                      0.0  ...  0.0000  2010\n33 2010-06-30                      0.0  ...  0.0000  2010\n34 2010-06-30                      0.0  ...  0.0000  2010\n35 2010-06-30                      0.0  ...  0.0000  2010\n36 2010-06-30                      0.0  ...  0.0000  2010\n37 2010-06-30                      0.0  ...  0.0000  2010\n38 2010-06-30                      0.0  ...  0.0000  2010\n39 2010-06-30                      0.0  ...  0.0000  2010\n40 2010-06-30                      0.0  ...  0.0000  2010\n41 2010-06-30                      0.0  ...  0.0000  2010\n\n[42 rows x 47 columns].equals

Failing tests workflow run on #1527

tests.test_simulation.test_continuous_and_interrupted_simulations_equal[83563095832589325021]

tests/test_simulation.py:295: in test_continuous_and_interrupted_simulations_equal
    _check_parsed_logs_are_equal(
tests/test_simulation.py:263: in _check_parsed_logs_are_equal
    assert module_logs_1[key].equals(module_logs_2[key])
E   assert False
E    +  where False = equals(        date        AIDS      COPD  ...  Transport Injuries  li_wealth    year\n0 2010-06-30  123.856993  0.964876  ...            0.109333        1.0  2010.0\n1 2010-06-30  129.809534  0.400000  ...            0.054583        2.0  2010.0\n2 2010-06-30    5.864957  0.721667  ...           77.338475        3.0  2010.0\n3 2010-06-30   58.477938  1.010000  ...            0.088167        4.0  2010.0\n4 2010-06-30   82.703367  1.138333  ...            0.071333        5.0  2010.0\n\n[5 rows x 28 columns])
E    +    where equals =         date        AIDS      COPD  ...  Transport Injuries  li_wealth    year\n0 2010-06-30  123.856993  0.964876  ...            0.109333        1.0  2010.0\n1 2010-06-30  129.809534  0.400000  ...            0.054583        2.0  2010.0\n2 2010-06-30    5.864957  0.721667  ...           77.338475        3.0  2010.0\n3 2010-06-30   58.477938  1.010000  ...            0.088167        4.0  2010.0\n4 2010-06-30   82.703367  1.138333  ...            0.071333        5.0  2010.0\n\n[5 rows x 28 columns].equals

Both failures appear to be specifically due to differences in the log output across the simulations but the specific log key the difference is being seen in appears to differ and unfortunately they don't appear to differ in the summary output shown so it's not clear if this can help narrow down where the difference might be arising.

The text was updated successfully, but these errors were encountered:

matt-graham · 2024-12-20T10:30:53Z

Out of 12 runs of test locally on current master (05d3679) currently had one failure which is similar to frequency @tamuri mentioned in #1527 (comment)

For this specific failure, the difference is in the logs for key dalys_by_wealth_stacked_by_age_and_time with one entry in log dataframes differing corresponding to value for Schistosomiasis on 2010-06-30 with module_logs_1[key].compare(module_logs_2[key]) reporting

  Schistosomiasis
             self     other
2        0.132833  0.132833

with the precision there not sufficient to see difference. Looking at the full values we have that one is 0.1328333333333333 and the other 0.13283333333333333 so differing in only the last decimal place! Not clear if all of failures are due to such tiny differences as appears that not always occurring for same log key from above early failures. This level of difference could arise for example though from non-associativity of floating point addition / accumulating values in different orders. I'm not 100% sure that Pandas (or NumPy under the hood) doesn't for example use multithreading to parallelise reductions which might lead to such differences. Will continue investigating.

tamuri · 2024-12-20T14:52:54Z

`module_logs_1[key].compare(module_logs_2[key]

I think you mean equals?

module_logs_1[key] returns a dataframe, so we might be better off using assert_frame_equal(module_logs_1[key], module_logs_2[key]) instead, which properly handles floating-point.

matt-graham · 2024-12-20T15:20:51Z

I think you mean equals?

Nope I did mean compare, DataFrame.equals returns a bool, compare returns a DataFrame showing differences.

module_logs_1[key] returns a dataframe, so we might be better off using assert_frame_equal(module_logs_1[key], module_logs_2[key]) instead, which properly handles floating-point.

I think we probably want to switch to pandas.testing.assert_frame_equal anyway as it gives more informative error messages on failure (specifically showing the differing entries), and have switched to that in branch I'm using locally for testing but I'm currently using check_exact = True, which gives same behaviour as DataFrame.equals as it's not clear to me that using a relaxed floating point equality check within some tolerances is the correct approach here. If everything is determinstic floating point values should also be exactly equal. What I am not sure however is if there is some underlying non-determinism due to something like multithreading of reductions. If so a small tolerance may make sense, but it's not clear to me if that is cause yet, particularly given this is so intermittent and only showing in very small differences in individual values, and I would have thought something like this would be reflected more broadly.

I'm looping running the test with PYTHONHASHSEED set to 1 through to 40 (to try to see if I can reproduce error with fixed hash seed as if so may be due to set ordering somewhere) and so far up to 37 with no failures of test.

tamuri · 2024-12-20T16:28:45Z

Yeah, it's difficult to imagine this is floating-point error. I think the float value is consistent (797/6000) in runs but the printing of the float is inconsistent, sometimes printing 16 decimal places rather than 17.

matt-graham · 2024-12-20T16:45:10Z

Not sure if it's all that helpful but I seem to consistently (at least have over three tries now) get a failure with fixed hash seed of 44, that is running

PYTHONHASHSEED=44 pytest tests/test_simulation.py::test_continuous_and_interrupted_simulations_equal

I've had a look through schisto module and I can see once instance of a set which isn't sorted in

TLOmodel/src/tlo/methods/schisto.py

Lines 1004 to 1006 in 05d3679

    
           beneficiaries_still_alive = list(set(self.beneficiaries_ids).intersection( 
        
               self.sim.population.props.index[self.sim.population.props.is_alive] 
        
           ))

but changing list to sorted there doesn't seem to resolve issue here as still get the same test failure with this change.

matt-graham · 2024-12-20T16:55:15Z

Checked the bytes representation of the two values and it does look like they are actually different, not just an artefact of rounding when printing:

(Pdb) module_logs_1[key]["Schistosomiasis"].loc[2].tobytes().hex()
'661f3ec3ae00c13f'
(Pdb) module_logs_2[key]["Schistosomiasis"].loc[2].tobytes().hex()
'671f3ec3ae00c13f'

matt-graham · 2024-12-20T17:02:50Z

Oh wait I possibly misinterpreted

but the printing of the float is inconsistent, sometimes printing 16 decimal places rather than 17.

as I was forgetting this is a value parsed from log file where it will have been printed out. So I guess I should check if the dataframe the log records are created from shows the same discrepancy before being written to file

matt-graham · 2024-12-20T17:16:26Z

Okay have checked and the dataframes returned by self.get_dalys(yld=yld_by_wealth, yll=yll_by_wealth) which is then passed to a function log_df_line_by_line to write to log in

TLOmodel/src/tlo/methods/healthburden.py

Lines 522 to 530 in 05d3679

    
           log_df_line_by_line( 
        
               key='dalys_by_wealth_stacked_by_age_and_time', 
        
               description='DALYS, by the labels are that are declared for each cause_of_death and cause_of_disability' 
        
                           ', broken down by year and wealth category.' 
        
                           'Stacking by age and time: i.e., all the year of life lost ' 
        
                           'are ascribed to the age of the death and the year of the death.', 
        
               df=self.get_dalys(yld=yld_by_wealth, yll=yll_by_wealth), 
        
               force_cols=self._causes_of_dalys,

do differ in the same value ("Schistosomiasis" column, row index 2) with the same differing byte representations for continuous and suspended/resumed runs as above.

tamuri · 2024-12-20T17:56:43Z

Great hunting. I flipped my suspicion after you said setting hash seed makes the test fail consistently.

matt-graham · 2024-12-20T18:01:31Z

Tracked this back a bit further to the difference specifically arising in

self.years_lived_with_disability.loc["F", "0-4", 3, 2010]["Schistosomiasis"]

with values 0.03333333333333333/111111111111a13f (continuous) and 0.033333333333333326/101111111111a13f (suspend/resume).

The logic however for how HealthBurdern.years_lived_with_disability is updated is however pretty complex so not sure I can track back further to why we are getting this (very small) discrepancy.

tamuri · 2024-12-21T11:25:29Z

Numbers would be coming from schisto module's report_daly_values. For next year!

matt-graham changed the title ~~Intermittent failures in tests/test_simulation.py::test_continuous_and_interrupted_simulations_equal~~ Intermittent failures in test_continuous_and_interrupted_simulations_equal Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent failures in `test_continuous_and_interrupted_simulations_equal` #1560

Intermittent failures in `test_continuous_and_interrupted_simulations_equal` #1560

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024 •

edited

Loading

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024 •

edited

Loading

matt-graham commented Dec 20, 2024

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024

tamuri commented Dec 21, 2024

Intermittent failures in test_continuous_and_interrupted_simulations_equal #1560

Intermittent failures in test_continuous_and_interrupted_simulations_equal #1560

Comments

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024 • edited Loading

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024

matt-graham commented Dec 20, 2024 • edited Loading

matt-graham commented Dec 20, 2024

tamuri commented Dec 20, 2024

matt-graham commented Dec 20, 2024

tamuri commented Dec 21, 2024

Intermittent failures in `test_continuous_and_interrupted_simulations_equal` #1560

Intermittent failures in `test_continuous_and_interrupted_simulations_equal` #1560

matt-graham commented Dec 20, 2024 •

edited

Loading

matt-graham commented Dec 20, 2024 •

edited

Loading