Investigate intermittent test failure on NVHPC #3172

Open
JCGoran opened this issue Nov 4, 2024 · 1 comment
JCGoran commented Nov 4, 2024

Occasionally we get some CVODE test failures on NVHPC (see the full log on BB5 here):

208/398 Test  #22: pytest_coreneuron::basic_tests_py3.11 ..............................................***Failed   27.61 sec
_____________________________ test_t13[cvode-3-t] ______________________________

chk = <neuron.tests.utils.checkresult.Chk object at 0x2aaaad5d4f90>
t13_model_data = {'method': 'cvode', 1: {'Cell[0]': {'t': [0.0, 0.002178270042660733, 0.004356540085321466, 0.010144251293845802, 0.015...: [-65.0, -64.12990281759853, -63.26108085514461, -60.958494540260396, -58.66371131169468, -56.377246188972094, ...]}}}
field = 't', threads = 3

    @pytest.mark.parametrize("field", ["t", "v"])
    @pytest.mark.parametrize("threads", thread_values)
    def test_t13(chk, t13_model_data, field, threads):
        """hh model, testing fixed step and cvode with threads.
    
        This used to be t13.hoc in nrntest/fast.
    
        See t13_model_data for the actual model and see
        compare_time_and_voltage_trajectories for explanation of how the results
        are validated, including why the thresholds for 1 and 3 threads below are
        probing different things."""
    
        method = t13_model_data["method"]  # cvode or fixed
        # Determine the relative tolerance we can accept
        tolerance = 0.0
        if method == "fixed" and field == "v" and threads == 1:
            tolerance = 1e-10
        elif method.startswith("cvode"):
            if field == "t":
                tolerance = 5e-8
            elif field == "v":
                tolerance = 6e-7
    
>       compare_time_and_voltage_trajectories(
            chk, t13_model_data, field, threads, "t13", tolerance
        )

test/pytest_coreneuron/test_nrntest_fast.py:222: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

chk = <neuron.tests.utils.checkresult.Chk object at 0x2aaaad5d4f90>
model_data = {'method': 'cvode', 1: {'Cell[0]': {'t': [0.0, 0.002178270042660733, 0.004356540085321466, 0.010144251293845802, 0.015...: [-65.0, -64.12990281759853, -63.26108085514461, -60.958494540260396, -58.66371131169468, -56.377246188972094, ...]}}}
field = 't', threads = 3, name = 'Cell[2]', tolerance = 5e-08

    def compare_time_and_voltage_trajectories(
        chk, model_data, field, threads, name, tolerance
    ):
        """Compare time and voltage trajectories for several cells.
    
        Arguments:
          chk: handle to JSON reference data loaded from disk
          model_data: the data to compare, this is a nested dict with structure
                      model_data[thread_count][cell_name][field] = list(values)
          field: which field (time or voltage) to compare
          threads: which thread count to compare
          name: name of the test (t13 or t14, for now), used to access chk
          tolerance: relative tolerance for the fuzzy comparison between values
    
        This is used to implement both test_t13 and test_t14. Different fields and
        numbers of threads are compared somewhat differently, as follows:
    
        - if threads == 1, the reference data are loaded from JSON via chk. If no
          data are loaded, no comparison is done, but the data from the current
          run are saved to disk (for use as a future reference run). This means
          that the comparisons with threads == 1 are sensitive to differences
          between different compilers, architectures, optimisation settings etc.
        - for threads != 1, the results with threads == 1 are used as a reference.
          This means that these comparisons are sensitive to differences in
          summation order and rounding due to different orderings of floating
          point operations etc.
    
        Furthermore, there is special handling for the dependent variable (v), to
        reduce the need for generous tolerances. Because, when cvode is used,
        there are small differences in the time values between the reference data
        and the new data, it is expected that (particularly when the voltage is
        changing rapidly) this will generate differences in the voltage values. To
        mitigate this, the new voltage values are interpolated to match the time
        values from the reference data before comparison."""
        method = model_data["method"]  # cvode or fixed
    
        # Determine which data we will use as a reference
        if threads == 1:
            # threads=1: compare to reference from JSON file on disk
            key = name + ":"
            if method.startswith("cvode"):
                key += "cvode"
            else:
                key += method
            ref_data = chk.get(key, None)
            if ref_data is None:
                # No comparison to be done; store the data as a new reference
                chk(key, model_data[threads])
                return
        else:
            # threads>1: compare to threads=1 from this test execution
            ref_data = model_data[1]
    
        # Compare `field` in `this_data` with `ref_data` and `tolerance`
        this_data = model_data[threads]
    
        if field == "v":
            # If the t values don't match then it is expected that the v values
            # won't either, particularly when the voltage is changing rapidly.
            # Interpolate the new v values to match the reference t values to
            # mitigate this.
            def interp(new_t, old_t, old_v):
                assert np.all(np.diff(old_t) > 0)
                return np.interp(new_t, old_t, old_v)
    
            new_data = {}
            for name, data in this_data.items():
                ref_t = ref_data[name]["t"]
                raw_t, raw_v = data["t"], data["v"]
                assert len(raw_t) == len(ref_t)
                assert len(raw_v) == len(ref_t)
                new_v = interp(ref_t, raw_t, raw_v)
                new_data[name] = {"v": new_v}
            this_data = new_data
    
        # Finally ready to compare
        assert this_data.keys() <= ref_data.keys()
        max_diff = 0.0
        for name in this_data:  # cell name
            # Pick out the field we're comparing
            these_vals = this_data[name][field]
            ref_vals = ref_data[name][field]
            assert len(these_vals) == len(ref_vals)
            for a, b in zip(these_vals, ref_vals):
                match = math.isclose(a, b, rel_tol=tolerance)
                if match:
                    continue
                diff = abs(a - b) / max(abs(a), abs(b))
                max_diff = max(diff, max_diff)
        if max_diff > tolerance:
>           raise Exception("max diff {} > {}".format(max_diff, tolerance))
E           Exception: max diff 5.088799056284889e-08 > 5e-08
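
For context on how marginal this failure is: the reported maximum relative difference, 5.088799056284889e-08, exceeds the 5e-08 tolerance by about 1.8%. The sketch below isolates the comparison loop from the traceback above in a standalone function. The function name `max_relative_diff` and the sample values are made up for illustration and are not part of the NEURON test suite.

```python
import math


def max_relative_diff(these_vals, ref_vals, tolerance):
    """Return the largest relative difference among pairs that fail math.isclose.

    Mirrors the loop in compare_time_and_voltage_trajectories: pairs within
    `tolerance` are skipped, so the result is 0.0 when everything matches.
    """
    max_diff = 0.0
    for a, b in zip(these_vals, ref_vals):
        if math.isclose(a, b, rel_tol=tolerance):
            continue
        max_diff = max(max_diff, abs(a - b) / max(abs(a), abs(b)))
    return max_diff


tolerance = 5e-8

# Made-up time values: the last one is perturbed by a relative 6e-8,
# which is just outside the 5e-8 tolerance used for field == "t".
ref_t = [0.0, 1.0, 2.0]
new_t = [0.0, 1.0, 2.0 * (1 + 6e-8)]
print("max diff:", max_relative_diff(new_t, ref_t, tolerance))

# How far the value reported in this issue overshoots the tolerance.
observed = 5.088799056284889e-08
print("relative overshoot:", observed / tolerance - 1.0)
```

Because the check only fails when a single pair crosses the threshold, a failure this close to the limit is consistent with small run-to-run differences in floating point operation order with 3 threads, though that is an assumption rather than a confirmed cause.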

The Spack package spec seems to be:

~rx3d~caliper+gpu+coreneuron~legacy-unit~openmp+shared+sympy+tests~unified build_type=FastDebug model_tests=channel-benchmark,olfactory,tqperf-heavy

Version info:

  • NVHPC 23.1
  • gcc 12.3.0
JCGoran added the bug label Nov 4, 2024
matz-e commented Nov 5, 2024

We should update NVHPC to at least 24.9, which fixes the optimized build failures, too.
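
To compare failure rates before and after a compiler upgrade, one option is to rerun the failing test many times and count non-zero exits. The following is a minimal sketch, not an existing NEURON tool: `count_failures` is a hypothetical helper, and the command shown in the demo is a harmless stand-in.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in command that always succeeds; replace it with the real test
# invocation for your build (an assumption, since it depends on the setup).
demo_command = [sys.executable, "-c", "raise SystemExit(0)"]
print("failures:", count_failures(demo_command, runs=3))
```

In a configured build environment, the command could be something like `[sys.executable, "-m", "pytest", "test/pytest_coreneuron/test_nrntest_fast.py::test_t13[cvode-3-t]"]`, assuming the test can be run directly with pytest rather than only through CTest.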
