TrickOps: Update send_hs.py to work with latest Trick output (#1810)

* Update send_hs.py to work with the new fields Trick's output produces * Add unit tests and doctests for send_hs module * Update TrickOps documentation to include info on send_hs usage * Remove MonteCarloGenerationhelper.py's unecessary dependency on send_hs Unrelated: * Update pip when testing TrickOps per pip best practices Closes #1807 Co-authored-by: Dan Jordan <daniel.d.jordan@nasa.gov>
nasa · Dec 12, 2024 · 16828a9 · 16828a9
1 parent 1330c42
commit 16828a9
Show file tree

Hide file tree

Showing 11 changed files with 574 additions and 58 deletions.
diff --git a/.github/workflows/trickops.yml b/.github/workflows/trickops.yml
@@ -20,7 +20,7 @@ jobs:
       - name: create virtual environment
         run: |
           cd share/trick/trickops/
-          python3 -m venv .venv && source .venv/bin/activate && pip3 install -r requirements.txt
+          python3 -m venv .venv && source .venv/bin/activate && pip3 install --upgrade pip && pip3 install -r requirements.txt
       - name: get and build koviz
         run: |
           cd /tmp/ && wget -q https://github.com/nasa/koviz/archive/refs/heads/master.zip && unzip master.zip

diff --git a/docs/documentation/miscellaneous_trick_tools/TrickOps.md b/docs/documentation/miscellaneous_trick_tools/TrickOps.md
@@ -423,6 +423,57 @@ if ret == 0:  # Successful generation
 Note that the number of runs to-be-generated is configured somewhere in the `input.py` code and this module cannot robustly know that information for any particular use-case. This is why `monte_dir` is a required input to several functions - this directory is processed by the module to understand how many runs were generated.
 
 
+## `send_hs` - TrickOps Helper Class for Parsing Simulation Diagnostics
+
+Each Trick simulation run directory contains a number of Trick-generated metadata files, one of which is the `send_hs` file which represents all output during the simulation run sent through the Trick "(h)ealth and (s)tatus" messaging system. At the end of the `send_hs` message, internal diagnostic information is printed by Trick that looks a lot like this:
+
+```
+|L   0|2024/11/21,15:54:20|myworkstation| |T 0|68.955000|
+     REALTIME SHUTDOWN STATS:
+            ACTUAL INIT TIME:       42.606
+         ACTUAL ELAPSED TIME:       55.551
+|L   0|2024/11/21,15:54:20|myworkstation| |T 0|68.955000|
+SIMULATION TERMINATED IN
+  PROCESS: 0
+  ROUTINE: Executive_loop_single_thread.cpp:98
+  DIAGNOSTIC: Reached termination time
+
+              SIMULATION START TIME:        0.000
+               SIMULATION STOP TIME:       68.955
+            SIMULATION ELAPSED TIME:       68.955
+                 USER CPU TIME USED:       55.690
+               SYSTEM CPU TIME USED:        0.935
+              SIMULATION / CPU TIME:        1.218
+       INITIALIZATION USER CPU TIME:       42.783
+     INITIALIZATION SYSTEM CPU TIME:        0.901
+               SIMULATION RAM USAGE:     1198.867MB
+      (External program RAM usage not included!)
+  VOLUNTARY CONTEXT SWITCHES (INIT):          792
+INVOLUNTARY CONTEXT SWITCHES (INIT):          187
+   VOLUNTARY CONTEXT SWITCHES (RUN):           97
+ INVOLUNTARY CONTEXT SWITCHES (RUN):           14
+```
+
+The information provided here is a summary of how long the simulation ran, in both wall-clock time, and cpu time, both for initialization and run-time, and other useful metrics like how much peak RAM was used during run-time execution. Tracking this information can be useful for Trick-using groups so TrickOps provides a utility class for parsing this data. Here's an example of how you might use the `send_hs` module:
+
+```python
+import send_hs  # Import the module
+# Instantiate a send_hs instance, reading the given send_hs file
+shs = send_hs.send_hs("path/to/SIM_A/RUN_01/send_hs")
+start_time      = shs.get('SIMULATION START TIME')  # Get the value of sim start time
+stop_time       = shs.get('SIMULATION STOP TIME')   # Get the value of sim stop time
+realtime_ratio  = shs.get('SIMULATION / CPU TIME')  # Get the realtime ratio (how fast the sim ran)
+
+# Instead of getting diagnostics individually, you can ask for the full dictionary
+diagnostics     = shs.get_diagnostics()
+# Print the RAM usage from the dictionary
+print(diagnostics['SIMULATION RAM USAGE'])
+```
+
+Plotting this data for regression scenarios can be quite useful - a lot of groups would want to know if the sim slowed down significantly and if so, when in the history this occurred.  If you are already invested in Jenkins CI, you might be interested in using the `send_hs` module in conjunction with the [Jenkins plot plugin](https://plugins.jenkins.io/plot/). Here's an example of what tracking sim realtime ratio over time looks like in a workflow with about 15 regression scenarios all shown on a single plot:
+
+![ExampleWorkflow In Action](images/sim_speed_example.png)
+
 ## More Information
 
 A lot of time was spent adding `python` docstrings to the modules in the `trickops/` directory and tests under the `trickops/tests/`. This README does not cover all functionality, so please see the in-code documentation and unit tests for more detailed information on the framework capabilities.

diff --git a/docs/documentation/miscellaneous_trick_tools/images/sim_speed_example.png b/docs/documentation/miscellaneous_trick_tools/images/sim_speed_example.png
diff --git a/share/trick/trickops/MonteCarloGenerationHelper.py b/share/trick/trickops/MonteCarloGenerationHelper.py
@@ -30,7 +30,6 @@
 
 import sys, os
 
-import send_hs
 import argparse, glob
 
 import subprocess, errno

diff --git a/share/trick/trickops/send_hs.py b/share/trick/trickops/send_hs.py
@@ -1,67 +1,254 @@
 import re, os
 import pdb
 
+# This global is the result of hours of frustration and debugging. This is only used by doctest
+# but appears to be the only solution to the problem of __file__ not being an absolute path in
+# some cases for some python versions and how that interacts with this class's os.chdir() when its
+# base class initializes. If you attempt to define this_trick locally in the doctest block,
+# which was my original attempt, you will find that the value of this_trick is different inside
+# the __init__ doctest evaluation compared to any other member function. I believe this is only
+# the case when using python version < 3.9, according to the information found here:
+#   https://note.nkmk.me/en/python-script-file-path/
+# I do not like adding globals to "production code" just to facilitate a testing mechanism, but
+# I don't know of any cleaner way way to do this.  -Jordan 12/2024
+this_trick = os.path.normpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '../../..'))
 class send_hs(object):
     """
-    Reads a file containing the send_hs output and returns a send_hs
-    object containing the values from that output
+    Utility class for parsing simulation diagnostic data at the end of a
+    Trick-generated send_hs output file.
     """
-    def __init__(self, hs_file):
+    def __init__(self, hs_file=None):
+        """
+        Initialize this instance.
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+
+        Parameters
+        ----------
+        hs_file : str
+            Path to the send_hs output to read and parse
+        """
         self.hs_file = hs_file
-        self.actual_init_time = None
-        self.actual_elapsed_time = None
-        self.start_time = None
-        self.stop_time = None
-        self.elapsed_time = None
-        self.actual_cpu_time_used = None
-        self.sim_cpu_time = None
-        self.init_cpu_time = None
-        self.parse()
-    def parse(self):
-        f = open(self.hs_file, 'r')
-        lines = f.readlines()
-        for line in lines:
-            self.actual_init_time = self.attempt_hs_match('ACTUAL INIT TIME',self.actual_init_time, line)
-            self.actual_elapsed_time = self.attempt_hs_match('ACTUAL ELAPSED TIME',self.actual_elapsed_time, line)
-            self.start_time = self.attempt_hs_match('SIMULATION START TIME',self.start_time, line)
-            self.stop_time = self.attempt_hs_match('SIMULATION STOP TIME',self.stop_time, line)
-            self.elapsed_time = self.attempt_hs_match('SIMULATION ELAPSED TIME',self.elapsed_time, line)
-            self.actual_cpu_time_used = self.attempt_hs_match('ACTUAL CPU TIME USED',self.actual_cpu_time_used, line)
-            self.sim_cpu_time = self.attempt_hs_match('SIMULATION / CPU TIME',self.sim_cpu_time, line)
-            self.init_cpu_time = self.attempt_hs_match('INITIALIZATION CPU TIME',self.init_cpu_time, line)
-         # TODO add capture of blade and DIAGNOSTIC: Reached termination time as success criteria
-
-    def attempt_hs_match(self, name, var, text):
+        self.num_lines = None
+        self._diagnostics = {}         # Internal dict of diagnostic keys and values
+        self._missing_diagnostics = [] # List of diagnostics we failed to find
+        for k in self.keys():
+            self._diagnostics[k] = None
+
+        if hs_file:
+            self.parse()
+
+    def is_valid(self):
+        """
+        Check for validity of the parsed send_hs file. If any expected internal members
+        were not able to be parsed, store them in missing_diagnostics and return False.
+        If everything was found, return True
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.is_valid()
+        True
+
+        Returns
+        -------
+        True if all expected internal members were parsed. False if any member wasn't found.
+
+        """
+        self._missing_diagnostics = [] # Reset the internal list
+        for d in self._diagnostics:
+            if self._diagnostics[d] is None:
+                self._missing_diagnostics.append(d)
+        if self._missing_diagnostics:
+            return False
+        else:
+            return True
+
+    def missing_diagnostics(self):
+        '''
+        Check for validity and return a list of any missing diagnostics that we were
+        unable to find in the send_hs output
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.missing_diagnostics()
+        []
+
+        Returns
+        -------
+        list
+             All diagnotics that were unable to be parsed
+        '''
+        self.is_valid()
+        return self._missing_diagnostics
+
+    def parse(self, hs_file=None):
+        '''
+        Parse the content of self.hs_file and assign internal variables for each field found
+        if hs_file is given, overwrite self.hs_file with it's value
+
+
+        >>> sh = send_hs()
+        >>> sh.parse(os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh2 = send_hs()
+        >>> sh2.parse()
+        Traceback (most recent call last):
+           ...
+        RuntimeError: send_hs file: 'None' cannot be read. You must provide a valid send_hs output file.
+
+        Parameters
+        ----------
+        hs_file : str
+            Path to hs_file to parse. If None, self.hs_file is used instead.
+
+        Raises
+        ------
+        RuntimeError
+            If no send_hs file was provided at construction or when calling this function
+        '''
+        if hs_file:
+            self.hs_file = hs_file
+
+        if not self.hs_file or not os.path.isfile(self.hs_file):
+            raise RuntimeError(f"send_hs file: '{self.hs_file}' cannot be read."
+             " You must provide a valid send_hs output file.")
+
+        self.num_lines = 0
+        with open(self.hs_file) as fp:
+          # Efficiency guard - we don't do an re pattern match until we've starting parsing the diagnostic msg
+          start_reading_diagnostics = False
+          for line in fp:
+            self.num_lines += 1
+            #  NOTE this "attempt match" approach is less efficient but it should be
+            # robust to a future ordering change of the lines from Trick's output.
+            if start_reading_diagnostics:
+              for d in self._diagnostics:
+                  if self._diagnostics[d] is None:
+                      self._diagnostics[d] = self._attempt_hs_match(d, line)
+            # This text precedes the diagnostics output, use it as the trigger to start parsing
+            if 'REALTIME SHUTDOWN STATS:' in line:
+              start_reading_diagnostics = True
+              continue
+        # Set validity status immediately after parsing
+        self.is_valid()
+
+    def _attempt_hs_match(self, name, text):
         """
-        name: pattern to match (e.g. SIMULATION START TIME)
-        var:  variable to assign value if match found
-        text: text to search for pattern
-        returns: var if not found, found value if found
+        Parameters
+        ----------
+        hs_file : str
+            Path to hs_file to parse. If None, self.hs_file is used instead.
+        name : str
+            Pattern to match (e.g. SIMULATION START TIME)
+        text : str
+            Text to search for pattern within
+
+        Returns
+        ----------
+        float  or None
+            Value of name if found, else: None
         """
+        name = name.replace('(', '\(').replace(')', '\)')
         m = re.match(name+': +([-]?[0-9]*\.?[0-9]+)', text.strip())
         if m:
             return(float(m.group(1)))
-        return(var)
-
-    def get(self,name):
-       """
-       Get a value by the name that appears in the send_hs message
-       """
-       if 'ACTUAL INIT TIME' in name:
-           return self.actual_init_time
-       if 'ACTUAL ELAPSED TIME' in name:
-           return self.actual_elapsed_time
-       if 'SIMULATION START TIME' in name:
-           return self.start_time
-       if 'SIMULATION STOP TIME' in name:
-           return self.stop_time
-       if 'SIMULATION ELAPSED TIME' in name:
-           return self.elapsed_time
-       if 'ACTUAL CPU TIME USED' in name:
-           return self.actual_cpu_time_used
-       if 'SIMULATION / CPU TIME' in name:
-           return self.sim_cpu_time
-       if 'INITIALIZATION CPU TIME' in name:
-           return self.init_cpu_time
-       else:
-           return None
+        else:
+            return(None)
+
+    def keys(self) -> list:
+        """
+        Return a list of all possible keys/names that get() can accept This is
+        the master list of all diagnostics we search for when we parse the send_hs
+        file, and naming should match 1:1 with the output of a Trick simulation
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.keys() #doctest: +ELLIPSIS
+        ['ACTUAL INIT TIME', ... 'INVOLUNTARY CONTEXT SWITCHES (RUN)']
+
+        Returns
+        -------
+        list
+            All possible diagnostic names that get() can accept
+        """
+        return (['ACTUAL INIT TIME',
+                 'ACTUAL ELAPSED TIME',
+                 'SIMULATION START TIME',
+                 'SIMULATION STOP TIME',
+                 'SIMULATION ELAPSED TIME',
+                 'USER CPU TIME USED',
+                 'SYSTEM CPU TIME USED',
+                 'SIMULATION / CPU TIME',
+                 'INITIALIZATION USER CPU TIME',
+                 'INITIALIZATION SYSTEM CPU TIME',
+                 'SIMULATION RAM USAGE',
+                 'VOLUNTARY CONTEXT SWITCHES (INIT)',
+                 'INVOLUNTARY CONTEXT SWITCHES (INIT)',
+                 'VOLUNTARY CONTEXT SWITCHES (RUN)',
+                 'INVOLUNTARY CONTEXT SWITCHES (RUN)'
+               ])
+
+    def get_diagnostic (self, name: str) -> float:
+        """
+        Get a diagnostic value by it's name or partial name
+        The first name matched in the self._diagnostics dict will be returned
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.get_diagnostic('ACTUAL INIT TIME')
+        42.606
+        >>> sh.get_diagnostic('SIMULATION RAM USAGE')
+        1198.867
+
+        Parameters
+        ----------
+        name : str
+            Name or partial name of diagnostic to retrieve
+
+        Returns
+        -------
+        float
+            Value of diagnostic name given
+
+        Raises
+        ------
+        LookupError
+            If name cannot be found in self._diagnostics
+        """
+
+        for d in self._diagnostics:
+            if name in d:
+                return(self._diagnostics[d])
+        raise LookupError(f"Unable to get diagnostic '{name}'. Is it spelled correctly?")
+
+    def get(self, name: str) -> float:
+        """
+        Get a diagnostic value by it's name or partial name. Convienence function that
+        calls self.get() directly.
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.get('ACTUAL INIT TIME')
+        42.606
+        >>> sh.get('SIMULATION RAM USAGE')
+        1198.867
+
+        Returns
+        -------
+        float
+            Value of diagnostic name given
+        """
+        return self.get_diagnostic(name)
+
+    @property
+    def diagnostics (self) -> dict:
+        return dict(self._diagnostics)
+
+    def get_diagnostics (self) -> dict:
+        """
+        Get all diagnostics stored in internal self._diagnostics dictionary
+
+        >>> sh = send_hs(hs_file=os.path.join(this_trick,"share/trick/trickops/tests/send_hs.nominal"))
+        >>> sh.get_diagnostics() #doctest: +ELLIPSIS
+        {'ACTUAL INIT TIME': 42.606, 'ACTUAL ELAPSED TIME': 55.551, ... 14.0}
+
+        Returns
+        -------
+        dict
+            A copy of the internal self._diagnostics dictionary
+        """
+        return (self.diagnostics)
diff --git a/share/trick/trickops/tests/run_tests.py b/share/trick/trickops/tests/run_tests.py
@@ -34,7 +34,7 @@ def run_tests(args):
 
     # Run all doc tests by eating our own dogfood
     doctest_files = ['TrickWorkflow.py', 'WorkflowCommon.py', 'TrickWorkflowYamlVerifier.py',
-      'MonteCarloGenerationHelper.py']
+      'MonteCarloGenerationHelper.py', 'send_hs.py']
     wc = WorkflowCommon(this_dir, quiet=True)
     jobs = []
     log_prepend = '_doctest_log.txt'

diff --git a/share/trick/trickops/tests/send_hs.missing_all b/share/trick/trickops/tests/send_hs.missing_all
@@ -0,0 +1,5 @@
+|L   3|2024/11/21,15:54:06|myworkstation| |T 0|0.000000| TrickParamCopy(UserCodeExample): Failed to resolve parameter 'fsw_out.fswOut_block.fswOut.fsw1HzOut.domain.subdomain.variable'
+|L   2|2024/11/21,15:54:06|myworkstation| |T 0|0.000000| TrickParamCopy(UserCodeExample): Mismatched data types for 'fsw_out.fswOut_block.fswOut.fsw1HzOut.domain.subdomain.variable' -> 'another_structure.myvariable'
+|L   2|2024/11/21,15:54:07|myworkstation| |T 0|0.000000| Could not find Data Record variable flex.flex_Obj.flexMode.
+|L   2|2024/11/21,15:54:07|myworkstation| |T 0|0.000000| Could not find Data Record variable flex.flex_Obj.myarray[0].
+|L   2|2024/11/21,15:54:07|myworkstation| |T 0|0.000000| Could not find Data Record variable foo.bar.yippy.skippy.axial_sep_distance.