Fault Injection and Modeling

Thomas Stucky edited this page Jun 3, 2024 · 68 revisions

Overview

Execution and Goal Errors

A fault is a failure of some kind during lander operation. OceanWATERS models two fundamental kinds of faults: execution errors and goal errors. An execution error is a hardware failure. A goal error is failure to accomplish the objective of an operation.

Goal errors do not imply execution errors. A goal may fail even with perfectly working hardware, e.g. commanding the arm to an unreachable location, or unexpected blockage of the arm's movement. However, execution errors often result in a corresponding goal error. For example, if an arm joint locks during a scooping task, the arm execution error leads to a goal error in the scooping task.

There is a critical difference between these two kinds of faults with respect to autonomy. Execution errors cannot be repaired or cleared (if they occur via injection, described below) directly by autonomy. In a real robotic mission, hardware errors would require the performance of diagnostic and repair procedures, either by autonomy or remotely by humans. In contrast, goal errors must be cleared (removed or reset) by autonomy. Clearing a goal error may involve additional operations to mitigate the error, e.g. trying the operation differently, or simply trying it again.

The specific execution and goal errors supported in OceanWATERS are described in subsequent sections.

Automatically Clear Goal Errors

As of Release 13.1, goal errors are cleared by default whenever a subsequent action in the same category is called. Users who are not concerned with realistic fault handling may prefer this mode. It can, however, be changed to a more complex mode in which a call to FaultClear is required to clear the goal error before the next action can be performed. To select the complex behavior, untick the parameter auto_clear_goal_errors, found in the RQT GUI under Dynamic Reconfigure, in the /faults category under fault_behaviors.

Fault Categories

OceanWATERS models 5 categories or groupings of faults.

  1. Arm faults
  2. Camera faults
  3. Antenna pan/tilt faults
  4. Power system faults
  5. System faults

Each category is a set of flags (Boolean variables), one for each fault in that category. The last category, system faults, generalizes the first 4 and adds a few fault types that are not subsystem-specific. All 5 categories include execution errors. By convention, only the system faults category includes goal errors (from all subsystems). When a fault in categories 1-4 occurs (i.e. its flag becomes true), one or more corresponding fault flags in system faults are also set.

These fault categories and all supported fault types in OceanWATERS are formalized as ROS message types, described in the Fault Flags section below. This section also describes the fault flag interdependencies mentioned above.

Note that there is not yet a category for Force/Torque sensor faults.

Fault Injection and Detection

OceanWATERS has components for fault injection and for fault detection.

The fault injection facility allows simulation of degraded lander systems. A variety of faults are injectable by the user, and any of these could occur naturally if lander systems are sufficiently stressed. At present, only execution errors (hardware faults) can be injected; goal errors cannot.

Specific kinds of failures in the lander arm, antenna, power system, camera, and force-torque sensor can be injected dynamically anytime after the simulator is started. Faults may be injected using the RQT GUI, the command line, or scripts.

The fault detection facility is representative of such a component on a real lander. This module monitors lander telemetry and reports faults that would be detectable at the hardware level. Most injectable faults that occur during a simulation, whether injected by the user or as the result of lander operations or lander/environment conditions, are detected by this module. An exception is in the power model system, where the injected and detected faults are different: the latter are the result of the former.

OceanWATERS comes with simple illustrative fault detection plans (written in PLEXIL) that recognize solely the detectable faults. More advanced and higher levels of fault detection would need to be realized in the onboard autonomy.

The high-level design of fault injection and detection and its mechanisms in OceanWATERS, along with the role of autonomy, is shown in the following diagram.

Faults Architecture

Arm Faults

There are various types of arm faults that can take place in a real robotic arm. The list includes free-swinging joints, locked joints, incorrect measured joint position, and incorrect measured joint velocity. OceanWATERS only supports locked joint failure at this time.

Locked Joint Fault

This fault can be injected on each of the joints individually by setting any of the following flags:

dist_pitch_joint_locked_failure
hand_yaw_joint_locked_failure
prox_pitch_joint_locked_failure
scoop_yaw_joint_locked_failure
shou_pitch_joint_locked_failure
shou_yaw_joint_locked_failure

Once the fault is injected, the affected joint stops moving immediately. The state also propagates immediately to the arm firmware, which stops the arm altogether if a motion is in progress. The arm remains at a standstill until the locked joint failures are cleared.

Inhibit arm stop

There is an option applicable to all joint lock failure faults that allows the arm to continue moving even when a joint is locked. Specifically, all joints except for the locked joint(s) will continue moving if the arm is already in motion when the fault(s) are injected. Note that this may result in erroneous and unexpected arm motion.

This option was added to allow autonomy to handle joint locked failures in any desired fashion, rather than stopping the arm entirely in the simulator.

The flag to set this behavior is:

arm_motion_continues_in_fault

NOTE: This feature has not yet been fully characterized or extensively tested.

Antenna Faults

Similar to the arm faults, OceanWATERS currently only implements a locked joint type of failure for the antenna. The antenna has two joints and each of the joints can have the failure injected individually.

ant_pan_joint_locked_failure
ant_tilt_joint_locked_failure

Unlike the arm failure, when a locked joint failure is injected on either of the two joints, this does not result in stopping all motion of the antenna. Instead, only the affected joint will stop moving. Once the specific failure is cleared the antenna will resume the last issued antenna command.

Power Faults

OceanWATERS supports the injection of additional power draw through designating either a flat amount or a profile with set values, the disconnection of one or more battery nodes, and artificial conditions to trigger various power faults.

OceanWATERS provides detection of three kinds of power (battery) faults: low state-of-charge (SOC), thermal fault, and instantaneous capacity loss. Note that any of these may result from a sufficiently high power draw.

Injectable Power Faults

High Power Draw

A sudden high power draw can be injected by specifying a wattage to be added to the lander's current power draw (the high_power_draw parameter), and setting the activate_high_power_draw fault flag to True. The command for each of these, respectively, is as follows. The added power draw can be any value between 1 and 360 watts.

rosrun dynamic_reconfigure dynparam set /faults high_power_draw [1.0-360.0]
rosrun dynamic_reconfigure dynparam set /faults activate_high_power_draw True

RQT provides a graphical user interface for the same:

RQT faults interface

The high_power_draw selection has a slider bar that can be adjusted between 1W and 360W, and the exact wattage desired can also be typed into the box to its right. The fault is injected by checking the activate_high_power_draw box.

Using either the command line or RQT, the wattage increment can be dynamically adjusted while the fault is active.

Note: Previous versions of the power system used a single-cell model from GSAP, which has an upper limit on the wattage it can handle. The current implementation utilizes a 6S1P model, which can handle much higher wattages, though it is currently unknown how high the power draw can go before the new model's behavior deteriorates. At present, the cap is set to 6 times the single-cell model's previous cap of 15W, for 90W per model or 360W total with four models. If the total power draw to a model exceeds 90W, a warning is printed in the simulator's terminal and the power draw is capped.

High power draw, whether or not it occurs through fault injection, depletes the battery's charge more quickly and causes its temperature to rise at an accelerated rate. Unless the power draw is removed (e.g. fault cleared) in time, one or more of the three detectable power faults (described in the Detectable Power Faults section) will be triggered.

Custom Power Faults

A customized power fault can be modeled using a fault profile that characterizes the fault. A fault profile is an ordered list of power values that is added to the prognoser's current values for these inputs.

Fault profiles are specified in comma-separated-values (CSV) files, which must be placed in ow_simulator/ow_power_system/profiles. The file has two columns: index & wattage. A single row is applied every cycle of the main power system loop, whose frequency is dependent on the loop_rate setting within ow_simulator/ow_power_system/config/system.cfg. For example, if the loop rate is 2Hz, the fault profile will read & apply 2 rows of wattage per second.
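As a rough sketch of preparing such a file, the following Python builds a two-column profile and estimates how long it would run at a given loop rate. The absence of a header row, the zero-based index, and the example file name in the comment are assumptions made for illustration; compare against an existing profile in ow_simulator/ow_power_system/profiles for the exact layout the simulator expects.

```python
import csv
import io

def build_profile_rows(wattages):
    """Pair each wattage with a running index, one row per power loop cycle."""
    return [(i, float(w)) for i, w in enumerate(wattages)]

def profile_duration_seconds(num_rows, loop_rate_hz):
    """One row is consumed per cycle, so duration = rows / loop rate."""
    return num_rows / loop_rate_hz

def write_profile(rows, stream):
    """Write rows as CSV (index, wattage). Omitting a header is an assumption."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)

# Ramp from 10 W to 50 W, holding each level for 4 cycles (20 rows in total).
rows = build_profile_rows([w for w in (10, 20, 30, 40, 50) for _ in range(4)])

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use e.g. open("my_profile.csv", "w", newline="") instead (hypothetical name).
buffer = io.StringIO()
write_profile(rows, buffer)

print(buffer.getvalue().splitlines()[:3])
print(profile_duration_seconds(len(rows), loop_rate_hz=2.0), "seconds at 2 Hz")
```

At a 2Hz loop rate, the 20 rows above would be consumed in 10 seconds, after which the custom fault deactivates.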

A custom fault is injected by specifying its profile's filename and then activating the custom fault flag. The commands for these, respectively, are the following.

rosrun dynamic_reconfigure dynparam set /faults custom_fault_name [file_name]
rosrun dynamic_reconfigure dynparam set /faults activate_custom_fault True

RQT provides a graphical user interface for the same -- see the image in the previous section. Type the desired profile's filename into custom_fault_profile (see footnote 1) and then check activate_custom_fault to inject the fault.

When a fault profile is exhausted (i.e. all its lines have been processed), the custom fault will be deactivated. It may be reactivated, and the profile will start from the beginning.

If a custom fault is deactivated before it is exhausted and then reactivated later, the profile will start from the beginning.

Battery Disconnection Fault

The user has the option to simulate losing connection to some of the parallel battery cells by setting battery_nodes_to_disconnect and enabling it via disconnect_battery_nodes, either in the terminal or the RQT interface:

rosrun dynamic_reconfigure dynparam set /faults battery_nodes_to_disconnect [1-3]
rosrun dynamic_reconfigure dynparam set /faults disconnect_battery_nodes True

When a cell is disconnected, its simulation is irrevocably terminated and any additional power draw applied to the battery is redistributed across the remaining cells. While this does not result in an immediate decline of battery health, the battery will deteriorate much more quickly under any future power draw. The power system simulates up to four 6S1P models and up to three of them can be disconnected via this fault.

Note that disconnected nodes, by design, cannot be recovered in any way. Disabling or reducing the values in the fault after a prior activation does not have any effect.

Artificial Power Faults

The simulator monitors for three kinds of power faults: low state of charge, instantaneous capacity loss, and thermal failure. The specific conditions of each of these faults are described in the later Detectable Power Faults section.

While these power faults can occur naturally (e.g. the battery will eventually reach a low state of charge given a long enough simulation), they can also be artificially injected via the RQT window (or in the terminal) in order to force the battery's outputs to a state that triggers the relevant fault flag immediately:

rosrun dynamic_reconfigure dynparam set /faults low_state_of_charge True
rosrun dynamic_reconfigure dynparam set /faults instantaneous_capacity_loss True
rosrun dynamic_reconfigure dynparam set /faults thermal_failure True

This allows users to test the behavior associated with these faults, but they are not realistic. The battery will continue generating normal predictions heedless of the artificial faults; its outputs are simply suppressed while any artificial faults are active. The battery will return to its previous state without any lingering behavior once the user removes the faults.

Combining Power Faults

High power draw and custom faults can be combined, in any order, and each can be deactivated and reactivated as desired. The total wattage increment at each timepoint will be that of the high power draw added to that of the custom fault for that timepoint.

Battery disconnection will work fully with high power draw and custom faults; the draw applied to each node will update as needed.
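The arithmetic behind these combinations can be sketched as follows. This is an illustration only, not the simulator's code: it assumes the draw is split evenly across the connected nodes, and it considers only the injected wattage, ignoring the lander's baseline draw. The 90W per-node cap and the four-node limit come from the sections above; the function names are hypothetical.

```python
MAX_NODES = 4            # the power system simulates up to four 6S1P models
PER_NODE_CAP_W = 90.0    # per-model cap described above (6 x 15 W)

def injected_wattage(high_power_draw_w=0.0, custom_profile_w=0.0):
    """Total injected increment at one timepoint: the two faults simply add."""
    return high_power_draw_w + custom_profile_w

def per_node_draw(total_draw_w, disconnected_nodes=0):
    """Split the draw evenly over connected nodes (assumed), capped at 90 W each."""
    if not 0 <= disconnected_nodes < MAX_NODES:
        raise ValueError("between 0 and 3 nodes can be disconnected")
    active = MAX_NODES - disconnected_nodes
    return min(total_draw_w / active, PER_NODE_CAP_W)

# 100 W of high power draw plus 20 W from the current custom profile row.
total = injected_wattage(high_power_draw_w=100.0, custom_profile_w=20.0)

print(per_node_draw(total))                        # all four nodes connected
print(per_node_draw(total, disconnected_nodes=2))  # two nodes remaining
print(per_node_draw(total, disconnected_nodes=3))  # one node remaining, capped
```

Under these assumptions, the same 120W increment loads each node progressively harder as nodes are disconnected, until the per-node cap takes effect.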

While the artificial power faults fully work with each other, they do not work with the other faults. This is because the artificial power faults override any predictions generated by GSAP, which are what the other faults influence. Once the artificial power faults are removed, the predictions return to normal, as if the faults had never been activated.

Detectable Power Faults

Three kinds of battery failures are detected by the OceanWATERS simulator. Each of them, when detected, sets the corresponding flag in the PowerFaultsStatus message (described in the Fault Flags section below), indicating a hardware failure.

Low State of Charge

A low state-of-charge power failure is an indicator that the state-of-charge (SOC) of the battery, which is the percentage of its charge remaining, has fallen below a predefined threshold. This threshold is based on the lander's operational requirements and is currently set at 0.10 (10%).

The following plot illustrates the typical effect of a sustained ~360 watt power draw on the /battery_state_of_charge telemetry.

Instantaneous Capacity Loss

An instantaneous capacity loss power failure is related to SOC and results when the difference between two consecutive SOC readings exceeds a predefined delta, currently set at 0.05 (5%).

This fault typically represents a sudden high power draw from the battery by an unknown parasitic load, which leads to a rapid drop in the battery capacity.

The following plot illustrates the expected effect of instantaneous capacity loss on the /battery_state_of_charge telemetry.

Thermal Failure

A thermal fault indicates a rise in the battery temperature above a preset safety temperature threshold, which is 70 degrees Celsius in the presently modeled battery. This threshold is such that recovery is possible if the fault is cleared before the temperature is reached; the battery should gradually cool and continue performing normal operation.

Not presently modeled is thermal runaway, i.e. the battery temperature going so high that recovery is not possible. In the present model, the battery will shut down if and when the aforementioned safety temperature threshold is reached.
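The three detection conditions can be summarized in a small sketch. It is not the fault detection module itself: the function name is hypothetical, SOC is assumed to be expressed as a fraction between 0 and 1, and the strictness of each comparison is an assumption. The thresholds are the ones stated above, and the flag values mirror the PowerFaultsStatus message listed in the Fault Flags section.

```python
# Flag values as listed for the PowerFaultsStatus message.
LOW_STATE_OF_CHARGE = 1
INSTANTANEOUS_CAPACITY_LOSS = 2
THERMAL_FAULT = 4

SOC_THRESHOLD = 0.10             # 10% state of charge
SOC_DELTA_THRESHOLD = 0.05       # 5% drop between consecutive readings
TEMPERATURE_THRESHOLD_C = 70.0   # degrees Celsius

def power_fault_flags(previous_soc, current_soc, temperature_c):
    """Return a bit field of the power faults implied by one pair of readings."""
    flags = 0
    if current_soc < SOC_THRESHOLD:
        flags |= LOW_STATE_OF_CHARGE
    if previous_soc - current_soc > SOC_DELTA_THRESHOLD:
        flags |= INSTANTANEOUS_CAPACITY_LOSS
    if temperature_c > TEMPERATURE_THRESHOLD_C:
        flags |= THERMAL_FAULT
    return flags

print(power_fault_flags(0.50, 0.49, 25.0))  # healthy battery
print(power_fault_flags(0.50, 0.40, 25.0))  # sudden 10% drop in SOC
print(power_fault_flags(0.12, 0.09, 75.0))  # low SOC and overheated
```

Because the result is a bit field, several faults can be reported at once, as in the last call.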

Remaining Useful Life

The estimated remaining useful life (RUL) of the battery, published on the topic /battery_remaining_useful_life, should degrade linearly and update depending on the lander's power draw. The following plot illustrates the estimated RUL of the battery when maximal power draw is applied. Note that as RUL decreases, GSAP responsiveness increases, so the value updates more often.

Please note that RUL will not update accordingly if Gazebo's realtime factor is increased to make the simulation run faster. This issue is being investigated as of September 22, 2023.

Battery Shutdown

If either a low SOC or thermal fault occurs, and is not cleared in time (either by deactivating an injected fault or reducing the load on the battery), the battery's condition will continue to worsen. Eventually the battery will shut down, and the OceanWATERS simulation will effectively be over (see footnote 2).

Battery shutdown is indicated by all three battery telemetries having a value of 0.0:

/battery_state_of_charge
/battery_remaining_useful_life
/battery_temperature

Specifically, when either the SOC or the battery temperature reads 0.0, all three telemetries become 0.0, indicating battery failure. Due to the stochasticity in the GSAP prognostics module, either the high temperature or low voltage threshold may be crossed first.

Force Torque Sensor Faults

The simulation currently models three types of faults that affect the behavior of the force torque (f/t) sensor:

  • zero signal
  • signal bias
  • increased signal noise

You can inject any combination of the three faults using the provided RQT faults interface (see more about RQT in the Usage section below):

The enable checkbox needs to be checked for any of the f/t sensor faults to take effect.

The following table lists each individual fault along with the observed f/t sensor response:

Fault                     Signal
Zero Signal               [plot of f/t sensor response]
Signal Bias               [plot of f/t sensor response]
Increased Signal Noise    [plot of f/t sensor response]

At this time, force torque sensor faults are not supported by the fault detection module, nor included in the fault categories and their fault flags. Their detection is a task for autonomy.

Camera Faults

OceanWATERS simulates a stereo camera and currently supports injecting a single fault, camera_left_trigger_failure. This fault simulates an event where a camera is triggered, but no image is returned.

Unlike other faults in the simulator, injecting the trigger failure alone is not sufficient for detecting this fault later. The fault is detected when the trigger is actually activated, i.e. by publishing an empty message on the topic /StereoCamera/left/image_trigger.

Similarly, when this fault is cleared, the exoneration of the fault does not occur until the camera is triggered again.

This fault works by inhibiting the publishing of the captured image, on the topic /StereoCamera/left/image_raw, after the camera is triggered.

This error will be published as a CAMERA_EXECUTION_ERROR in the SystemFaultsStatus message, and as a NO_IMAGE error in the CameraFaultsStatus message.
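The trigger-dependent detection and exoneration described above can be pictured with a small state sketch. The class and attribute names are hypothetical and this is not the simulator's implementation; only the flag values (NO_IMAGE and CAMERA_EXECUTION_ERROR) are taken from the message definitions in the Fault Flags section.

```python
NO_IMAGE = 1                  # CameraFaultsStatus flag
CAMERA_EXECUTION_ERROR = 32   # SystemFaultsStatus flag

class CameraFaultSketch:
    """Toy model: fault status changes only when the camera is triggered."""

    def __init__(self):
        self.trigger_failure_injected = False  # stands in for camera_left_trigger_failure
        self.camera_faults = 0
        self.system_faults = 0

    def trigger(self):
        """Trigger the camera; return True if an image would be published."""
        if self.trigger_failure_injected:
            # Fault is detected only now, when no image follows the trigger.
            self.camera_faults |= NO_IMAGE
            self.system_faults |= CAMERA_EXECUTION_ERROR
            return False
        # A successful capture exonerates any previously detected fault.
        self.camera_faults &= ~NO_IMAGE
        self.system_faults &= ~CAMERA_EXECUTION_ERROR
        return True

camera = CameraFaultSketch()
camera.trigger_failure_injected = True
print(camera.camera_faults)   # injected, but not yet detected
camera.trigger()
print(camera.camera_faults, camera.system_faults)   # detected after triggering
camera.trigger_failure_injected = False
print(camera.camera_faults)   # cleared, but still reported until the next trigger
camera.trigger()
print(camera.camera_faults, camera.system_faults)   # exonerated
```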

Usage

Faults can be set in rqt, on the command line, or from a Python script.

rqt

In the rqt application, look at the faults section in the Dynamic Reconfigure tab. There you can see and inject all faults available in OceanWATERS, by checking the boxes next to the desired faults.

RQT faults interface

The rqt interface is helpful for becoming familiar with faults settings, and for interactive simulation. The following sections describe how to inject faults using scripts and on the command line.

Python script

Include the dynamic_reconfigure client API:

import dynamic_reconfigure.client

Next, set faults on the faults node from anywhere in your script:

client = dynamic_reconfigure.client.Client('/faults')
# Use fault names listed under /faults; the antenna locked-joint flags are shown here.
params = { 'ant_pan_joint_locked_failure' : True,
           'ant_tilt_joint_locked_failure' : True }
config = client.update_configuration(params)
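A slightly fuller sketch of the same approach, injecting a fault and clearing it again, might look like the following. The helper names, the node name, and the 10 second delay are illustrative assumptions; the flag names are the arm locked-joint flags listed earlier on this page. The ROS imports are deferred into the functions that need them so that the dictionary helper can be used on its own.

```python
ARM_JOINT_LOCK_FLAGS = [
    'dist_pitch_joint_locked_failure',
    'hand_yaw_joint_locked_failure',
    'prox_pitch_joint_locked_failure',
    'scoop_yaw_joint_locked_failure',
    'shou_pitch_joint_locked_failure',
    'shou_yaw_joint_locked_failure',
]

def fault_params(flags, active):
    """Build a dynamic_reconfigure update that sets every flag to `active`."""
    return {flag: bool(active) for flag in flags}

def set_faults(flags, active, timeout=5.0):
    """Send the update to the /faults node (requires a running simulation)."""
    import dynamic_reconfigure.client  # deferred: only available in a ROS environment
    client = dynamic_reconfigure.client.Client('/faults', timeout=timeout)
    return client.update_configuration(fault_params(flags, active))

def demo():
    """Lock the shoulder yaw joint for 10 seconds, then clear the fault."""
    import rospy
    rospy.init_node('fault_injection_example')  # hypothetical node name
    set_faults(['shou_yaw_joint_locked_failure'], True)
    rospy.sleep(10.0)
    set_faults(['shou_yaw_joint_locked_failure'], False)
```

Calling demo() from a sourced ROS environment with the simulator running should inject and then clear the fault; fault_params(ARM_JOINT_LOCK_FLAGS, True) builds an update that locks every arm joint at once.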

Command line

List all available faults with the following command:

rosparam list /faults

To inject a particular fault:

rosrun dynamic_reconfigure dynparam set /faults ant_pan_joint_locked_failure True

Here is more information about dynparam.

NOTE: you cannot use rosparam to set faults:

rosparam set /faults/ant_pan_joint_locked_failure True

This will set the parameter on the faults node, but it will not propagate to the rest of the simulation.

Fault Flags

The fault categories introduced above are formalized as ROS message types, each associated with a ROS topic. The messages define fault flags, which are bit fields that encode specific fault conditions. Fault flags are set by the fault detection module described above. When a flag is set in the Arm, Camera, Antenna, or Power messages, one or more corresponding flags in the System message are also set. The fault flags, their message definitions, and their interrelationships are as follows.

  1. SystemFaultsStatus message, published on the /system_faults_status topic. When a flag is set in any of the subsequent messages, one or more flags here also get set. The SYSTEM flag is used for any miscellaneous system error that is not modeled explicitly.
Header header
uint64 value
uint64 NONE = 0
uint64 SYSTEM = 1
uint64 ARM_GOAL_ERROR = 2
uint64 ARM_EXECUTION_ERROR = 4
uint64 TASK_GOAL_ERROR = 8
uint64 CAMERA_GOAL_ERROR = 16
uint64 CAMERA_EXECUTION_ERROR = 32
uint64 PAN_TILT_GOAL_ERROR = 64
uint64 PAN_TILT_EXECUTION_ERROR = 128
uint64 POWER_EXECUTION_ERROR = 256
  2. ArmFaultsStatus message, published on the /arm_faults_status topic. When any of these flags is set, so is ARM_EXECUTION_ERROR, and possibly ARM_GOAL_ERROR, in SystemFaultsStatus.
Header header
uint64 value
uint64 NONE                  = 0
uint64 HARDWARE              = 1
uint64 TRAJECTORY_GENERATION = 2
uint64 COLLISION             = 4
uint64 E_STOP                = 8
uint64 POSITION_LIMIT        = 16
uint64 JOINT_TORQUE_LIMIT    = 32
uint64 VELOCITY_LIMIT        = 64
uint64 NO_FORCE_DATA         = 128
uint64 FORCE_TORQUE_LIMIT    = 256
  3. PanTiltFaultsStatus message, published on the /pan_tilt_faults_status topic. When any of these flags is set, so is PAN_TILT_EXECUTION_ERROR, and possibly PAN_TILT_GOAL_ERROR, in SystemFaultsStatus.
Header header
uint64 value
uint64 NONE = 0
uint64 PAN_JOINT_LOCKED = 1
uint64 TILT_JOINT_LOCKED = 2
  4. PowerFaultsStatus message, published on the /power_faults_status topic. When any of these flags is set, so is POWER_EXECUTION_ERROR in SystemFaultsStatus.
Header header
uint64 value
uint64 NONE                        = 0
uint64 LOW_STATE_OF_CHARGE         = 1
uint64 INSTANTANEOUS_CAPACITY_LOSS = 2
uint64 THERMAL_FAULT               = 4
  5. CameraFaultsStatus message, published on the /camera_faults_status topic. When the sole fault here is active, so is CAMERA_EXECUTION_ERROR, and possibly CAMERA_GOAL_ERROR, in SystemFaultsStatus.
Header header
uint64 value
uint64 NONE     = 0
uint64 NO_IMAGE = 1
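
Because each message's value field is a bit field, several faults can be active in a single message. The following sketch shows one way to decode such a value into flag names. It uses only the SystemFaultsStatus constants listed above; the function name is hypothetical, and in a ROS node the constants would normally be read from the generated message class rather than retyped.

```python
# Flag values copied from the SystemFaultsStatus definition above.
SYSTEM_FAULT_NAMES = {
    1: 'SYSTEM',
    2: 'ARM_GOAL_ERROR',
    4: 'ARM_EXECUTION_ERROR',
    8: 'TASK_GOAL_ERROR',
    16: 'CAMERA_GOAL_ERROR',
    32: 'CAMERA_EXECUTION_ERROR',
    64: 'PAN_TILT_GOAL_ERROR',
    128: 'PAN_TILT_EXECUTION_ERROR',
    256: 'POWER_EXECUTION_ERROR',
}

def decode_flags(value, names):
    """Return the names of all flags whose bit is set in `value`."""
    return [name for bit, name in sorted(names.items()) if value & bit]

# 36 = 4 + 32: an arm execution error and a camera execution error together.
print(decode_flags(36, SYSTEM_FAULT_NAMES))
# A value of 0 (NONE) means no faults are active.
print(decode_flags(0, SYSTEM_FAULT_NAMES))
```

The same function works for the other four messages when given their respective flag tables.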

Altering and remapping telemetry

In essence, fault injection alters lander telemetry, giving it a "faulty" value. In some cases, telemetry is remapped such that the original telemetry is preserved on a "hidden" topic (used by the simulator). For example, when an arm fault is injected, /joint_states is renamed to /_original/joint_states. This 'original' topic is routed first to the faults node, which injects any faults to the 'new' topic /joint_states. All other nodes that previously subscribed to the /joint_states topic are now receiving the version that the faults node outputs. This is further illustrated as follows.

Prior to fault injection: Gazebo -> /joint_states (original) -> joint_state_publisher

With injected fault:

Gazebo -> /_original/joint_states (original)
       -> faults (where faults are injected)
       -> /joint_states (new)
       -> joint_state_publisher

  1. A file selection dialog would be more convenient and less error-prone than typing in the filename, but no mechanism for this in rqt was easily found at the time of development.

  2. Unfortunately, lander operations are power-agnostic at this time. They will continue to operate as commanded, regardless of the state of the battery. Logic to terminate lander operations under low power conditions is up to autonomy, e.g. PLEXIL. Exemplary power mitigation PLEXIL plans are under development at the time of this writing.