Alias: http://resiliencepapers.club (thanks to John Allspaw).
This file contains notes about people active in resilience engineering, organized alphabetically. I'm using these notes to help get my head around the players and concepts.
You might also be interested in my notes on David Woods's Resilience Engineering short course.
For each person, I list concepts that they reference in their writings, along with some publications. The publication lists aren't comprehensive: they're ones I've read or have added to my to-read list.
- John Allspaw
- Lisanne Bainbridge
- Andrea Baker
- Johan Bergström
- Todd Conklin
- Richard I. Cook
- Sidney Dekker
- John C. Doyle
- Bob Edwards
- Anders Ericsson
- Meir Finkel
- Ivonne Andrade Herrera
- Erik Hollnagel
- Leila Johannesen
- Gary Klein
- Nancy Leveson
- Anne-Sophie Nyssen
- Elinor Ostrom
- Jean Pariès
- Emily Patterson
- Charles Perrow
- Shawna J. Perry
- Jens Rasmussen
- James Reason
- Nadine Sarter
- James C. Scott
- Steven Shorrock
- Diane Vaughan
- Robert L. Wears
- David Woods
- John Wreathall
Some big ideas:
- The adaptive universe (David Woods)
- Dynamic safety model (Jens Rasmussen)
- Safety-II (Erik Hollnagel)
- Graceful extensibility (David Woods)
- ETTO: Efficiency-thoroughness tradeoff principle (Erik Hollnagel)
- Drift into failure (Sidney Dekker)
- Robust yet fragile (John C. Doyle)
- STAMP: Systems-Theoretic Accident Model & Process (Nancy Leveson)
- Polycentric governance (Elinor Ostrom)
Allspaw is the former CTO of Etsy. He applies concepts from resilience engineering to the tech industry. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.
Allspaw tweets as @allspaw.
- Trade-Offs Under Pressure: Heuristics and Observations Of Teams Resolving Internet Service Outages
- Etsy Debrief Facilitation Guide
- Blameless PostMortems and a Just Culture (blog)
- Resilience engineering: learning to embrace failure
- Fault Injection in Production: Making the case for resiliency testing
- Incidents as we Imagine Them Versus How They Actually Are
- Problem detection (papers we love) (presentation of Problem detection paper)
- Common Ground and Coordination in Joint Activity (papers we love) (presentation of Common Ground and Coordination in Joint Activity paper)
Bainbridge is (was?) a psychology researcher. (I have not been able to find any recent information about her.)
Bainbridge is famous for her 1983 Ironies of automation paper, which continues to be frequently cited.
- automation
- design errors
- human factors / ergonomics
- cognitive modelling
- cognitive architecture
- mental workload
- situation awareness
- cognitive error
- skill and training
- interface design
Baker is a practitioner who provides training services in human and organizational performance (HOP) and learning teams.
Baker tweets as @thehopmentor.
- Human and organizational performance (HOP)
- Learning teams
- Industrial empathy
- A bit about HOP (editorial)
- A short introduction to human and organizational performance (hop) and learning teams (blog post)
Bergström is a safety researcher and consultant. He runs the Master Program of Human Factors and Systems Safety at Lund University.
Bergström tweets as @bergstrom_johan.
- Analytical traps in accident investigation
- Counterfactual reasoning
- Normative language
- Mechanistic reasoning
- Resilience engineering: Current status of the research and future challenges
- Rule- and role retreat: An empirical study of procedures and resilience
Conklin's books are on my reading list, but I haven't read anything by him yet. I have listened to his great Preaccident investigation podcast.
Conklin tweets as @preaccident.
- Pre-accident investigations: an introduction to organizational safety
- Pre-accident investigations: better questions - an applied approach to operational learning
Cook is a medical doctor who studies failures in complex systems. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.
Cook tweets as @ri_cook.
- complex systems
- degraded mode
- sharp end (cf. Reason's blunt end)
- Going solid
- Cycle of error
- "new look"
- How complex systems fail
- Where complex systems fail
- Distancing through differencing: An obstacle to organizational learning following accidents
- Being bumpable
- Behind Human Error
- Incidents - markers of resilience or brittleness?
- “Going solid”: a model of system dynamics and consequences for patient safety
- Operating at the Sharp End: The Complexity of Human Error
- Patient boarding in the emergency department as a symptom of complexity-induced risks
- Sensemaking, Safety, and Cooperative Work in the Intensive Care Unit
- Medication Reconciliation Is a Window into “Ordinary” Work
- Cognitive consequences of clumsy automation on high workload, high consequence human performance
- Implications of automation surprises in aviation for the future of total intravenous anesthesia (TIVA)
- The Messy Details: Insights From the Study of Technical Work in Healthcare
- Nosocomial automation: technology-induced complexity and human performance
- The New Look at Error, Safety, and Failure: A Primer for Health Care
- Grounding explanations in evolving, diagnostic situations
- A Tale of Two Stories: Contrasting Views of Patient Safety
Dekker is a human factors and safety researcher with a background in aviation. His books aimed at a lay audience (Drift Into Failure, The Field Guide to 'Human Error' investigations) have been enormously influential. His PhD advisor is David Woods.
Dekker tweets as @sidneydekkercom.
Dekker developed the theory of drift, characterized by five concepts:
- Scarcity and competition
- Decrementalism, or small steps
- Sensitive dependence on initial conditions
- Unruly technology
- Contribution of the protective structure
- Drift into failure
- Safety differently
- New view vs old view of human performance
- Just culture
- complexity
- broken part
- Newton-Descartes
- diversity
- systems theory
- unruly technology
- decrementalism
- Drift into failure
- Reconstructing human contributions to accidents: the new view on error and performance
- The field guide to 'human error' investigations
- Behind Human Error
- Rule- and role retreat: An empirical study of procedures and resilience
- Anticipating the effects of technological change: A new era of dynamics for human factors
Doyle is a control systems researcher. He is seeking to identify the universal laws that capture the behavior of resilient systems, and is concerned with the architecture of such systems.
- Robust yet fragile
- layered architectures
- constraints that deconstrain
- protocol-based architectures
- emergent constraints
- Universal laws and architectures
- conservation laws
- universal architectures
- Highly optimized tolerance
- Universal Laws and Architectures (slides)
- Contrasting Views of Complexity and Their Implications For Network-Centric Infrastructures
- Architecture, constraints, and behavior
- The “robust yet fragile” nature of the Internet
- Highly Optimized Tolerance: Robustness and Design in Complex Systems
- Robust efficiency and actuator saturation explain healthy heart rate control and variability
Edwards is a practitioner who provides training services in human and organizational performance (HOP).
Edwards tweets as @thehopcoach.
Ericsson introduced the idea of deliberate practice as a mechanism for achieving a high level of expertise.
Ericsson isn't directly associated with the field of resilience engineering. However, Gary Klein's work is informed by his, and I have a particular interest in how people improve in expertise, so I'm including him here.
- Expertise
- Deliberate practice
- Protocol analysis
Finkel is a Colonel in the Israel Defense Forces (IDF) and the Director of the IDF's Ground Forces Concept Development and Doctrine Department.
Herrera is an associate professor in the department of industrial economics and technology management at NTNU and a senior research scientist at SINTEF. Her areas of expertise include safety management and resilience engineering in avionics and air traffic management.
Hollnagel proposed that there is always a fundamental tradeoff between efficiency and thoroughness, which he called the ETTO principle.
Safety-I: avoiding things that go wrong
- looking at what goes wrong
- bimodal view of work and activities (acceptable vs unacceptable)
- find-and-fix approach
- prevent transition from 'normal' to 'abnormal'
- causality credo: the belief that adverse outcomes happen because something goes wrong (they have causes that can be found and treated)
- it either works or it doesn't
- systems are decomposable
- functioning is bimodal
Safety-II: performance variability rather than bimodality
- the system’s ability to succeed under varying conditions, so that the number of intended and acceptable outcomes (in other words, everyday activities) is as high as possible
- performance is always variable
- performance variation is ubiquitous
- things that go right
- focus on frequent events
- remain sensitive to possibility of failure
- be thorough as well as efficient
Hollnagel proposed the Functional Resonance Analysis Method (FRAM) for modeling complex socio-technical systems.
- ETTO (efficiency thoroughness tradeoff) principle
- FRAM (functional resonance analysis method)
- Safety-I and Safety-II
- things that go wrong vs things that go right
- causality credo
- performance variability
- bimodality
- emergence
- work-as-imagined vs. work-as-done
- joint cognitive systems
- The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong
- From Safety-I to Safety-II: A White Paper
- Safety-II in Practice
- Safety-I and Safety-II: The past and future of safety management
- FRAM: The Functional Resonance Analysis Method: Modelling Complex Socio-technical Systems
- Joint Cognitive Systems: Patterns in Cognitive Systems Engineering
- Resilience Engineering: Concepts and Precepts
- I want to believe: some myths about the management of industrial safety
- Resilience engineering – Building a Culture of Resilience (slides)
Johannesen is currently a UX researcher and community advocate at IBM. Her PhD dissertation work examined how humans cooperate, including studies of anesthesiologists.
- common ground
- Grounding explanations in evolving, diagnostic situations
- Maintaining common ground: an analysis of cooperative communication in the operating room
Klein studies how experts are able to quickly make effective decisions in high-tempo situations.
Klein tweets as @KleInsight.
- naturalistic decision making (NDM)
- intuitive expertise
- cognitive task analysis
- common ground
- problem detection
- automation as a "team player"
- Sources of power: how people make decisions
- Working minds: a practitioner's guide to cognitive task analysis
- Patterns in Cooperative Cognition
- Common Ground and Coordination in Joint Activity
- Can We Trust Best Practices? Six Cognitive Challenges of Evidence-Based Approaches
- Conditions for intuitive expertise: a failure to disagree
- Problem detection
- Ten challenges for making automation a team player
Nancy Leveson is a computer science researcher with a focus on software safety.
Leveson developed the accident causality model known as STAMP: the Systems-Theoretic Accident Model and Process.
See STAMP for some more detailed notes of mine.
- Software safety
- STAMP (systems-theoretic accident model and processes)
- STPA (system-theoretic process analysis) hazard analysis technique
- CAST (causal analysis based on STAMP) accident analysis technique
- Systems thinking
- hazard
- interactive complexity
- system accident
- dysfunctional interactions
- safety constraints
- control structure
- dead time
- time constants
- feedback delays
- A New Accident Model for Engineering Safer Systems
- Engineering a safer world
- STPA Handbook
- Safeware
- Resilience Engineering: Concepts and Precepts
- High-pressure steam engines and computer software
Nyssen is a psychology professor at the University of Liège, who does research on human error in complex systems, in particular in medicine.
A list of publications can be found on her website linked above.
Ostrom was a Nobel Prize-winning economics and political science researcher.
- Coping with tragedies of the commons
- Governing the Commons: The Evolution of Institutions for Collective Action
- tragedy of the commons
- polycentric governance
- social-ecological system framework
Pariès is the president of Dédale, a safety and human factors consultancy.
Patterson is a researcher who applies human factors engineering to improve patient safety in healthcare.
Perrow is a sociologist who studied the Three Mile Island disaster.
- Normal accidents
- Common-mode
Perry is a medical researcher who studies emergency medicine.
- Underground adaptations
- Articulated functions vs. important functions
- Unintended effects
- Apparent success vs real success
- Exceptions
- Dynamic environments
- Underground adaptations: case studies from health care
- Can We Trust Best Practices? Six Cognitive Challenges of Evidence-Based Approaches
Jens Rasmussen was a very influential researcher in human factors and safety systems.
TBD
Rasmussen proposed a state-based model of a socio-technical system as a system that moves within a region of a state space. The region is surrounded by different boundaries:
- economic failure
- unacceptable work load
- functionally acceptable performance
Source: Risk management in a dynamic society: a modelling problem
Incentives push the system towards the boundary of acceptable performance: accidents happen when the boundary is exceeded.
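As a rough illustration of the drift described above, here is a minimal sketch in Python. It is not Rasmussen's own formulation: it collapses the state space to a single dimension, and the function name `simulate_drift` and its parameters (`start`, `pressure`, `noise`, `steps`, `seed`) are all hypothetical choices made for this example.

```python
import random


def simulate_drift(start=0.2, pressure=0.02, noise=0.05, steps=1000, seed=0):
    """Simulate an operating point drifting through a 1-D state space.

    Assumed (hypothetical) layout of the space:
      0.0 stands in for the economic-failure and unacceptable-workload boundaries
      1.0 stands in for the boundary of functionally acceptable performance

    Returns the step at which the point reaches the acceptable-performance
    boundary (an "accident" in this toy model), or None if it never does.
    """
    rng = random.Random(seed)
    position = start
    for step in range(1, steps + 1):
        # Incentives push away from the 0.0 side; local variation adds jitter.
        position += pressure + rng.uniform(-noise, noise)
        # The 0.0 boundaries are treated as a floor the point cannot go below.
        position = max(position, 0.0)
        if position >= 1.0:
            return step
    return None


if __name__ == "__main__":
    print("with pressure:   ", simulate_drift())
    print("without pressure:", simulate_drift(pressure=0.0, noise=0.0))
```

With a steady pressure term the point eventually reaches the boundary no matter where it starts; with no pressure and no noise it stays where it is. The real model has several interacting gradients and a region rather than a line, so treat this only as a way to build intuition.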
TBD
Rasmussen proposed a multi-layer view of socio-technical systems:
Source: Risk management in a dynamic society: a modelling problem
- Dynamic safety model
- Migration toward accidents
- Risk management framework
- Boundaries:
- boundary of functionally acceptable performance
- boundary to economic failure
- boundary to unacceptable work load
- Cognitive systems engineering
- Skill-rule-knowledge (SRK) model
- AcciMaps
- Means-ends hierarchy
- Ecological interface design
- Systems approach
- Control-theoretic
- decisions, acts, and errors
- hazard source
- anatomy of accidents
- energy
- systems thinking
- trial and error experiments
- defence in depth (fallacy)
- Role of managers
- Information
- Competency
- Awareness
- Commitment
- Going solid
- Reflecting on Jens Rasmussen’s legacy. A strong program for a hard problem (my notes)
- Risk management in a dynamic society: a modelling problem
- Coping with complexity
- “Going solid”: a model of system dynamics and consequences for patient safety
- Human error and the problem of causality in analysis of accidents
Reason is a psychology researcher who did work on understanding and categorizing human error.
Reason developed an accident causation model that is sometimes known as the Swiss cheese model of accidents. In this model, Reason introduced the terms "sharp end" and "blunt end".
Reason developed a model of the types of errors that humans make:
- slips
- lapses
- mistakes
- Blunt end
- Human error
- Slips, lapses and mistakes
- Swiss cheese model
Sarter is a researcher in industrial and operations engineering. She is the director of the Center for Ergonomics at the University of Michigan.
- cognitive ergonomics
- organization safety
- human-automation/robot interaction
- human error / error management
- attention / interruption management
- design of decision support systems
- Learning from Automation Surprises and "Going Sour" Accidents: Progress on Human-Centered Automation
- Behind Human Error
- Design-Induced Error and Error-Informed Design: A Two-Way Street
Wears was a medical researcher who studied emergency medicine.
- Underground adaptations
- Articulated functions vs. important functions
- Unintended effects
- Apparent success vs real success
- Exceptions
- Dynamic environments
- Systems of care are intrinsically hazardous
- The error of counting "errors"
- Underground adaptations: case studies from health care
- Fundamental or Situational Surprise: A Case Study With Implications For Resilience
- Replacing Hindsight With Insight: Toward Better Understanding of Diagnostic Failures
- Seeing patient safety ‘Like a State’
Scott is an anthropologist who also does research in political science. While Scott is not a member of the resilience engineering community, his book Seeing Like a State has long been a staple of the cognitive systems engineering and resilience engineering communities.
- authoritarian high-modernism
- legibility
- mētis
Shorrock is a chartered psychologist and a chartered ergonomist and human factors specialist. He is the editor-in-chief of EUROCONTROL HindSight magazine. He runs the excellent Humanistic Systems blog.
Shorrock tweets as @StevenShorrock.
Vaughan is a sociology researcher who did a famous study of the NASA Challenger accident.
- normalization of deviance
Woods has a research background in cognitive systems engineering and did work researching NASA accidents. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.
Woods tweets as @ddwoods2.
Woods has contributed an enormous number of concepts.
Woods uses the adaptive universe as a lens for understanding the behavior of all different kinds of systems.
All systems exist in a dynamic environment, and must adapt to change.
A successful system will need to adapt by virtue of its success.
Systems can be viewed as units of adaptive behavior (UAB) that interact. UABs exist at different scales (e.g., cell, organ, individual, group, organization).
All systems have competence envelopes, which are constrained by boundaries.
The resilience of a system is determined by how it behaves when it comes near to a boundary.
See Resilience Engineering Short Course for more details.
- Trigger
- Units of adaptive behavior
- Goals and goal conflicts
- Pressure points
- Subcycles
From The theory of graceful extensibility: basic rules that govern adaptive systems:
(Longer wording)
- Adaptive capacity is finite
- Events will produce demands that challenge boundaries on the adaptive capacity of any UAB
- Adaptive capacities are regulated to manage the risk of saturating CfM
- No UAB can have sufficient ability to regulate CfM to manage the risk of saturation alone
- Some UABs monitor and regulate the CfM of other UABs in response to changes in the risk of saturation
- Adaptive capacity is the potential for adjusting patterns of action to handle future situations, events, opportunities and disruptions
- Performance of a UAB as it approaches saturation is different from the performance of that UAB when it operates far from saturation
- All UABs are local
- There are bounds on the perspective of any UAB, but these limits are overcome by shifts and contrasts over multiple perspectives.
- Reflective systems risk mis-calibration
(Shorter wording)
- Boundaries are universal
- Surprise occurs, continuously
- Risk of saturation is monitored and regulated
- Synchronization across multiple units of adaptive behavior in a network is necessary
- Risk of saturation can be shared
- Pressure changes what is sacrificed when
- Pressure for optimality undermines graceful extensibility
- All adaptive units are local
- Perspective contrast overcomes bounds
- Mis-calibration is the norm
Many of these are mentioned in Woods's short course.
- the adaptive universe
- unit of adaptive behavior (UAB), adaptive unit
- adaptive capacity
- continuous adaptation
- graceful extensibility
- sustained adaptability
- Tangled, layered networks (TLN)
- competence envelope
- adaptive cycles/histories
- precarious present (unease)
- resilient future
- tradeoffs, five fundamental
- florescence: the degree that changes in one area tend to recruit or open up beneficial changes in many other aspects of the network - which opens new opportunities across the network ...
- reverberation
- adaptive stalls
- borderlands
- anticipate
- synchronize
- proactive learning
- initiative
- reciprocity
- SNAFUs
- robustness
- surprise
- dynamic fault management
- software systems as "team players"
- multi-scale
- brittleness
- decompensation
- working at cross-purposes
- proactive learning vs getting stuck
- oversimplification
- fixation
- fluency law, veil of fluency
- capacity for maneuver (CfM)
- crunches
- sharp end, blunt end
- adaptive landscapes
- law of stretched systems: Every system is continuously stretched to operate at capacity.
- cascades
- adapt how to adapt
- unit working hard to stay in control
- you can monitor how hard you're working to stay in control (monitor risk of saturation)
- reality trumps algorithms
- stand down
- time matters
- Properties of resilient organizations
- Tangible experience with surprise
- uneasy about the precarious present
- push initiative down
- reciprocity
- align goals across multiple units
- goal conflicts, goal interactions (follow them!)
- to understand system, must study it under load
- adaptive races are unstable
- adaptive traps
- roles, nesting of
- hidden interdependencies
- net adaptive value
- matching tempos
- tilt toward florescence
- linear simplification
- common ground
- problem detection
- joint cognitive systems
- automation as a "team player"
- "new look"
- Resilience Engineering: Concepts and Precepts
- Resilience is a verb
- Four concepts for resilience and the implications for the future of resilience engineering
- How adaptive systems fail
- Resilience and the ability to anticipate
- Distancing through differencing: An obstacle to organizational learning following accidents
- Essential characteristics of resilience
- Learning from Automation Surprises and "Going Sour" Accidents: Progress on Human-Centered Automation
- Behind Human Error
- Joint Cognitive Systems: Patterns in Cognitive Systems Engineering
- Patterns in Cooperative Cognition
- Origins of cognitive systems engineering
- Incidents - markers of resilience or brittleness?
- The alarm problem and directed attention in dynamic fault management
- Can We Trust Best Practices? Six Cognitive Challenges of Evidence-Based Approaches
- Operating at the Sharp End: The Complexity of Human Error
- The theory of graceful extensibility: basic rules that govern adaptive systems
- Simon's Slice: Five Fundamental Tradeoffs that Bound the Performance of Human Work Systems
- Anticipating the effects of technological change: A new era of dynamics for human factors
- Common Ground and Coordination in Joint Activity
- Resilience as Graceful Extensibility to Overcome Brittleness
- Resilience Engineering: Redefining the Culture of Safety and Risk Management
- Problem detection
- Cognitive consequences of clumsy automation on high workload, high consequence human performance
- Implications of automation surprises in aviation for the future of total intravenous anesthesia (TIVA)
- Ten challenges for making automation a team player
- The Messy Details: Insights From the Study of Technical Work in Healthcare
- Nosocomial automation: technology-induced complexity and human performance
- Human-centered software agents: Lessons from clumsy automation
- STELLA: Report from the SNAFUcatchers Workshop on Coping with Complexity
- The New Look at Error, Safety, and Failure: A Primer for Health Care
- Grounding explanations in evolving, diagnostic situations
- A Tale of Two Stories: Contrasting Views of Patient Safety
Wreathall is an expert in human performance in safety. He works at the WreathWood Group, a risk and safety studies consultancy.
Wreathall tweets as @wreathall.