# Signal, noise, and statistical process control {#SPC}
## Signal and noise
People are really, really good at finding patterns that aren't real, especially in noisy data.
Every metric has natural variation---*noise*---included as an inherent part of that process. True signals only emerge when you have properly characterized that variation. Statistical process control (SPC) charts---run charts and control charts---help you characterize and identify non-random patterns that suggest your process has changed.
In essence, SPC tools help you evaluate the stability and predictability of a process or its outcomes. Statistical theory provides the basis by which we can evaluate metrics to more confidently detect changes in the underlying process amongst the noise of natural variation. Since it is impossible to know or even measure every single variable that might influence a metric, we can use probability and statistics to evaluate how that metric naturally fluctuates over time (aka common cause variation), and construct guidelines around that fluctuation to help indicate when something in that process has changed (special cause variation).
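As a concrete sketch of this idea, the limits of common-cause variation for an individuals (XmR) chart are conventionally estimated from the average moving range rather than the sample standard deviation. All values below are invented for illustration:

```r
# Hypothetical measurements of a metric over 10 consecutive periods
y <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.1, 12.0)
center <- mean(y)
# For an individuals (XmR) chart, sigma is estimated from the average
# moving range (1.128 is the d2 constant for subgroups of size 2)
mr_bar <- mean(abs(diff(y)))
sigma_hat <- mr_bar / 1.128
ucl <- center + 3 * sigma_hat   # upper control limit
lcl <- center - 3 * sigma_hat   # lower control limit
```

Points outside `lcl`/`ucl` (or non-random patterns within them) are candidates for special cause variation; everything in between is consistent with common cause noise.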
Understanding natural, random variation in time series or sequential data is the essential point of quality assurance or process and outcome improvement efforts. It's a rookie mistake to use SPC tools to focus solely on the values themselves or their central tendency---you need to look at [*all* of the elements](#guidelines) of a run or control chart to understand what it's telling you.
For example, the graph below shows a process created using random numbers based on a pre-defined normal distribution, where the overall distribution is shown in a histogram to the right of the run chart.
<br>
```{r ggmarg, fig.height=3, echo=FALSE}
# Requires ggplot2 and ggExtra (for ggMarginal), assumed loaded in the setup chunk
set.seed(250)
df <- data.frame(x = 1:120, y = 18 + rnorm(120))
nat_var_run_plot <- ggplot(df, aes(x, y)) +
  ylim(14.75, 21.25) +
  geom_hline(aes(yintercept = 18), color = "gray", linewidth = 1) +
  xlab("Subgroup") +
  ylab("Value") +
  geom_line() +
  theme_bw()
ggMarginal(nat_var_run_plot, margins = "y", type = "histogram", binwidth = 0.5)
```
<br>
The next plot shows control limits and 1-2$\sigma$ bands for reference, which are important pieces of SPC control charts. Alongside an understanding of the metric and its measurement and operational context, we can use [guidelines](#guidelines) developed from statistics and probability to evaluate the statistical process of a metric in more detail. This ultimately helps us determine whether we should investigate the process for special cause variation.
<br>
```{r ggmarg_cc, fig.height=3, echo=FALSE}
nat_var_cc_plot <- ggplot(df, aes(x, y)) +
  ylim(14.75, 21.25) +
  geom_hline(aes(yintercept = 18), color = "gray", linewidth = 1) +
  geom_hline(aes(yintercept = 20.96), color = "red") +
  geom_hline(aes(yintercept = 15.1), color = "red") +
  geom_ribbon(aes(ymin = 18.98, ymax = 19.96), alpha = 0.2) +
  geom_ribbon(aes(ymin = 16.04, ymax = 17.02), alpha = 0.2) +
  xlab("Subgroup") +
  ylab("Value") +
  geom_line() +
  theme_bw()
ggMarginal(nat_var_cc_plot, margins = "y", type = "histogram", binwidth = 0.5)
```
<br>
Note that some of the control chart [guidelines](#guidelines) for detecting special causes suggest that special cause variation has occurred in these data. Since the dataset was generated from random numbers drawn from a known normal distribution, these signals are false positives. It's important to remember that control charts are designed to strike a balance between true and false positives, but can never entirely eliminate false signals.
<br>
## SPC tools
Run and control charts are the basic tools of SPC analysis. Other basic statistical graphs---particularly line charts and histograms---are equally important to SPC work.
Line charts help you monitor any sort of metric, process, or time series data. Run and control charts are meant to help you identify departures from a **stable** process. Each uses a set of guidelines to help you decide whether a process has changed.
In many cases, a run chart may be all you need. In *all* cases, you should [start with a line chart and histogram](#histoline). If---and only if---you need to characterize the limits of natural variation in a stable process, you can move on to using a control chart.
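A minimal sketch of that starting point, using simulated data (ggplot2 is assumed to be available, as in the chunks above):

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = 1:50, y = rnorm(50, mean = 10))

# Line chart: does the metric drift, jump, or cycle over time?
p_line <- ggplot(df, aes(x, y)) +
  geom_line() +
  theme_bw()

# Histogram: is the distribution roughly symmetric, skewed, or multimodal?
p_hist <- ggplot(df, aes(y)) +
  geom_histogram(binwidth = 0.5) +
  theme_bw()
```

Only after these two views look reasonable should you consider a run chart, and only after that a control chart.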
In addition, *never* rely on a table or year-to-date (YTD) comparisons to evaluate process performance. These approaches ignore the foundational concept of process control: that natural, "common-cause" variation is an essential part of the process, and you can't see natural variation in a table or in YTD comparisons. Tables or YTD values can supplement run or control charts, but should never be used without them.
Above all, remember that the decisions you make in constructing SPC charts and associated data points (such as YTD figures) *will* impact the interpretation of the results. Bad charts can make for bad decisions.
<br>
## Defining *stability*
It is quite common for decision-makers to want to see KPIs displayed using a control chart. However, control charts are only applicable to KPIs when the business goal is to keep that KPI stable. SPC tools are built on the fundamental assumption of a *stable* process, and as an analyst you need to be very clear about what stability means in the context of the goals and the metric itself. Because it takes time and resources to track KPIs (collecting the data, developing the dashboards, etc.), you should take time to develop them carefully and ensure that SPC tools are, in fact, appropriate.
In many cases when folks talk about "stability" they mean "constant", and think of the goal behind the KPI as trying to keep the KPI at some fixed value or achieve some fixed target value. In many cases this makes sense, and a control chart would be appropriate. However, there are times where stability could have different meanings, particularly in a changing environment, and the KPI should be defined accordingly if a control chart is to be used.
For example, perhaps some outpatient specialties are facing increasing numbers of referrals but are not getting more FTEs. With increasing patient demand and constrained hospital capacity, you would clearly not expect the situation (e.g., wait times for appointments)---and thus the data representing the process---to be constant over time. So, a KPI such as "percent of new patients seen within 2 weeks" might be a goal we care about, but we would already expect that value to decline. As such, it is not stable, and placing it in the context of a control chart would be useless. However, if we define the KPI as something like "percent of new patients seen within 2 weeks relative to what we would expect given increased demand and no expansion", we have now placed it in a stable context. Instead of asking whether the metric itself is declining, we're asking whether the system is responding as it has in the past. By defining the KPI in terms of something we would want to remain stable, we can now use a control chart to track its performance.
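One hypothetical way to operationalize such a redefined KPI is as a ratio of observed to expected performance. All numbers below are invented, and `capacity` stands in for a fixed supply of new-patient appointment slots:

```r
referrals <- c(100, 110, 125, 140)   # rising demand, period by period
seen_2wk  <- c(82, 84, 88, 86)       # new patients seen within 2 weeks
observed_pct <- seen_2wk / referrals

# Expected performance if capacity stays fixed at ~90 slots per period
capacity <- 90
expected_pct <- pmin(capacity / referrals, 1)

# A ratio near 1 means the system is responding as expected; this is the
# quantity that could reasonably be tracked on a control chart
kpi_ratio <- observed_pct / expected_pct
```

The raw percentage declines as demand grows, but the ratio stays roughly level, which is exactly the kind of stability a control chart can monitor.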
For another example, perhaps complaints about phone wait times for a call center have led to an increase in FTEs to support call demand. You would expect call center performance to improve, for example in terms of "percent of calls answered in under 2 minutes". So, what would a "stable" call center look like as it adds FTEs and the previously defined KPI improves over time? Perhaps the performance of the various teams within the call center becomes more similar (e.g., decreased variability across teams). Maybe it could be the frequency of catastrophic events (e.g., people waiting longer than X minutes, where X is very large) staying below some threshold---similar to a "downtime" KPI used to track the stability of computer systems. Or it could be something along the lines of the percent change in the previously defined KPI tracking the percent change in FTEs (though we know this relationship is non-linear). The point is, we know performance as previously defined will be improving in the call center, so tracking it on a control chart would be useless. However, we still want to track the "stability" of the process in the face of these improvements.
In both cases, it would not be appropriate to use a control chart for the previously-defined performance metrics, because we do not expect them (or necessarily want them) to be stable. However, by focusing on the process itself, we can define alternate KPIs that conform to the assumptions of a control chart.
*Stability* here means that the system is responding as we would expect to the changing environment and that the system is robust to adverse surprises from the environment. **KPIs should be specifically designed to track this.** Essentially any KPI should track whether the process or system is stable and robust, rather than focusing strictly on the outcome as defined by previous KPIs.
Make sure that metrics meant to measure stability are properly designed from the outset before you spend large amounts of resources to develop and track them.