Control Charts in Tableau, Or How I Learned to Quit Worrying and Love XmR

by Mark Jurries II for The Data Marks

How often have you been in a situation where the new month rolls around and you see your KPI has gone up from prior month and been immediately asked “what broke” (if up is bad) or seen the celebratory dances and dreading the inevitable decline the next month (if up is good). Control charts are natural answer to this, since it shows an expected range of outcomes and plots where the value lands in there.

Usually, you’ll see limits calced using 2 or 3 standard deviations. This is appropriate if your data is normally distributed, but real life bell curves are rare, albeit beautiful, creatures. For the following, you may modify the calculations to use standard deviation if you want, but we’ll be using XmR charts. For more background, read Donald Wheeler’s “Understanding Variation” or Stacy Barr’s overview of XmR signals. We’ll be using the criteria outlined in her article to use Tableau to identify the following signals:

  1. Items outside of control
  2. Short runs (three consecutive items closer to the limits than the average)
  3. Long runs (7 consecutive points on the same side of the line)

When we say “out of control”, we mean “above the upper control limit or below the lower control limit”.  Any resemblance to 1990s submarine movies is coincidental.

We’ll be using data from Fangraphs to chart home runs from 1990 to 2017. Go ahead and drag season to columns and HR to rows, then create the following custom calculations:

Average: WINDOW_AVG(SUM(HR))

Moving Range: ABS(ZN(SUM(HR)) – LOOKUP(ZN(SUM(HR)), -1))

Average Moving Range: WINDOW_AVG([Moving Range])

UCL: WINDOW_AVG(SUM(HR))+[Average Moving Range]*2.66

LCL: WINDOW_AVG(SUM(HR))-[Average Moving Range]*2.66

We won’t worry too much about the moving range calculations right now – they’re usually used to measure the movement between points, but we won’t be graphing them today. They’re used to calculate our upper and lower limits, which is simply the average + the moving range * 2.66. Note these are table calcs, that’ll be important later on.

Finally, let’s graph this out.

  • Drag Average, UCL, and LCL to your details mark.
  • Add a reference line for average.
  • Add a reference band for Min UCL to Max UCL.

You should have something like the following:

Sweet! We can see home runs were lower in the early 90s, shot up late in the decade (something about steroids), and declined from about 2000 to 2014 with a fairly sharp increase in recent years. It’s pretty obvious what falls outside the lines, but it’s nice to have a calc to do it for us, so let’s do that:

Is Outside of Control : SUM(HR) > [UCL] OR SUM(HR) < [LCL]

Add HR to rows again, change mark type to circle, set as dual axis and sync the axes, and drag our new calc to colors.

Now we’re getting somewhere! 2017 stands out, as well it should – it was a record year for home runs! The early 90s also show as outside of control, but that hardly seems fair, since clearly things were different then. We’ll revisit that shortly. For now, we want to move on to finding our long runs. Create the following calcs:

Long Run Side of Line: IF SUM(HR) > WINDOW_AVG(SUM(HR)) THEN ‘Above’ ELSE ‘Below’ END

Long Run Length: IF LOOKUP([Long Run Side of Line],-1) = [Long Run Side of Line] THEN PREVIOUS_VALUE(0) + 1 ELSE 1 END

Is In Long Run: WINDOW_MAX([Long Run Length],0,6) >= 7

The first calc simply checks if it’s above or below average. “Long Run Length” compares the side of line to the previous season, adding 1 to the value if there’s a match and resetting to 1 otherwise. So 1990 would start at 1, then 1991 would say “I’m below average, and hey, so is 1990 and it has a 1, so I’m going to return 1+1”.  Finally, we want to identify all items in the run, so we use WINDOW_MAX to lookup the highest run length in the next 6 seasons and see if any are above 7. Remember, the current season counts as “0”, so we want to look at 0, 1, 2, 3, 4, 5, and 6. This will highlight the start of the run and any items in it, and since it only looks 6 years ahead we don’t have to worry about it picking up another run. Let’s put “Is In Long Run” on colors and see what happens:

Well, that certainly picks up the steroid era. Notice that 2017 isn’t highlighted – it’s out of control, but we’re not looking for that here and it’s not part of a long run.

Finally, we can work out the short runs:

Short Run Side of Line: IF SUM(HR) > (WINDOW_AVG(SUM(HR))+[UCL])/2 THEN ‘Above’
ELSEIF SUM(HR) < (WINDOW_AVG(SUM(HR))+[LCL])/2 THEN 'Below'
ELSE ‘In Range’ END

Short Run Length: IF LOOKUP([Short Run Side of Line],-1) = [Short Run Side of Line]
AND [Short Run Side of Line] != ‘In Range’
THEN PREVIOUS_VALUE(0) + 1 ELSE 1 END

WINDOW_MAX([Short Run Length],0,2) >= 3

The logic here is similar to long runs, but we had to make some adjustments to see if the total home runs count was closer to the upper or lower limits than the average, which we did by averaging the numbers. We also added an “in range” option to the if statement, which we look at when determining the short run length so effectively, any item in range will have a run length of 1. We really don’t care if there are 3 items in a row in range. Let’s put this in colors:

1999 to 2001 was the height of the McGuire/Sosa/Bonds home run drive. If you don’t know anything about baseball, first, congratulations on making it this far, and secondly, all three men hit a lot of home runs, likely with some chemical assistance. Our chart is doing a pretty good job of picking up things like this. Finally, we want to highlight all three possible signals.

Highlight: IF [Is Outside of Control] THEN ‘Out of Control’
ELSEIF [Is In Short Run] THEN ‘Short Run’
ELSEIF [Is In Long Run] THEN ‘Long Run’
ELSE ‘Normal Range’ END

Ah, good. That’s useful. Now we can split things up into more distinct eras based on when we see changed. We see the long run begin in 1998, which is when the Diamondbacks and Devil Rays (now just Rays) were added. More players = more home runs. So we’ll set a block there. We also see a change in 2007, so we’ll update that as well:

Era: IF [Season] < 1998 THEN '1. Pre Expansion'
ELSEIF [Season] < 2007 THEN '2. Steroid Era'
ELSE ‘3. Current Era’ END

Drag this to columns, scratch your head, it’s showing the same band for all three eras. What’s up with that?

We have our eras defined, but need to update our table calcs. They’re still using table – across, and thus are calculating across the whole table instead of the pane like we want. Right click average, LCL, UCL, and Highlight and tell it to calculate using pane – across instead.

Much better. Now we see that home runs begin to sneak up already pre-expansion, stayed inflated for several years, and dropped again but are rebounding to levels that would be considered high in the steroid age. There are multiple theories about why this is, from a change in the ball, hitters trying to elevate the ball more, and steroids (again). You’ll also notice that we have fewer points highlighted – segmenting the eras correctly shows this more cleanly. We’ll add teams into the final dashboard so we can see how teams perform compared to their own standards.

Leave a Reply

Your email address will not be published. Required fields are marked *