Measures of Spread: Statistics #4

Here we discuss measures of spread, or dispersion, which we use to understand how well medians and means represent the data, and how reliable our conclusions are. They can help us understand test scores and income inequality, spot stock bubbles, and plan gambling junkets. They're pretty useful, and now you're going to know how to calculate them.

Here we’re heading to the data on both sides of the middle, what statisticians call “measures of spread”.

Statistical measures of spread or dispersion tell us how data is spread around the middle.

That lets us know how well the mean or median represents the data.

And how much we trust conclusions based on the mean and median.

And we can use them in real life. Measures of spread are all around.

From test scores.

Like when you find out you scored in the 99th percentile on the LSAT!

Economists use measures of spread to study income inequality.

Investors use them to try to identify price bubbles. Gamblers use them to try to figure out how much they might win or lose. Pollsters use measures of spread to help calculate margins of error.

They come up in real life. And, heads up, there is some math coming your way. We’re not going to spend a lot of this series doing calculations, but for this one it’s important.

INTRODUCTION:

Let’s do a thought experiment to compare measures of spread. We’ll talk about YouTube viewers and their ages. You’re a YouTuber, with big dreams and amazing content, but as a growing channel, you need to know more about your audience. YouTube will give you some information about this, usually in the form of a fancy chart. One of the pieces of information you could calculate is the range of your audience’s ages.

Range:

“Range takes the largest number in our data set and subtracts the smallest number in the set to give us the distance between these two extremes”.

The larger the distance, the more “spread out” our data is.

With the range we’re able to quantify the distance between our most extreme points. We can often sense that groups are different, and our ranges confirm it. If we looked at the range of your audience’s age, we’d get a better idea of the full spectrum of the people who watch your content.
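Here is a minimal Python sketch of the range calculation. It borrows the ten basketball scores used later in this episode as sample data, and the function name `data_range` is just an illustrative choice.

```python
# Sample data: the ten basketball scores from the IQR example in this episode.
points = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

print(data_range(points))  # 8 - 1 = 7
```

Only the two most extreme values matter here, so changing any of the values in between leaves the range untouched.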

If you have 13 year olds watching you might want to limit the ‘adult’ content, but if you also have people over 40 watching, you may need to explain some of the slang. But the range won’t tell you about your *core* audience. These are the people who you appeal to the most.

This might be better summarized by the interquartile range (or IQR), which doesn’t consider extreme values.

Interquartile Range:

“The IQR looks at the spread of the middle 50% of your data”.

In this example, that’s the ages of your audience. The IQR will give you a better idea of who the primary group watching you is.

A lifestyle guru like Bethany Mota might have an IQR of 13-25, whereas I might guess someone like John Oliver has an IQR that’s older. Maybe in the range of 22-40. Their overall range could be similar. But the IQR gives us a better idea of their core audience. So let’s introduce some numbers so we can do some math.

Let’s say 10 basketball players have scored the following number of points in the first part of a game: 1, 3, 3, 4, 5, 6, 6, 7, 8, and 8. The median is 5.5.

That divides the data set into two halves.

To divide it further, into quarters, we find the median of each of those halves, which are 3 and 7: Q1 and Q3, respectively.

The four quartiles here are from 1-to-3, 3-to-5.5, 5.5-to-7, and 7-to-8.

The IQR is the difference between Q3 and Q1, or in this case 7 - 3, which is 4.
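The median-of-halves steps can be sketched in Python like this. The helper name `quartiles` is made up for illustration, and it assumes at least two data points. Other quartile conventions exist (some software interpolates between points), so other tools may report slightly different values for Q1 and Q3.

```python
from statistics import median

points = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # the lower half of the data
    upper = ordered[-half:]   # the upper half of the data
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(points)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3 5.5 7 4
```

When the number of data points is odd, this sketch leaves the middle value out of both halves, which is one common convention.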

If the median is closer to one of the ends of the interquartile range, it means that quartile has a smaller range.

Since each quartile has the same number of data points, a smaller range means the points in that quartile are closer to each other.

But we’re still losing a lot of information about how spread out all of the data is, since the range and the interquartile range are each calculated from just two values. There are measures of spread that include all of our data, just like the mean.

Variance:

Take the variance which can give us a better sense of how spread out the whole data set is.

Let’s take a scatterplot of all of our data points and draw a straight line across the graph at the mean then draw lines from each point straight down to the mean line. Those lines represent the deviation or difference from each point to the mean.

Now imagine a square with sides the length of the deviation line.

Add up the areas of the squares for every point, divide by the number of data points, and you get the variance.

But it turns out that if you use this same formula to calculate the variance of a sample, it would be “biased”.

That is, the sample variance would consistently be a little smaller than the real variance of the population.

We divide by the number of data points in the sample minus 1 so that the sample variance is unbiased, which makes it a better guess for the population variance.
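One way to see the bias without taking it on faith is to list every possible sample from a tiny population and average the results. The sketch below treats the five numbers from the baseball example in this episode as if they were a complete population (an assumption made purely for illustration) and draws every ordered sample of size 2 with replacement. The `ddof` parameter name is borrowed from common library usage and simply means "how much to subtract from n".

```python
from itertools import product
from statistics import mean

population = [2, 2, 5, 8, 8]  # treated here as a complete population

def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)

true_variance = variance(population, ddof=0)

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_divide_by_n = mean(variance(s, ddof=0) for s in samples)
avg_divide_by_n_minus_1 = mean(variance(s, ddof=1) for s in samples)

print(round(true_variance, 2))            # 7.2
print(round(avg_divide_by_n, 2))          # 3.6, too small on average
print(round(avg_divide_by_n_minus_1, 2))  # 7.2, matches the population
```

With samples this small the shortfall is large (half the true value). It shrinks as samples get bigger, which is why the bias is usually only "a little" in practice.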

For example, say that the Mets, the Yankees, the Angels, the Dodgers, and the Astros have 2, 2, 5, 8, and 8 wins each.

The mean number of wins for the group of teams is 5 (25/5).

To calculate the variance we take each number and subtract the mean, square this difference, then add all of these squared differences together and divide by the number of data points minus 1.

The variance of this set of baseball teams is 9+9+0+9+9 all divided by 4, which equals 9 squared wins.
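The same arithmetic can be checked step by step in Python. This is only a sketch of the calculation described above. The variable names are illustrative, and the last line cross-checks against the standard library's `statistics.variance`, which also divides by n - 1.

```python
from statistics import mean, variance

wins = {"Mets": 2, "Yankees": 2, "Angels": 5, "Dodgers": 8, "Astros": 8}

values = list(wins.values())
m = mean(values)                                    # 5
squared_deviations = [(w - m) ** 2 for w in values]
sample_variance = sum(squared_deviations) / (len(values) - 1)

print(squared_deviations)  # [9, 9, 0, 9, 9]
print(sample_variance)     # 9.0
print(variance(values) == sample_variance)  # True
```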

And, yes, I know 9 squared wins don’t mean anything but when we square our numbers, we’re also squaring our units right along with them.

Even though squared wins aren’t an understandable unit to us, the variance is still a really useful number to have because it tells us how much “variability” is in our data. In our baseball example, it tells us roughly how far each team’s win record is from the mean. We’ll see it pop up quite often once we get to inferential statistics.

For now, let’s go to the thought bubble. Professor Hooch has hired you to analyze students’ broom speeds for the Hogwarts Quidditch teams.

There are fifteen new Gryffindors, so you measure how long it takes them to fly around the field twice. And here’s the plot of the times (in seconds) that it takes each student to complete the trip.

Looks like a few of the Muggle-born students who didn’t grow up using magic brooms took a lot longer than their classmates who grew up in wizarding families. Our mean of all students is 36.47 seconds, but if we take out the Muggle-born students, the mean is down to 29.67 seconds.

Means are very easily changed by extreme values. But the median does not change as much. It only goes from 30 seconds to 29.5 seconds when we pull out the Muggle-borns. The range changes greatly, going from 46 seconds to 20, because the extreme values determine the high number in our range calculation.

The variance is also greatly affected since those slow Muggle-born students inflate our mean. If we take out those Muggle-born times, the rest of our data is quite close together, reflected in the variance of about 36 seconds squared. But once we add those times back in, the variance shoots up to about 228 seconds squared, which matches our intuition that the group is now more “spread out”.

You can see that the distance between the points and the new mean is much larger than before we put the Muggle-born times back in. These Muggle-born times change our measures of spread and center.

But that doesn’t necessarily mean these data are bad.

We need to think about whether unusual points belong in our data or not.

And we’ll talk more about unusual points, or Outliers, later in the series.

Remember that the units of variance are squared units like seconds squared for our flying broomstick times, or baseball wins squared for our baseball example. And yes, variance is valuable, but sometimes we need something with units that make just a little more sense.

Standard deviation:

“The standard deviation is the square root of the variance”.

That gives us back the units we’re comfortable with, like seconds or baseball wins. The standard deviation of our Quidditch data would be about 6 seconds without the Muggle-born data and about 15 seconds with it.
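As a quick check in Python: the baseball win totals from earlier (2, 2, 5, 8, 8) have a sample variance of 9 squared wins, so their standard deviation should be 3 wins. The Quidditch figures below use only the rounded variances quoted in this episode (about 36 and about 228 seconds squared), since the individual flying times aren't listed here.

```python
import math
from statistics import stdev

wins = [2, 2, 5, 8, 8]
print(stdev(wins))  # 3.0 wins, the square root of the sample variance 9

# Rounded variances quoted in the episode, in seconds squared.
variance_without_muggle_borns = 36
variance_with_muggle_borns = 228

print(math.sqrt(variance_without_muggle_borns))         # 6.0 seconds
print(round(math.sqrt(variance_with_muggle_borns), 1))  # 15.1 seconds
```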

You can think of the standard deviation as the average amount we expect a point to differ (or deviate) from the mean.

That means that on average, we expect students to deviate from the mean time by about 6 seconds.

When the Muggle-born students raise our mean, our standard deviation goes up as well. In part, this happens because now the other points are further from the mean since the mean became larger.

Just like the mean, the standard deviation and variance are heavily affected by unusually large or small values.

So you should still always look out for extreme values in your data and be aware of the influence they can have.
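To get a feel for that influence, here is a small Python sketch that compares the measures with and without a couple of extreme values. The times below are invented for illustration only. They are not the Quidditch data from this episode, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, median, stdev

# Hypothetical lap times in seconds, invented for illustration only.
typical_times = [28, 29, 30, 30, 31, 32]
slow_times = [55, 60]  # two made-up unusually slow flyers

def summarize(times):
    """Return the center and spread measures discussed above."""
    return {
        "mean": mean(times),
        "median": median(times),
        "range": max(times) - min(times),
        "stdev": stdev(times),
    }

without_slow = summarize(typical_times)
with_slow = summarize(typical_times + slow_times)

for name in without_slow:
    print(f"{name}: {without_slow[name]:.2f} -> {with_slow[name]:.2f}")
```

In this made-up data the median barely moves, while the mean, the range, and the standard deviation all jump, which is the same pattern described above.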

If you see someone reporting a mean in an article or on TV, you can use the standard deviation (if they’re thoughtful enough to give it to you) to get a better understanding of how well the mean represents the data.

If the mean number of murders per state in 2015 was 307 (which it was), then a standard deviation of 10 murders shows us that 307 is a pretty good guess for the number of murders in any individual state.

But if the standard deviation was 353 murders (which it was), that guess wouldn’t be nearly as accurate.

And this makes some sense, you wouldn’t expect Montana to have nearly as many murders as a heavily populated state like New York or California.

Let’s go back to our YouTube channel. So now you have a better idea of who is watching you. And you’re getting more and more viewers every day!

If you want to grow more, you realize you need to diversify your audience.

So you look at the standard deviation of the ages of your audience.

This will give you a better idea of whether your audience have similar ages, or whether you’re appealing to many age groups.

You keep adding new content and collaborating with other YouTubers to try to reach a wider audience, and it’s working!

Your standard deviation is getting larger, which means you’re attracting a more diverse (or more “spread out”) audience.

As our YouTube thought experiment showed us, the different measures of spread each give us different information about our data, but they all tell us something about how spread out the data is.

You can use measures of spread to grow your YouTube channel, and these are important for statisticians. But they are also valuable for us non-statisticians to ponder. And I’m going to go a little deep here and try not to veer into the cheesy, but here’s my big takeaway from this episode: we all have a tendency to compare ourselves to the “average”.

We compare our income to the average income. We compare our rent to the average rent. Our intelligence to average intelligence. We compare our weight to the average weight of someone our age. And on. And on.

From these “measures of spread,” I take away the idea that the “average” whatever, on its own, can be deeply misleading.

Comparing ourselves to that single statistic can give us a false sense of failure or success, depending on how the data is spread out.

So maybe stop comparing yourself to the average or, if you’re really insistent on ranking yourself against everybody else, go calculate the standard deviation too.
