Measures of Spread: Statistics #4
Here we discuss measures of spread, or dispersion, which we use to understand how well medians and means represent the data, and how reliable our conclusions are. They can help us understand test scores and income inequality, spot stock bubbles, and plan gambling junkets. They're pretty useful, and now you're going to know how to calculate them.
Here we’re heading to the data on both sides of that middle: what statisticians call “measures of spread”.
Statistical measures of spread or
dispersion tell us how data is spread around the middle.
That lets us know how well the mean
or median represents the data.
And how much we trust conclusions
based on the mean and median.
And how we can use them in real life. Measures of spread are all around.
They show up in test scores, like when you find out you scored in the 99th percentile on the LSAT!
Economists use measures of spread to
study income inequality.
Investors use them to try to identify price bubbles. Gamblers use them to try to figure out how much they might win or lose. Pollsters use measures of spread to help calculate margins of error.
They come up in real life. And, heads up, there is some math coming your way. We’re not going to spend a lot of this series doing calculations, but for this one it’s important.
INTRODUCTION:
Let’s do a thought experiment to compare measures of spread. We’ll talk about YouTube viewers and their ages. You’re a YouTuber, with big dreams and amazing content, but as a growing channel, you need to know more about your audience. YouTube will give you some information about this, usually in the form of a fancy chart. One of the pieces of information you could calculate is the range of your audience’s ages.
Range:
“Range takes the largest number in our data set and subtracts the
smallest number in the set to give us the distance between these two extremes”.
The larger the distance, the more “spread
out” our data is.
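As a minimal sketch of that subtraction in Python, here is the range of a short list of viewer ages. The ages are made up purely for illustration, and `data_range` is a hypothetical helper name, not something from the episode.

```python
# Hypothetical viewer ages, invented purely for illustration.
ages = [13, 16, 19, 22, 24, 27, 31, 45]

def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

age_range = data_range(ages)
print(age_range)  # 45 - 13 = 32
```

Only the youngest and oldest viewers affect the result; every age in between could change without moving the range at all.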
With the range we’re able to
quantify the distance between our most extreme points. We can often sense that
groups are different, and our ranges confirm it. If we looked at the range of
your audience’s age, we’d get a better idea of the full spectrum of the people
who watch your content.
If you have 13-year-olds watching, you might want to limit the ‘adult’ content, but if you also have people over 40 watching, you may need to explain some of the slang. But the range won’t tell you about your *core* audience. These are the people you appeal to the most.
This might be better summarized by the interquartile range (or IQR), which doesn’t consider extreme values.
Interquartile Range:
“The IQR looks at the spread of the middle 50% of your data”.
In this example, that’s the ages of your audience. The IQR will give you a better idea of who the primary group watching you is.
A lifestyle guru like Bethany Mota
might have an IQR of 13-25, whereas I might guess someone like John Oliver has
an IQR that’s older. Maybe in the range of 22-40. Their overall range could be
similar. But the IQR gives us a better idea of their core audience. So let’s introduce
some numbers so we can do some math.
Let’s say 10 basketball players have
scored the following number of points in the first part of a game: 1, 3,
3, 4, 5, 6, 6, 7, 8, and 8. The median is 5.5.
That divides the data set into two
halves.
To divide it further, into quarters,
we find the median of each of those halves. Which are 3 and 7. Q1
and Q3 respectively.
The four quartiles here are from
1-to-3, 3-to-5.5, 5.5-to-7, and 7-to-8.
The IQR is the difference between Q3 and Q1, or Q3 - Q1. In this case that’s 7 - 3, which is 4.
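The quartile steps for the ten basketball scores can be sketched in Python like this. It follows the median-of-halves approach described here; the function name `quartiles` is just an illustrative choice, and other software may use slightly different quartile conventions that give slightly different numbers.

```python
from statistics import median

points = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles(points)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3 5.5 7 4
```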
If the median is closer to one end of the interquartile range, it means the quartile on that side has a smaller range.
Since each quartile holds the same number of data points, it means the points in that quartile are packed closer to each other.
But we’re still losing a lot of information about how spread out all of the data is, since the range and the interquartile range are each calculated from only two values. There are measures of spread that include all of our data, just like the mean.
Variance:
Take the variance, which can give us a better sense of how spread out the whole data set is.
Let’s take a scatterplot of all of
our data points and draw a straight line across the graph at the mean then draw
lines from each point straight down to the mean line. Those lines represent the
deviation or difference from each point to the mean.
Now imagine a square with sides the length of the deviation line.
The total area of the squares for every point, divided by the number of data points, is the variance.
But it turns out that if you use this same formula to calculate the variance of a sample, it would be “biased”. That is, the sample variance would consistently be a little smaller than the real variance of the population.
So for a sample, we divide by the number of data points minus 1 instead, which makes the sample variance unbiased, or a better guess for the population variance.
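Written as formulas, the two versions of the variance look like this. The symbols are notation chosen here for illustration, not taken from the episode: the x_i are the data points, \mu and N are the population mean and size, and \bar{x} and n are the sample mean and size.

```latex
% Population variance: average squared deviation from the population mean
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2

% Sample variance: same sum of squares, but divided by n - 1
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
```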
For example, say that the Mets, the Yankees, the Angels, the Dodgers, and the Astros have 2, 2, 5, 8, and 8 wins, respectively.
The mean number of wins for the
group of teams is 5 (25/5).
To calculate the variance we take
each number and subtract the mean, square this difference, then add all of
these squared differences together and divide by the number of data points
minus 1.
The variance of this set of baseball
teams is 9+9+0+9+9 all divided by 4, which equals 9 squared wins.
And, yes, I know 9 squared wins
don’t mean anything but when we square our numbers, we’re also squaring our
units right along with them.
Even though squared wins aren’t an understandable unit to us, the variance is still a really useful number to have because it tells us how much “variability” is in our data. In our baseball example, it gives us a rough sense of how far each team’s win record is from the mean. We’ll see it pop up quite often once we get to inferential statistics.
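Here is the baseball calculation as a short Python sketch, following the same steps: subtract the mean, square, add up, and divide by the number of teams minus 1. The variable names are illustrative only.

```python
wins = [2, 2, 5, 8, 8]  # Mets, Yankees, Angels, Dodgers, Astros

mean_wins = sum(wins) / len(wins)

# Squared deviation of each team's wins from the mean
squared_devs = [(w - mean_wins) ** 2 for w in wins]

# Sample variance divides by n - 1
sample_variance = sum(squared_devs) / (len(wins) - 1)

# Dividing by n instead gives a smaller number
divided_by_n = sum(squared_devs) / len(wins)

print(mean_wins, squared_devs, sample_variance)
# 5.0 [9.0, 9.0, 0.0, 9.0, 9.0] 9.0
```

Dividing the same 36 squared wins by 5 instead of 4 would give 7.2, which shows how the "minus 1" nudges the estimate upward.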
For now, let’s go to the thought
bubble. Professor Hooch has hired you to analyze students’ broom speeds for the
Hogwarts Quidditch teams.
There are fifteen new Gryffindors,
so you measure how long it takes them to fly around the field twice. And here’s
the plot of the times (in seconds) that it takes each student to complete the
trip.
Looks like a few of the Muggle-born students who didn’t grow up using magic brooms took a lot longer than their classmates who grew up in wizarding families. Our mean for all students is 36.47 seconds, but if we take out the Muggle-born students, the mean is down to 29.67 seconds.
Means are very easily changed by extreme values. But the median does not change as much. It only goes from 30 seconds to 29.5 seconds when we pull out the Muggle-borns. The range changes greatly, going from 46 seconds to 20, because the extreme values determine the high number in our range calculation.
The variance is also greatly affected since those slow Muggle-born students inflate our mean. If we take out those Muggle-born times, the rest of our data is quite close together, reflected in the variance of about 36 seconds squared. But once we add those times back in, the variance shoots up to about 228 seconds squared, which matches our intuition that the group is now more “spread out”.
You can see that the distance between the points and the new mean is much larger than it was before we put the Muggle-born times back in. These Muggle-born times change our measures of spread and center.
But that doesn’t necessarily mean
these data are bad.
We need to think about whether
unusual points belong in our data or not.
And we’ll talk more about unusual points, or outliers, later in the series.
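To see the same pattern in code, here is a rough sketch using made-up lap times. The episode’s actual fifteen times aren’t listed, so these numbers are hypothetical and won’t reproduce the figures quoted above. They only illustrate how a few slow values move each measure.

```python
from statistics import mean, median, variance

# Hypothetical lap times in seconds, invented for illustration only.
typical_times = [25, 27, 28, 29, 30, 30, 31, 32]
slow_times = [55, 60, 66]  # a few unusually slow flyers
all_times = typical_times + slow_times

def summarize(times):
    """Return mean, median, range, and sample variance for a list of times."""
    return {
        "mean": mean(times),
        "median": median(times),
        "range": max(times) - min(times),
        "variance": variance(times),  # sample variance, divides by n - 1
    }

without_slow = summarize(typical_times)
with_slow = summarize(all_times)

for name in without_slow:
    print(name, without_slow[name], "->", with_slow[name])
```

In this made-up data the median barely moves (29.5 to 30), while the range jumps from 7 to 41 and the mean and variance climb as well.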
Remember that the units of variance
are squared units like seconds squared for our flying broomstick times, or baseball wins squared for our baseball
example. And yes, variance is valuable, but sometimes we need something with
units that make just a little more sense.
Standard deviation:
“The standard deviation is the square root of the variance”.
That gives us back units that we’re comfortable with: seconds or baseball wins. The standard deviation of our Quidditch data would be about 6 seconds without the Muggle-born data and about 15 seconds with it.
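The same square root can be checked on the baseball wins from earlier (2, 2, 5, 8, 8), where the variance was 9 squared wins. This is a minimal sketch with illustrative variable names; `statistics.stdev` is Python’s built-in sample standard deviation and is printed only as a cross-check.

```python
import math
from statistics import stdev

wins = [2, 2, 5, 8, 8]

mean_wins = sum(wins) / len(wins)
sample_variance = sum((w - mean_wins) ** 2 for w in wins) / (len(wins) - 1)

# Standard deviation is the square root of the variance
std_wins = math.sqrt(sample_variance)

print(sample_variance, std_wins)  # 9.0 3.0
print(stdev(wins))  # standard library cross-check
```

The result is 3 wins, back in the original units rather than squared wins.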
You can think of the standard
deviation as the average amount we expect a point to differ (or deviate) from
the mean.
Without the Muggle-born times, that means that on average, we expect students to deviate from the mean time by about 6 seconds.
When the Muggle-born students raise
our mean, our standard deviation goes up as well. In part, this happens because
now the other points are further from the mean since the mean became larger.
Just like the mean, the standard
deviation and variance are heavily affected by unusually large or small values.
So you should still always look out
for extreme values in your data and be aware of the influence they can have.
If you see someone reporting a mean in an article or on TV, you can use the standard deviation, if they’re thoughtful enough to give it to you, to get a better understanding of how well the mean represents the data.
If the mean number of murders per state in 2015 was 307 (which it was), then a standard deviation of 10 murders would show us that 307 is a pretty good guess for the number of murders in any individual state.
But if the standard deviation was 353 murders (which it was), that guess wouldn’t be nearly as accurate.
And this makes some sense, you
wouldn’t expect Montana to have nearly as many murders as a heavily populated
state like New York or California.
Let’s go back to our YouTube channel. So now you have a better idea of who is watching you. And you’re getting more and more viewers every day!
If you want to grow more, you
realize you need to diversify your audience.
So you look at the standard
deviation of the ages of your audience.
This will give you a better idea of
whether your audience have similar ages, or whether you’re appealing to many
age groups.
You keep adding new content and
collaborating with other YouTubers to try to reach a wider audience, and it’s
working!
Your standard deviation is getting
larger, which means you’re attracting a more diverse (or more “spread out”)
audience.
As our YouTube thought experiment showed us, the different measures of spread each give us different information about our data, but they all tell us something about how spread out the data is.
You can use measures of spread to grow your YouTube channel, and they are important for statisticians. But they are also valuable for us non-statisticians to ponder. And I’m going to go a little deep here and try not to veer into the cheesy, but here’s my big takeaway from this episode: we all have a tendency to compare ourselves to the “average”.
We compare our income to the average
income. We compare our rent to the average rent. Our intelligence to average
intelligence. We compare our weight to the average weight of someone our age. And
on. And on.
From these “measures of spread,” I
take away the idea that the “average” whatever, on its own, can be deeply
misleading.
Comparing ourselves to that single statistic can give us a false sense of failure or success, depending on how the data is spread out.
So maybe stop comparing yourself to the average or, if you’re really insistent on ranking yourself against everybody else, go calculate the standard deviation too.