The Shape of Data: Distributions: Statistics #7

When collecting data to make observations about the world, it usually just isn’t possible to collect all the data. So instead of asking every single person about student loan debt, for instance, we take a sample of the population, and then use the shape of our sample to make inferences about the true underlying distribution of our data. It turns out we can learn a lot about how something occurs, even if we don’t know the underlying process that causes it. Here, we’ll also introduce the normal (or bell) curve and talk about how we can learn some really useful things from a sample’s shape - like whether an exam was particularly difficult, how often Old Faithful erupts, or whether there are two types of runners that participate in marathons.

Data visualizations and different kinds of frequency plots--like dot plots and histograms--tell us how frequently things occur in the data we actually have.

But so far in this series, the data we have talked about usually isn’t all the data that exists. If I want to know about student loan debt in America, I am definitely not going to ask over 300 million Americans.

But maybe I can find the time to ask 2,000 of them. Samples and the shapes they give us are shadows of what all the data would look like. We collect samples because we think they’ll give us a glimpse of the bigger picture.

They’ll tell us something about the shape of all the data. Because it turns out we can learn almost everything we need to know about data from its shape.

Introduction:

Picture a histogram of every single person’s height. Now imagine the bars getting thinner, thinner, and thinner as the bins get smaller and smaller.

Until they’re so thin that the outline of our histogram looks like a smooth line, since this is a distribution of continuous numbers.


And there’s an infinite number of possible heights. I am 1.67642… (and on and on) meters tall.

If we let our bars be infinitely small, we get a smooth curve, also known as the distribution of the data. A distribution represents all possible values for a set of data and how often those values occur.

Distributions can also be discrete, like the number of countries people have visited. That means they only have a limited set of values that they can take on.

These distributions look a lot more like the histograms we’re used to seeing.

Like a histogram, the distribution tells us about the shape and spread of data.

We can think of distributions as a set of instructions for a machine that generates random numbers.

Let’s say it generates the number of leaves on a tree. You may well be wondering why we’d have a tree-leaf-number generating machine.

The idea here is that EVERYTHING can generate data. It’s not just mechanical stuff. It’s leaves and animals and even people. The distribution is what specifies how the knobs and dials on our machine are set. Once the machine is set, every time there’s a new tree, the machine pops out a random number of leaves from the distribution.

It won’t be the same number each time though. That’s because it’s a random selection based on the information the knobs and dials tell us about our distribution of leaves. When we look at samples of data generated by our leaf machine, we’re trying to guess the shape of the distribution and how that machine’s knobs and dials are set.
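If you want to play with this idea yourself, here’s a minimal Python sketch of a “leaf machine.” The choice of a Poisson distribution and an average of 200 leaves per tree are just assumptions for the sake of illustration, not anything about real trees:

```python
import numpy as np

# The "knobs and dials": here, a single parameter that sets the average
# number of leaves per tree. The value 200 is purely an illustrative assumption.
average_leaves = 200

rng = np.random.default_rng(seed=7)

# Every "new tree" asks the machine for a random leaf count drawn from the
# distribution. Assuming (just for this sketch) a Poisson distribution:
leaf_counts = rng.poisson(lam=average_leaves, size=10)
print(leaf_counts)  # ten different leaf counts, all generated by the same settings
```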

But remember, samples of data are not all the data, so when we compare the shapes of two samples of data, we’re really asking whether the same distribution--the same machine settings--could have produced these two different, but sort of similar, shapes.


If you got an especially expensive electricity bill last month, you may want to look at the histogram of your average daily energy consumption this month, and the same month last year side-by-side. It’s not that realistic to expect that you consumed energy at EXACTLY the same rate this month as you did the year before. There are probably some differences.

But your question is whether there’s enough difference to conclude that your energy-consuming behaviors have changed.

When we think about data samples as being just some of the data made using a certain distribution shape, it helps us compare samples in a more meaningful way. Because we know that the samples approximate some theoretical shape, we can draw connections between the sample and the theoretical machine that generated it, which is what we really care about.

While data come in all sorts of shapes, let’s take a look at a few of the most common, starting with the normal distribution.

Normal Distribution:

We mentioned the normal distribution when we talked about the different ways to measure the center of data, since the mean, median, and mode of a normal distribution are all the same.


This tells us that the distribution is symmetric, meaning you could fold it in half and those halves would be the same, and that it’s unimodal, meaning there’s only one peak.






The shape of a normal distribution is set by two familiar statistics: the mean and standard deviation. The mean tells us where the center of the distribution is.

The standard deviation tells us how spread out or squished the normal distribution is.

Since the standard deviation is roughly the average distance between a point and the mean, the smaller it is, the closer all the data will be to the mean, and the skinnier the normal distribution will be.


Most of the data in the normal distribution--about 68%--is within 1 standard deviation of the mean on either side. 
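You can check that 68% figure with a quick simulation. This is just a sketch, assuming heights with a mean of 1.65 meters and a standard deviation of 0.07 meters--numbers picked purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical heights: mean 1.65 m, standard deviation 0.07 m (illustrative numbers).
mean, sd = 1.65, 0.07
heights = rng.normal(loc=mean, scale=sd, size=100_000)

# Fraction of simulated heights that fall within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(heights - mean) <= sd)
print(round(within_one_sd, 3))  # roughly 0.683
```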





Just like with the quartiles in a boxplot, the smaller the range that 68% of the data has to occupy, the more squished it gets. Here’s what the boxplot for normally distributed data looks like.


The two halves of our box are exactly the same because the normal distribution is symmetric. You’ve probably seen the normal distribution in a lot of different places; it sometimes gets called a bell curve.

Attributes like IQ and the number of Froot Loops you get in a box are approximately normally distributed. Normal distributions come up a lot when we look at groups of things, like the total value rolled after 10 dice rolls, or birth weights.
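You can see that bell shape emerge with a quick simulation of the dice example--a rough sketch, assuming a fair six-sided die:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Total value of 10 dice rolls, repeated 100,000 times.
totals = rng.integers(low=1, high=7, size=(100_000, 10)).sum(axis=1)

# The possible totals run from 10 to 60; counts rise toward the middle
# (around 35) and fall off on both sides -- a roughly bell-shaped histogram.
values, counts = np.unique(totals, return_counts=True)
for v, c in zip(values, counts):
    print(v, "#" * (c // 500))  # a crude text histogram
```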

We’ll talk more about why the normal distribution is so useful in the future.

As we’ve seen in this series, data isn’t always normal or symmetric; oftentimes it has some extreme values on one side, making it a little bit skewed.

In a boxplot of data from a skewed distribution, the median will not usually split the box into two even pieces.


Instead, the side with the skewed tail will tend to be stretched out, and often, we’ll see a lot of outliers on that side. When we see those features in our sample of data, it suggests that the distribution that generated our data also has some kind of skewed tail. Skew can be a useful way to compare data.

For example, teachers often look at the distribution of scores on a test to see how difficult the test was. Really difficult tests tend to generate skewed scores, with most students doing pretty poorly and a few who still ace it.

Say we flashed pictures of 20 Pokémon and asked people to name them. Here are their grades. And here’s another sample, from a test asking people to list all 195 countries. We can compare the shapes and centers of these two sets of test scores, as well as any other notable features.



First of all, these two samples look pretty similar. Both have a right skew.

Both have a pretty low center, but the second test has a more extreme skew.

Bigger skewed tails usually mean that the data--and therefore the distribution--has both a larger range and a bigger standard deviation than data with a smaller tail.

The standard deviation is higher because not only are extreme data further away from the mean, but they also drag the mean toward them, making most of the other points just a little further from the mean too.
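Here’s a tiny numerical illustration of that effect, using two made-up sets of test scores that differ only in one extreme value:

```python
import numpy as np

# Two made-up sets of test scores out of 20, purely for illustration.
symmetric = np.array([8, 9, 10, 10, 10, 11, 12])
right_skewed = np.array([8, 9, 10, 10, 10, 11, 20])  # one student aced it

for name, scores in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name,
          "mean:", round(scores.mean(), 2),
          "sd:", round(scores.std(ddof=1), 2),
          "range:", int(scores.max() - scores.min()))
# The single extreme score drags the mean upward and inflates both the
# range and the standard deviation.
```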

While the direction of the skew tells you where most of the data is--always on the opposite side of the skewed tail--the extremeness of the skew can help you mentally compare approximate measures of spread, like the range and standard deviation.

But we compare the shapes of two samples in order to ask whether the shapes of the distributions that generated them are different, or whether ONE shape could have randomly created both samples.

In terms of our machine analogy, we ask whether one machine with its knob settings could have spit out two sets of scores, one that looks like test A, and one that looks like test B. Answering that question gets complicated, but we’ll get there. Now that we’ve examined the tails, let’s look at the middle of some distributions.

Bimodal or Multimodal:

Almost all the distributions we’ve seen so far are unimodal--they only have one peak. But there are many times when data might have two or more peaks.

We call it bimodal or multimodal data. And it looks like the back of a camel, or maybe like two of our unimodal distributions pasted side by side.


And that’s probably what’s happening with the unimodal distributions--not the camel thing. Often when you see multimodal data in the world, it’s because there are two different machines, with two different distributions, generating data that’s being measured together for some reason or other.

One possible example of this is the length, in minutes, of the geyser Old Faithful’s eruptions. Most eruptions last either about 2 minutes or about 4 minutes, with few eruptions around the 3-minute mark, giving us a bimodal distribution.

It’s entirely possible that there are two different mechanisms behind the data, even though they’re being measured together.

For example, one set of conditions may lead to an eruption that’s about 2 minutes long, and another--maybe a different temperature or latency--leads to a different kind of eruption that lasts on average 4 minutes.

Since these two potentially different types of eruptions are being measured together, the data look like they come from one distribution with two bumps, but it is likely that there are two unimodal distributions being measured at the same time.
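As a rough sketch of that idea, we can simulate two “machines” and measure their output together. The 2-minute and 4-minute eruption lengths here are assumed round numbers, not real Old Faithful measurements:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Two hypothetical eruption "machines" whose output gets measured together.
# The 2-minute and 4-minute centers are rough assumptions for the sketch.
short_eruptions = rng.normal(loc=2.0, scale=0.3, size=500)
long_eruptions = rng.normal(loc=4.0, scale=0.4, size=500)
all_eruptions = np.concatenate([short_eruptions, long_eruptions])

# Binning the combined data shows two peaks -- one near 2 minutes and one
# near 4 -- even though each underlying distribution is unimodal.
counts, bin_edges = np.histogram(all_eruptions, bins=12)
print(counts)
```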


Another example that you don’t need to be a geologist to understand is race times for some marathons. While this data may look like it comes from a unimodal distribution, in reality there are two big groups of people who run a marathon: those who are competing, and those who do it to prove they can.



There’s usually one peak around the time that all the professional runners cross the finish line, and another when the amateurs do. While we don’t know for sure that bimodal data is secretly two distributions disguised as one, it is a good reason to look at things more closely.

Uniform Distribution:

Each value in a uniform distribution has the same frequency, just like each number on a die has exactly the same chance of being rolled.

When you need to decide something fairly, like which of your 6 roommates has to do the dishes tonight, or which friend to take to the Jay-Z concert, the best thing you can do is use something like a die, which has a uniform distribution.

That gives everyone an equal chance of being picked. And you can have uniform distributions with any number of outcomes. There are 20-sided dice.

When you’re in Vegas playing a round of roulette, the ball is equally likely to land in any of the 38 slots.

There’s a difference between the shape of all the data and the shape of a sample of the data.

When we talk about a uniform distribution, we’re talking about the settings of that data-generating machine; it doesn’t mean that every sample, or even most samples, of our data will have exactly the same frequency for each outcome.

It’s entirely possible that rolling a die 60 times results in a sample shaped like this:


Even if we know the theoretical distribution looks like this:



Statistics allows us to take the shape of a sample--which has some randomness and uncertainty--and make a guess about the true distribution that created that sample of data.

Statistics is all about making decisions when we’re not sure. It allows us to look at the shape of 60 dice rolls and figure out whether we believe the die is fair, whether it’s loaded, or whether we need to keep rolling.
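Here’s a minimal sketch of that situation: simulating 60 rolls of a fair die and counting how often each face shows up. Run it a few times and the counts wobble around the “ideal” 10-per-face, even though the die is fair:

```python
import numpy as np

rng = np.random.default_rng()

# Roll a fair six-sided die 60 times.
rolls = rng.integers(low=1, high=7, size=60)

# In theory each face "should" come up 10 times, but a sample this small
# will usually bounce around that value just by chance.
faces, counts = np.unique(rolls, return_counts=True)
print(dict(zip(faces.tolist(), counts.tolist())))  # e.g. {1: 8, 2: 13, 3: 9, ...}
```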

Whether it’s finding the true distribution of eruption times at Old Faithful or showing evidence that a company is discriminating based on age, gender, or race, the shape of data gives us a glimpse into the true nature of what is happening in the world.




