Plots, Outliers: Data Visualization Part 2: Statistics #6
INTRODUCTION:
Dot plot:
A dot plot takes a histogram and replaces the solid bars, which use their height to show the frequency, with dots.
There’s one dot for each data point contained in the bar, so we can just count the number of dots to find out how many there are.
The dot plot for our olive oil data looks like this, unsurprisingly similar to the histogram for that data.
This gives us a nice way to explore the general shape of our data, but we still lose information about the individual data values, just like with the histogram.
Occasionally we WANT that extra information.
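If you want to try the dot-counting idea yourself, here is a minimal sketch in Python. The bottle sizes are made up for illustration (they are not the olive oil data from the video), and the helper name `dot_plot_rows` and the 5 oz bin width are assumptions chosen to match the ranges described in this post. It assumes whole-number values.

```python
from collections import Counter

def dot_plot_rows(values, bin_width=5):
    """Return one text row per bin: a range label, then one dot per value."""
    # Map each value to the start of its bin, e.g. 13 -> 10 when bin_width is 5.
    counts = Counter((v // bin_width) * bin_width for v in values)
    rows = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start:>3}-{start + bin_width - 1:<3}"
        rows.append(f"{label}| " + "." * counts[start])
    return rows

# Hypothetical bottle sizes in ounces, for illustration only.
sizes = [3, 8, 12, 13, 13, 14, 16, 17, 21, 24, 24, 33]
for row in dot_plot_rows(sizes):
    print(row)
```

Each row shows how many values landed in a bin, but not which values they were, which is exactly the information a dot plot throws away.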
Stem and Leaf plot:
A stem and leaf plot is a cousin of the dot plot. It also gives us information about data and their frequencies by stacking objects on top of each other.
However, stem and leaf plots use values from the raw data instead of dots.
So, we’ll turn our olive oil dot plot into a stem and leaf plot.
And no, I’m not going to explain my olive oil fixation. First, we need to split each data value into a stem, and a leaf. Stems are related to the “bins” or bars in a histogram or dot plot.
Take our dot plot, for example: each stack of dots might represent a range of 5 oz, from 0-4 oz, 5-9 oz, all the way up to a bar with all the data in the 80-84 oz range.
The stem for a “bin” of data is the digits that *all* the values in a “bin” have in common. For the 10-14 oz range, each value has a 1 at the beginning of the number so the stem is ‘1’. For the 80-84 oz range, the data all have an “8” at the beginning, so the stem would be ‘8’.
We can have larger stems too! If the data went all the way up to 2,006 oz, we could have a stem of “2-0-0”, but that’s probably too much for our olive oil example.
Now that we have all of our stems, we can add the leaves. Each stem, like in a real plant, can have multiple leaves. They’re stacked on top of each other so that the height of the stack shows you how frequently data appear in that bin, just like a dot plot. The actual “leaf” is the rest of the digits that are not in the “stem”.
If one of our data points is 13, and the “stem” for that range is 1, that takes care of the “1”, so the leaf is “3”.
Leaves appear in numerical order, from the stem out, so leaves that are smaller digits are closer to the stem. From a distance, stem and leaf plots look a lot like a dot plot or histogram. If you squint your eyes, the leaves almost look like bars or dots, but un-squinting them will allow you to see even more information than a histogram or dot plot will tell you.
You get to see what the individual values are and *how* they’re spread out within a bar.
Stem and leaf plots are usually flipped on their sides so that the stems are listed vertically, and the leaves extend horizontally.
Here’s a stem and leaf plot of the number of pieces of gum each of your extended family members has chewed in the last month.
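The stem-and-leaf split described above can be sketched in a few lines of Python. This is a simplified version under stated assumptions: the values are made up (they are not the gum or olive oil data), they are non-negative whole numbers, the leaf is always the last digit, and each stem gets a single row, so a row covers a range of 10 rather than 5. The function name `stem_and_leaf` is just a label for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    # Sorting first means leaves are appended in numerical order, from the stem out.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # 13 -> stem 1, leaf 3; 2006 -> stem 200, leaf 6
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical values, for illustration only.
values = [13, 10, 14, 21, 27, 27, 35, 80, 84, 2006]
for stem, leaves in stem_and_leaf(values).items():
    # Stems are listed vertically; leaves extend horizontally.
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike the dot plot, every original value can be read back off the printout by gluing a leaf onto its stem.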
Boxplots:
Boxplots use some of our measures of central tendency and spread to visually display our data.
A boxplot is also called a “box-and-whiskers plot.” It has two major parts: the box and the whiskers. The box is a rectangle that stretches across the interquartile range of our data (from Q1 to Q3).
At the median, there is a line splitting the rectangle into halves. If one of those halves is larger than the other, that quartile is more spread out.
Since each quartile has the same number of data points, the smaller the quartile, the less spread out that portion of the data is. Imagine the difference between fitting 20 clowns in a car and fitting 20 clowns in a regulation sized football field.
The same number of “clowns”, and more space to make balloon animals. Attached to either end of this box are the whiskers, which help show the minimum and maximum of all the data, as long as those values are within one and a half times the interquartile range of the edges of the box (Q1 and Q3). This distance sets our “fences.”
We use one and a half times the interquartile range because most of the data will be within this range, especially if your data is normally distributed.
Most of the data will be inside the fences; any data outside is flagged as a potential “outlier”.
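Here is a rough sketch of the fence calculation in Python. It uses the common convention of placing the fences 1.5 times the interquartile range below Q1 and above Q3. The rent values are invented for illustration, the function names are hypothetical, and `statistics.quantiles` uses one of several quartile conventions, so other tools may give slightly different fences for the same data.

```python
import statistics

def fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond Q1 and Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_potential_outliers(values, k=1.5):
    """Return the values outside the fences, in their original order."""
    low, high = fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical monthly rents, for illustration only.
rents = [700, 850, 900, 950, 1000, 1050, 1100, 1200, 1300, 4000]
print(fences(rents))
print(flag_potential_outliers(rents))
```

Note that the code only flags values. Whether a flagged value is a mistake or a rare-but-real data point is a judgment the fences cannot make for you.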
It can be tempting to think of outliers as data that’s “wrong” somehow, but that’s not always the case. Values outside the fences are less likely than data near the boxplot, but they’re not impossible.
For example, it’s pretty unlikely that if you dial random numbers into your phone you’ll call a Domino’s Pizza, but it is possible. Rare values do happen.
Keeping these rare-but-possible values can be important.
When the local news shows you a boxplot of local rents and decides that the bottom 1000 rent values are “outliers”, the graph they display could be misleading. Those rents are real values that you could expect.
Taking them out will make your visualization less informative and might lead you to think that the average rent is higher than it actually is.
However, some values that are flagged as “outliers” may not be expected in your data at all.
The problem is you may not always know the difference between a point that’s valid but rare and one that’s a mistake.
Remember, statistics is all about uncertainty.
When you make or see a data visualization it’s important to remember that its job is to actually give you information.
If it doesn’t do that, it's not worthwhile. Now, let’s go back to frequency plots and talk about one last method for visualizing quantitative data.
Cumulative frequency plot:
Cumulative Frequency Plots are like histograms but instead of the height of a bar telling you how much data is in that specific bin, it tells you how much data is in that bin and all previous bins. That’s why it’s called “cumulative.”
It’s the frequency of all the points we’ve accumulated up to this point.
It’s like a small fish getting eaten by a bigger fish, which gets eaten by an even bigger fish, and so on. Each fish is now full of the fish it ate. And the fish that fish ate. And a side note: your odds of being killed by a shark are about one in 3.7 million.
Back to our cumulative frequency plots. These plots have their moment to shine when we want to answer a question like “How many JT songs have 160 unique words or fewer?” The cumulative frequency plot looks like this:
Here’s the bar that answers our question. We could also get this information by counting all the songs in the bars that are 160 or less on our histogram, but that’s more work.
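To see the "this bin plus all previous bins" idea in action, here is a small Python sketch. The word counts are invented (they are not the real song data), the bin width of 20 is an arbitrary choice, and `cumulative_counts` is a hypothetical helper name. It assumes whole-number values, so the bin starting at 140 covers 140 through 159.

```python
from collections import Counter
from itertools import accumulate

def cumulative_counts(values, bin_width):
    """Return (bin_start, count_in_bin, running_total) for each bin, in order."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    starts = list(range(min(counts), max(counts) + bin_width, bin_width))
    per_bin = [counts[s] for s in starts]
    # accumulate() turns per-bin counts into running totals.
    return list(zip(starts, per_bin, accumulate(per_bin)))

# Hypothetical unique-word counts per song, for illustration only.
unique_words = [95, 110, 128, 141, 150, 158, 163, 175, 190, 230]
for start, in_bin, running in cumulative_counts(unique_words, 20):
    print(f"{start:>3}-{start + 19}: {in_bin} in this bin, {running} so far")
```

With these bins, the running total on the 140-159 row is the number of songs with fewer than 160 unique words, read off a single row instead of adding up several histogram bars.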