The other day, we looked at measures of central tendency. Today, we will look at measures of spread, and tomorrow, measures of shape. One way of looking at a measure of central tendency is as your best guess of what something will be. Measures of spread tell you how good that guess is, and measures of shape tell you how you are likely to be wrong.
Statistics can be divided into two big areas: Descriptive statistics and inferential statistics. Descriptive statistics is about describing data, and inferential statistics is about making inferences from a sample to a population. Suppose, for instance, you were interested in the average income of adults in the USA. You can’t get the information on the whole population, so you take a sample. (We’ll get into ways to do this in a later diary). When you try to say things about the whole population based on your sample, that’s inferential statistics. When you are just talking about your sample, that’s descriptive statistics.
Sometimes, though, you do have the whole population. If you wanted to find the average SAT score in a class of students, you could ask everyone. Then you don’t need to infer anything.
(By the way, don’t get used to these terms being sensible. Statisticians often use familiar words in unfamiliar ways; in particular, when statisticians use the words significance, power, random, and confidence, they don’t mean exactly what they do in everyday discourse. Don’t blame me, I didn’t make up the terms.)
OK, enough background. Let’s say you’ve collected the data on whatever it is you are interested in. There are often several things you are interested in. You are interested in what a typical person is like, and for this, the measures of central tendency are good. You can think of this as ways to formalize the idea of a best guess. But you are also interested in how good that guess is. For that, you need a measure of spread. There are several popular ones. By far the most common is the standard deviation. Others are the variance, range, and the interquartile range.
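The standard deviation of a sample is found by
1) Finding the mean
2) Subtracting each value in your sample from the mean
3) Squaring each of these differences
4) Adding up the results of step 3
5) Dividing by n (many textbooks and software packages divide by n − 1 instead when the data are a sample rather than the whole population)
6) Taking the square root of step 5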
(As an aside, is there a way to type formulas here?)
For the variance, just leave out step 6 (the square root).
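For those who like to see the arithmetic, here is a minimal sketch of the six steps in Python. The numbers in `sample` are made up purely so the results come out round, and the cross-check at the end assumes you are happy with the divide-by-n version described above (Python's `statistics.pstdev` and `statistics.pvariance` divide by n, while `statistics.stdev` divides by n − 1).

```python
import math
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)                          # n is the sample size
mean = sum(sample) / n                   # step 1: find the mean
deviations = [mean - x for x in sample]  # step 2: subtract each value from the mean
squared = [d ** 2 for d in deviations]   # step 3: square each difference
total = sum(squared)                     # step 4: add up the squares
variance = total / n                     # step 5: divide by n (stop here for the variance)
sd = math.sqrt(variance)                 # step 6: take the square root

print(mean, variance, sd)  # 5.0 4.0 2.0

# Cross-check against the standard library's divide-by-n versions.
assert math.isclose(variance, statistics.pvariance(sample))
assert math.isclose(sd, statistics.pstdev(sample))
```

Swapping `total / n` for `total / (n - 1)` gives the version that most software reports by default for a sample.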
The range is just the lowest value to the highest (it’s usually given as both numbers). The interquartile range requires first dividing the data into quartiles, which essentially means putting them into order, then taking the bottom quarter, the middle (which is the same as the median), and the top quarter. The interquartile range is the range from the first quartile to the third (if you remember percentiles, then the first quartile is the same as the 25%tile and the third quartile is the 75%tile).
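As a rough illustration, the range and the interquartile range can be computed like this in Python. The `incomes` list is invented (think thousands of dollars, with one deliberately extreme value), and the `method="inclusive"` setting is just one of several conventions for locating quartiles; other conventions can give slightly different cut points on small data sets.

```python
import statistics

# Hypothetical incomes in thousands of dollars; the last value is a deliberate outlier.
incomes = sorted([20, 25, 30, 35, 40, 45, 50, 60, 500])

# The range is reported as both numbers: lowest and highest.
lowest, highest = incomes[0], incomes[-1]

# quantiles(..., n=4) returns the three cut points that split the data into fourths.
q1, median, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

print("range:", lowest, "to", highest)  # range: 20 to 500
print("quartiles:", q1, median, q3)     # quartiles: 30.0 40.0 50.0
print("IQR:", iqr)                      # IQR: 20.0
```

Note that the middle cut point is the median, as described above, and that the one extreme value shows up in the range but not in the IQR.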
Enough math. Those who want more formal definitions and examples can, of course, see Wikipedia or some such.
When is each of these good? Or bad?
Well, the standard deviation is usually good for the cases where the mean is a good measure of central tendency (see yesterday’s diary). The variance is not used much in everyday reporting; it’s mostly used for further statistical work. The range is almost always useful and easy to interpret, and the interquartile range ought to be used a lot more, because, once you understand it, it’s easy to interpret, and it gives a good sense of the spread.
Here are some examples of when the SD is better, and when the IQR or range is better. Briefly, if you think the mean is a good measure of central tendency, then usually the SD is a good measure of spread. If you use the median, then you often want the IQR and range in addition to (or even instead of) the SD. And, if there is no good measure of central tendency, there is likely to be no good measure of spread.
Some concrete examples: If you wanted to know the average IQ of Boomanites, then (presuming you could get a good sample, which I will talk about in another diary) the mean would be a good measure of central tendency, and the SD a good measure of spread. IQ is normally distributed (we’ll get to that in another diary, too; actually, there is evidence that IQ isn’t exactly normally distributed, but it’s close). OTOH, if you wanted to know about the income of people at the pond, then the median would be a good measure of central tendency, and, while the SD wouldn’t exactly be WRONG, I would want to look at IQR and range as well. Finally, if you wanted to look at the heights and weights of professional athletes (as a whole group) then no measure of CT would be really good, nor would any measure of spread, because the group is composed of people who are too different from one another.
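To make the income case a little more concrete, here is a small sketch comparing the SD and the IQR on two made-up lists that differ in a single value. Everything here is an assumption for illustration: the numbers, the helper name `spread`, and the choice of the divide-by-n SD and the "inclusive" quartile convention.

```python
import statistics

def spread(values):
    """Return (SD, IQR) for a list of numbers (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.pstdev(values), q3 - q1

# Hypothetical incomes in thousands of dollars.
typical = [20, 25, 30, 35, 40, 45, 50, 60, 70]
with_outlier = typical[:-1] + [500]  # same list, but the top earner makes 500

sd_a, iqr_a = spread(typical)
sd_b, iqr_b = spread(with_outlier)

print("typical:      SD = %.1f, IQR = %.1f" % (sd_a, iqr_a))
print("with outlier: SD = %.1f, IQR = %.1f" % (sd_b, iqr_b))
```

In this toy case the single large income inflates the SD several times over while the IQR does not move at all, which is why I would want the IQR alongside the SD for something like income.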
I will be around all day, and checking in at various times, so ask questions (if you’ve got any).
What do you think of the statistical analysis of this?
I am no expert on archeology, nor on the Bible, nor any of this but…
It seems to me the real danger is not that, somehow, by chance, a tomb with all those names is some other family – that seems very unlikely, although I am not sure why the exact numbers were chosen.
Rather, the real danger is fakery.
Thanks plf :)
The range is greatly affected by outliers, which should be taken into consideration. I would hypothesize we have a number of outliers here at the frog pond … (I really just wanted to use that terminology here … ;D )
As for formula characters, try using the character map on a PC (start, programs, accessories, system tools…)
α … alpha
β … beta
Σ and σ … sigma
Μ and μ … mu
χ² … chi square
This is fun!
Not sure that helps w/ the actual equations though.
Thanks for the tip re formulas!
plf515 – I want to thank you for your willingness to share your expertise. I find it delightful when people share their passions here – the best teachers, imo, are those who want to share what they have learned.
I am much in agreement with Alice from yesterday in that my mind gets a bit goofy and I find myself getting a bit silly. Considering my anxiety over Mr. Bush and Iran, this is actually a relief.
You wrote: One way of looking at a measure of central tendency is as your best guess of what something will be.
So statistics are used for prediction? Probability? Possibility?
You wrote: Statistics can be divided into two big areas: Descriptive statistics and inferential statistics. Descriptive statistics is about describing data, and inferential statistics is about making inferences from a sample to a population.
I do hope you will write a diary expanding on this “background” – with lots of examples of who wants this information and how you see the information is used and misused.
You wrote: You are interested in what a typical person is like, and for this, the measures of central tendency are good. You can think of this as ways to formalize the idea of a best guess. But you are also interested in how good that guess is. For that, you need a measure of spread.
I don’t understand the above. Can you give me some example?
Now, I must confess that when I read about “standard deviation” I felt my math anxiety flare (though in comparison to my anxiety over BushCo and Iran it was a mere twinge).
Where did “n” come from? What is “n?”
What is “variance?”
You wrote: The interquartile range requires first dividing the data into quartiles, which essentially means putting them into order, then taking the bottom quarter, the middle (which is the same as the median), and the top quarter. The interquartile range is the range from the first quartile to the third (if you remember percentiles, then the first quartile is the same as the 25%tile and the third quartile is the 75%tile).
Is a “quartile” a fourth? (looks like “quarter”) You describe a “bottom,” “middle,” and “top”; are there only three quartiles?
Will we be tested on this? :)
Thanks again for your time.
Thanks for asking questions!
<<<
So statistics are used for prediction? Probability? Possibility?
>>>
Certainly for prediction. I am not sure what you mean by probability or possibility.
<<<
I do hope you will write a diary expanding on this “background” – with lots of examples of who wants this information and how you see the information is used and misused.
>>>
Well…..could be almost anyone for almost anything!
If you (or others) have some areas in mind, let me know. Examples of analyses, e.g. from the newspaper, that you’d like to discuss….whatever.
<<<<
You wrote: You are interested in what a typical person is like, and for this, the measures of central tendency are good. You can think of this as ways to formalize the idea of a best guess. But you are also interested in how good that guess is. For that, you need a measure of spread.
I don’t understand the above. Can you give me some example?
>>>>
OK. Let’s say you are interested in the intelligence of the people in Congress. Let’s further suppose that you are willing to use IQ as a measure of intelligence (I know, that’s controversial, but for the sake of argument) and that you have IQs for all members of Congress.
The average tells you….well, the average :-). Any measure of central tendency tells you where a ‘typical’ congressperson is. The difference between measures is in how they define typical. But it would be one thing if the average IQ were 120 and all were within 10 points of that (i.e. the lowest was 110 and the highest 130). It would be another if the lowest were 80 and the highest 160.
If you had to GUESS at a member’s IQ, your best guess in either case would be 120. But let’s say you were making a bet that you would be within 5 points. In the first case, you’d be more willing to bet than in the second.
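If it helps, here is the same bet worked out on two invented lists of IQs. These are not real data about anyone; `narrow`, `wide`, and the helper `share_within` are made up so that both groups average exactly 120 but differ in spread.

```python
import statistics

# Two hypothetical groups with the same mean (120) but different spread.
narrow = [110, 115, 118, 120, 122, 125, 130]
wide = [80, 100, 110, 120, 130, 140, 160]

def share_within(values, guess, margin):
    """Fraction of values no more than `margin` away from `guess`."""
    hits = sum(1 for v in values if abs(v - guess) <= margin)
    return hits / len(values)

for label, group in (("narrow", narrow), ("wide", wide)):
    guess = statistics.mean(group)    # the best guess is the same in both groups
    sd = statistics.pstdev(group)     # but the spread is not
    print(label, "guess:", guess, "SD:", round(sd, 1),
          "share within 5 points:", round(share_within(group, guess, 5), 2))
```

In the narrow group 5 of the 7 values fall within 5 points of the guess; in the wide group only 1 of 7 does. Same best guess, very different odds on the bet.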
<<<
Where did “n” come from? What is “n?”
What is “variance?”
>>>>
n is the sample size: the number of people for whom you have data.
Variance is the square of the SD.
A quartile is a fourth, you’re right about that. But there are 3 points that divide the fourths. Hmmmmm…..how to explain?
If you cut a piece of paper three times, you get four pieces. That’s how it works: the three cuts are the quartiles, and the four pieces are the fourths of the data.
No tests…… :)