Statistics 101 Part 1

Just a little while ago, I asked in Froggy Bottom if there was interest in a series on Statistics. There was some! :-).  So, here’s the first entry.

All of these were originally on Daily Kos.

This series is not for statistical experts; it is for those who want to be able to understand some basic statistics without a lot of heavy-duty math.  I’ll try to emphasize aspects I think will be of interest to Kossacks, including how to tell when someone is misleading you with statistics. I welcome comments, suggestions, and thoughts both from people who are reading this as an intro to statistics and from the more statistically literate.

In today’s diary, I will discuss measures of central tendency.  See you after the fold.

There are various ways to classify variables.  One useful way is to distinguish between continuous and categorical data.  Data are continuous if they can (at least in theory) take on any number in some range.  Data are categorical if they can take on only certain values, which are often labels rather than numbers.  For example, weight, income, age and IQ are continuous.  Political party, hair color, and marital status are categorical.

When you have continuous data, two things that you often want to know are “What values are likely?”  and “How spread out are the values?”  Today, we will look at the first question, which, in statistician’s language, is called central tendency.  The most common measure of central tendency is the mean, which is often called the average.  The other commonly quoted measure of central tendency is the median.  We’ll look at those two and a couple others.

The mean is probably familiar.  Add up the numbers, divide by how many numbers there are, and you’ve got it.  So, for example, if the IQs of the people in your family are

155  (that would be you)
135   (your sister)
and
70   (her wingnut husband)
then the average is (155 + 135 + 70)/ 3 = 120
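If you like to check this sort of thing on a computer, here is a minimal Python sketch of the same arithmetic. The variable names are just mine, picked for illustration.

```python
# IQs from the example: you, your sister, her husband
iqs = [155, 135, 70]

# Mean: add up the numbers, then divide by how many there are
mean_iq = sum(iqs) / len(iqs)
print(mean_iq)  # 120.0
```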

The median is the number that splits the data into two equal halves, with half being higher, and half lower (there are slightly more technical definitions, but this will do for our purposes).  In the family above, the median IQ is 135: one person is higher and one is lower.
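Here is a small Python sketch of the median, using the standard library's `statistics.median`. The fourth IQ of 110 in the second example is made up, just to show what happens when there is an even number of values and no single middle one.

```python
from statistics import median

iqs = [155, 135, 70]
median_iq = median(iqs)
print(median_iq)  # 135 (one value above it, one below)

# With an even count there is no single middle value,
# so the usual convention is to average the two middle ones.
# The 110 here is a made-up extra family member.
median_of_four = median([155, 135, 110, 70])
print(median_of_four)  # 122.5
```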

Two other, less commonly used measures are the mode and the trimmed mean.  The mode is the most common value, and the trimmed mean is the mean after you throw out some extreme values (typically the highest 10% and the lowest 10%).  
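A quick sketch of both measures in Python follows. The data are made up, and `trimmed_mean` is a helper I wrote for illustration (it is not a standard library function); it drops the lowest 10% and highest 10% of the values by default.

```python
from statistics import mean, mode

def trimmed_mean(values, proportion=0.10):
    """Mean after dropping the lowest and highest `proportion` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # how many values to drop from each end
    return mean(ordered[k:len(ordered) - k])

# Made-up data: ten values, one of them wildly large
data = [2, 3, 3, 3, 4, 4, 5, 5, 6, 100]

print(mode(data))          # 3 (the most common value)
print(mean(data))          # 13.5
print(trimmed_mean(data))  # 4.125 (the 2 and the 100 are thrown out)
```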

When do you want each?  When do you want to use none of them?

There are some situations where no measure works well.  The most common is when the data are multimodal.  That means that the data have several clusters of common values that are separated by uncommon values.  For example, if you had a bunch of athletes from different sports (basketball players, football players, and jockeys), and were interested in their weights, then no measure of central tendency would be good.
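To make that concrete, here is a small sketch with invented weights (in pounds) for a few athletes. None of these numbers are real data; I just picked values that look roughly like three separate clusters.

```python
from statistics import mean, median

# Made-up weights in pounds, for illustration only
jockeys = [108, 110, 112, 115, 118]
basketball_players = [215, 225]
football_players = [300, 310, 320]

weights = jockeys + basketball_players + football_players

mean_weight = mean(weights)
median_weight = median(weights)
print(mean_weight)    # 193.3
print(median_weight)  # 166.5

# How many athletes are within 20 pounds of each "typical" value?
near_mean = [w for w in weights if abs(w - mean_weight) <= 20]
near_median = [w for w in weights if abs(w - median_weight) <= 20]
print(len(near_mean), len(near_median))  # 0 0
```

In this made-up group, neither the mean nor the median describes a single actual athlete, which is the trouble with summarizing multimodal data by one number.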

But, more often, you want some measure of central tendency, and have to decide which one.

The mean is a bad choice if the data are skewed, which means that there are some extreme values on one side but not on the other.  One common example of this is income.  Some people make a whole lot more than the average person, but no one makes that much less.  For instance, if the average income in the USA is $30,000 per year (I made that up) then there are some people who make millions more than that, but even the poorest people make at most $30,000 less.  The extreme values pull the mean toward them, but they barely move the median.  When the data are skewed, the median and the trimmed mean are good choices.  (You don’t see the trimmed mean much, but it can be very useful).
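Here is a tiny sketch of what skew does, using eight invented incomes, one of them huge. The numbers are made up purely for illustration.

```python
from statistics import mean, median

# Made-up yearly incomes in dollars: seven ordinary ones and one very large one
incomes = [20_000, 25_000, 28_000, 30_000, 32_000, 35_000, 40_000, 5_000_000]

mean_income = mean(incomes)
median_income = median(incomes)
print(mean_income)    # 651250.0
print(median_income)  # 31000.0
```

Seven of the eight people make $40,000 or less, yet the mean is over $650,000. The median stays close to what most of them actually earn.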

The mode is sometimes also a good choice.  Suppose, for example, you are reporting on a country where nearly everyone is a peasant making almost nothing, and there are a few multibillionaires making a lot, and a few more people in the middle, like this:

Income                          Number of people
$100 per year                          1,000,000
$1,000 to $100,000 per year               10,000
More than $100,000 per year                  500

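Then the mean would be distorted by the few people making huge amounts.  The mode would be $100 per year, and that would be a good representation of the income.  (With these particular numbers the median is also $100, because well over half the people are in the bottom group; the median would only drift upward if the middle group were much larger.)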
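To see how the three measures behave on a table like this one, here is a rough Python sketch. The table only gives ranges for the top two groups, so the representative incomes of $50,000 and $1 billion are my assumptions, not data, and `grouped_median` is a helper written just for this example.

```python
# (representative income, number of people)
groups = [
    (100, 1_000_000),
    (50_000, 10_000),        # ASSUMED stand-in for the middle group
    (1_000_000_000, 500),    # ASSUMED stand-in for the richest group
]

total_people = sum(count for _, count in groups)

# Mean: total income divided by total number of people
mean_income = sum(income * count for income, count in groups) / total_people

# Mode: the income value shared by the most people
mode_income = max(groups, key=lambda g: g[1])[0]

def grouped_median(groups):
    """Walk up the incomes in order until half the population is covered."""
    half = sum(count for _, count in groups) / 2
    running = 0
    for income, count in sorted(groups):
        running += count
        if running >= half:
            return income

median_income = grouped_median(groups)

print(round(mean_income))  # roughly 495,000 under these assumptions
print(mode_income)         # 100
print(median_income)       # 100
```

Under these assumptions the mean comes out near half a million dollars a year, which describes nobody in the country. Note that with counts this lopsided the median also lands in the $100 group; it would only move up if the middle group held more than half the people.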

Another thing that often goes wrong with the mean is averaging things that can’t be averaged.  The most common case is averaging percentages.  This is a bad idea.  I can get into why if people ask, but this diary is already getting very long, so I will stop here and wait for questions, comments and so on.

OK, people have asked for an explanation of why averaging percentages is bad, so here is one (with made-up data). Suppose the vote in some political race is as follows:

State            Democrat    Republican
Calif               60%          40%
NY                  65%          35%
South Dakota        35%          65%
Alaska              40%          60%
(other states’ data too)

If one averages the percentages, one gets 50% each, but that isn’t right, because the states have very different numbers of voters. A percentage is a form of fraction, and you have to add the numerators and the denominators and then form a new percentage. That is, add up the NUMBER voting Dem and Repub. and then get the percentage from the total.
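Here is a sketch of the difference in Python. The percentages match the made-up example, but the vote counts are my own invention, chosen only so that the big states are much bigger than the small ones.

```python
# state: (Democratic votes, Republican votes); the counts are ASSUMED for illustration
votes = {
    "Calif":        (6_000_000, 4_000_000),   # 60% / 40%
    "NY":           (4_550_000, 2_450_000),   # 65% / 35%
    "South Dakota": (140_000, 260_000),       # 35% / 65%
    "Alaska":       (120_000, 180_000),       # 40% / 60%
}

# Wrong way: average the four state percentages
state_pcts = [100 * dem / (dem + rep) for dem, rep in votes.values()]
naive_pct = sum(state_pcts) / len(state_pcts)

# Right way: add up the numerators and denominators, then form one percentage
total_dem = sum(dem for dem, _ in votes.values())
total_all = sum(dem + rep for dem, rep in votes.values())
pooled_pct = 100 * total_dem / total_all

print(naive_pct)             # 50.0
print(round(pooled_pct, 1))  # 61.1
```

With these invented counts, the Democrat actually has about 61% of the votes cast, not 50%, because the two big states count for far more voters than the two small ones.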

Author: plf515

I am a statistician for a nonprofit research company and an independent consultant. Also an expert on nonverbal learning disabilities.