A number in isolation doesn't really tell a story. In order to fully understand what a number means, you have to consider the process that produced it. In order to compare one number to another, they both have to be on the same scale.
In some cases, like currency conversion, it's easy to figure out a standard. In others, like temperature conversion or conversion into the metric system, a formula guides you.
When it's not all laid out for you, you can use the mean and the standard deviation to standardize scores that come from different processes. The idea is to take a set of scores and use its mean as a zero-point and its standard deviation as a unit of measure. Then you compare the deviation of each score from the mean to the standard deviation. You're asking, "How big is a particular deviation relative to (something like) an average of all the deviations?"
To do this, you divide the score's deviation by the standard deviation. In effect, you transform the score into another kind of score. The transformed score is called a standard score or a z-score.
Characteristics of Z-Scores
A z-score can be positive, negative or zero. A negative z-score represents a score that's less than the mean and a positive z-score represents a score that's greater than the mean. When the score is equal to the mean, the z-score is zero.
When you calculate the z-score for every score in the set, the mean of the z-scores is 0, and the standard deviation is 1.
After you do this for several sets of scores, you can legitimately compare a score from one set to a score from another. If the two sets have different means and different standard deviations, comparing without standardizing is like comparing apples with oranges.
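A minimal Python sketch of the idea, for readers who want to check the mean-0, standard-deviation-1 property outside Excel. The function name z_scores and the sample numbers are illustrative assumptions, and the sketch uses the population standard deviation throughout:

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Return the z-score of every value, using the population standard deviation."""
    m = mean(scores)
    sd = pstdev(scores)
    return [(x - m) / sd for x in scores]

# Illustrative numbers only; any set of scores with some spread would do.
scores = [50, 47, 52, 46, 45]
zs = z_scores(scores)

print([round(z, 2) for z in zs])
# The z-scores themselves should have a mean of 0 and a standard deviation of 1
# (up to floating-point rounding).
print(round(mean(zs), 10), round(pstdev(zs), 10))
```

Scores below the mean come out negative, scores above it come out positive, which matches the characteristics listed above.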
Here's an example of using a z-score:
Who is the greatest home run hitter of all time, Barry Bonds or Babe Ruth? One way to answer this is to look at each player's best season and compare the two. Bonds hit 73 home runs in 2001, and Ruth hit 60 in 1927. On the surface, Bonds appears to be the more productive hitter.
The year 1927 was very different from 2001, however. Baseball went through huge changes in the intervening years, and player statistics reflect those changes. A home run was harder to hit in the 1920s than in the 2000s.
Standard scores can help decide which season was better. To standardize, I took the top 50 home run hitters of 1927 and the top 50 from 2001. I calculated the mean and standard deviation of each group, and then turned Ruth's 60 and Bonds's 73 into z-scores.
The average in 1927 is 12.68 homers with a standard deviation of 10.49. The average in 2001 is 37.02 homers with a standard deviation of 9.64. Although the means differ greatly, the standard deviations are pretty close.
And the z-scores?
Ruth's is (60-12.68)/10.49 = 4.51
Bonds's is (73-37.02)/9.64 = 3.73
Just to show how times have changed, Gehrig hit 47 home runs in 1927, finishing 2nd, for a z-score of 3.27. In 2001, 47 home runs amounted to a z-score of 1.04.
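The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the helper name z_score is my own, and the means and standard deviations are the ones quoted in the text.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the mean, in standard-deviation units."""
    return (score - mean) / sd

# Group statistics quoted above for the top 50 home run hitters of each year.
MEAN_1927, SD_1927 = 12.68, 10.49
MEAN_2001, SD_2001 = 37.02, 9.64

ruth_1927 = z_score(60, MEAN_1927, SD_1927)
bonds_2001 = z_score(73, MEAN_2001, SD_2001)
gehrig_1927 = z_score(47, MEAN_1927, SD_1927)
same_47_in_2001 = z_score(47, MEAN_2001, SD_2001)

print(f"Ruth 1927:   {ruth_1927:.2f}")        # 4.51
print(f"Bonds 2001:  {bonds_2001:.2f}")       # 3.73
print(f"Gehrig 1927: {gehrig_1927:.2f}")      # 3.27
print(f"47 in 2001:  {same_47_in_2001:.2f}")  # 1.04
```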
Another example:
One practical application of z-scores is the assignment of grades to exam scores. Based on percentage scoring, instructors traditionally evaluate a score of 90 points or higher as an A, 80-89 as a B and so on. Then they average scores from several exams together to assign a course grade.
Is that fair? Just as a peso from Argentina is worth more than a peso from Chile, and a home run was harder to hit in 1927 than in 2001, is a point on one exam worth the same as a point on another? Like "peso," isn't "point" just word magic?
Indeed it is. A point on a difficult exam is, by definition, harder to come by than a point on an easy exam. Because points might not mean the same thing from one exam to another, the fairest thing to do is convert scores from each exam into z-scores before averaging them. That way, you're averaging numbers on a level playing field. For example, on an exam where the mean is 65 and the standard deviation is 12, a score of 71 results in a z-score of .5. On another exam, with a mean of 69 and a standard deviation of 14, a score of 75 is equivalent to a z-score of .429. Moral of the story: Numbers in isolation tell you very little. You have to understand the process that produces them.
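Here is a rough Python sketch of averaging z-scores across exams instead of averaging raw points. The second exam's standard deviation of 14 is an assumption: it is the value implied by a 6-point deviation producing a z-score of .429. The names z_score, exams, and course_z are illustrative.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the exam mean, in standard-deviation units."""
    return (score - mean) / sd

# Each tuple is (student's score, exam mean, exam standard deviation).
# The 14 on the second exam is an assumed value inferred from the .429 result.
exams = [(71, 65, 12), (75, 69, 14)]

z_values = [z_score(score, mean, sd) for score, mean, sd in exams]
course_z = sum(z_values) / len(z_values)

print([round(z, 3) for z in z_values])  # [0.5, 0.429]
print(round(course_z, 3))               # 0.464
```

Both scores are 6 points above their exam's mean, yet the first one counts for more because that exam's scores were less spread out.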
Excel's STANDARDIZE function calculates z-scores.
Pictured below is a set of exam scores along with their mean and standard deviation. I used AVERAGE and STDEVP to calculate the statistics.
Here are the steps for calculating the z-scores:
1. The data are in C2:C22. I selected D2 to hold the z-score from the score in C2. Ultimately I'll autofill column D and line up all the z-scores next to the corresponding exam scores.
2. From the Statistical Functions Menu, select STANDARDIZE.
3. In the Function Arguments dialog box, enter the appropriate values for the arguments.
First, I entered the cell that holds the first exam score (C2) into the X box. In the Mean box, I entered the cell that holds the mean (C24). It has to be in absolute reference format, so the entry is $C$24. You can type it that way, or you can select C24 and press F4. In the Standard_dev box, I entered the cell that holds the standard deviation, which has to be a different cell from the mean (C25, if it sits directly below the mean). This also has to be in absolute reference format, for example $C$25.
My Humble Quest To Become The World's Best Sports Handicapper
Wednesday, September 18, 2013
SA With Excel for Dummies (Notes: Other variation related functions DEVSQ and AVEDEV)
DEVSQ: Calculates the sum of the squared deviations from the mean (without dividing by N or N-1).
For the numbers 50, 47, 52, 46, and 45 that's 34.
Average Deviation: One more Excel function deals with deviations in a way other than squaring them.
The variance and standard deviation deal with negative deviations by squaring all the deviations before averaging them. What if you just ignore the minus signs instead? A deviation with its sign dropped is called the absolute value of that deviation.
AVEDEV calculates the average absolute deviation of a group of numbers. For the numbers above it's 2.4.
This statistic is less complicated than variance or standard deviation but is rarely used. Why? Statisticians can't use it as the foundation for additional statistics you meet later.
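To make the two calculations concrete, here is a small Python sketch that mimics what DEVSQ and AVEDEV do for the numbers above. The function names devsq and avedev are just borrowed from Excel for readability; they are not part of any Python library.

```python
from statistics import mean

def devsq(values):
    """Sum of squared deviations from the mean (no division by N or N-1)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

def avedev(values):
    """Average of the absolute deviations from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)

data = [50, 47, 52, 46, 45]
print(devsq(data))   # 34
print(avedev(data))  # 2.4
```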
Tuesday, September 17, 2013
SA W/ Excel For Dummies (Notes) : The missing functions STDEVIF and STDEVIFS
Here's a rule of thumb: Whenever you present a mean, provide a standard deviation. Use AVERAGE and STDEV.S or STDEV.P in tandem.
Remember that Excel offers two functions, AVERAGEIF and AVERAGEIFS, for calculating means conditionally. Two additional functions would have been helpful: STDEVIF and STDEVIFS, for calculating standard deviations conditionally.
Excel, however, doesn't provide these functions. Shown below are a couple of workarounds that enable you to calculate standard deviations conditionally. The workarounds pick out the data that meet a set of conditions, and then calculate the standard deviation of the filtered data. The data below is from a fictional psychology experiment:
Cell F19 presents the standard deviation (STDEV.S) for the trials that displayed a circle and is equivalent to =STDEVIF(Shape, "Circle", RT_msec), if this function existed.
Cell H19 (not pictured, but 66.44) shows the standard deviation for a green square and is equivalent to
=STDEVIFS(RT_msec, Color, "Green", Shape, "Square"), if that function existed.
So how to filter the data:
First a little logic
In order to proceed, you have to know about two of Excel's logic functions: IF and AND. You access them by clicking Formulas | Logical Functions.
IF takes three arguments:
A logical condition to be satisfied
The action to take if the logical condition is satisfied (that is, if the value of the logical condition is TRUE)
An optional argument that specifies the action to take if the logical condition is not satisfied (that is, if the value of the logical condition is FALSE)
AND can take up to 255 arguments. AND checks to see if all of its arguments meet each specified condition - that is, each condition is TRUE. If they all do, AND returns the value TRUE. If not, AND returns the value FALSE.
In this example, I use IF to set the value of a cell in Column F to the corresponding value in Column D if the value in the corresponding cell in Column C is "Circle". The formula in F2 is =IF(C2="Circle",D2," "). If this were a phrase, it would be: "If the value in C2 is 'Circle', set the value of this cell to the value in D2. If not, leave the cell blank." Autofilling the next 15 cells of Column F yields the filtered data pictured above.
I use AND along with IF for the cells in column H. Each one holds the value from the corresponding cell in Column D if two conditions are true:
The value in the corresponding cell in Column B is "Green"
The value in the corresponding cell in Column C is "Square"
The formula for column H is =IF(AND(B2="Green",C2="Square"),D2," ")
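The same filter-then-calculate idea can be sketched in Python. Everything here is hypothetical: the trial data are made up for illustration (they are not the worksheet data, so the results will not match F19 or H19), and stdev_if is my own helper name standing in for the STDEVIF/STDEVIFS functions that Excel lacks.

```python
from statistics import stdev

# Hypothetical trials: (color, shape, reaction time in msec).
trials = [
    ("Green", "Square", 410),
    ("Red", "Circle", 520),
    ("Green", "Circle", 480),
    ("Red", "Square", 455),
    ("Green", "Square", 530),
    ("Red", "Circle", 610),
    ("Green", "Square", 470),
    ("Green", "Circle", 505),
]

def stdev_if(rows, condition):
    """Sample standard deviation of reaction times for rows that meet the condition."""
    selected = [rt for color, shape, rt in rows if condition(color, shape)]
    return stdev(selected)

# One condition, like the IF formula in column F.
circle_sd = stdev_if(trials, lambda color, shape: shape == "Circle")

# Two conditions joined with "and", like IF(AND(...)) in column H.
green_square_sd = stdev_if(
    trials, lambda color, shape: color == "Green" and shape == "Square"
)

print(round(circle_sd, 2))
print(round(green_square_sd, 2))
```

Passing the condition as a function keeps the one-condition and two-condition cases in a single helper, which is roughly the role IF and AND play together in the worksheet.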
SA with Excel for Dummies (Notes on Variation VAR.S and VARA, Standard Deviation)
The worksheet functions VAR.S and VARA calculate the sample variance. The relationship between VAR.S and VARA is the same as the relationship between VAR.P and VARPA. VAR.S ignores cells that contain logical values and text whereas VARA doesn't.
Standard Deviation:
After you calculate the variance of a set of numbers, you have a value whose units are different from your original measurements. For example, if your original measurements are in inches, their variance is in squared inches. This is because you square the deviations before you average them.
Often it's more intuitive if you have a variation statistic that's in the same units as the original measurements. It's easy to turn variance into that kind of statistic: all you have to do is take the square root of the variance. This is called the standard deviation.
Population Standard Deviation: The standard deviation of a population is the square root of the population variance. For these measurements in inches: 50, 47, 52, 46, and 45, the population variance is 6.8 square inches. The population standard deviation is 2.61 inches.
The Excel worksheet functions STDEV.P and STDEVPA calculate the population standard deviation. STDEV.P ignores text and logical values, whereas STDEVPA takes them into account (text as 0, FALSE as 0, and TRUE as 1).
Sample Standard Deviation:
The standard deviation of a sample, an estimate of the standard deviation of a population, is the square root of the sample variance.
For these measurements, 50, 47, 52, 46, and 45, the sample variance is 8.5 square inches (VAR.S). The sample standard deviation is 2.92 inches.
Excel functions STDEV.S and STDEVA calculate the sample standard deviation. STDEVA uses text and logical values in its calculations.
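As a cross-check outside Excel, Python's standard statistics module has pstdev (population) and stdev (sample), which I'm treating here as rough counterparts of STDEV.P and STDEV.S for plain numeric data. A minimal sketch with the measurements above:

```python
from statistics import pstdev, stdev

heights = [50, 47, 52, 46, 45]  # inches

population_sd = pstdev(heights)  # square root of the population variance (6.8)
sample_sd = stdev(heights)       # square root of the sample variance (8.5)

print(round(population_sd, 2))  # 2.61
print(round(sample_sd, 2))      # 2.92
```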
Statistical Analysis with Excel For Dummies (Notes: Variation and Standard Deviation VAR.P and VARPA)
Variation: It's a kind of average of how much each number in a group differs from the group mean.
Several statistics are available for measuring variation. All of them work the same way: The larger the value of the statistic, the more the numbers differ from the mean and vice versa.
Suppose you measure the heights of a group of children and their heights are: 48, 48, 48, 48, and 48
Then you measure another group and find their heights are 50, 47, 52, 46, and 45
If you calculate the mean of each group, you'll find they're the same, 48 inches. Just looking at the numbers tells you the two groups of heights are different though.
One way to show the dissimilarity between the two groups is to examine the deviations in each one. Think of a "deviation" as the difference between a score and the mean of all the scores in a group.
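Here's what I mean: in the first group every deviation is 0, and in the second group the deviations are 2, -1, 4, -2, and -3.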
One way to proceed is to average the deviations. The average of the deviations is 0 in both sets of data though.
Averaging the deviations doesn't help you see a difference between the two groups because the average of deviations from the mean in any group of numbers is always zero.
The joker in the deck is the negative numbers. The trick is to use something from Algebra: A minus times a minus is a plus.
You multiply each deviation by itself, which gives you the squared deviations, and then you average the results.
The variance (the average of the squared deviations) in the second group is (4+1+16+4+9)/5 = 34/5 = 6.8.
The variance of the first group is 0.
So to summarize, to calculate variance:
1. Find all the deviations from the mean
2. Square the deviations
3. Add them all up and find the average
Excel's two worksheet functions, VAR.P and VARPA, calculate population variance.
Start with VAR.P using the second set of data from above.
Using the VAR.P function with the data in cells 8-12, you get the result 6.8. If the range includes blank cells, this function ignores them. It also ignores text and logical values, unlike VARPA.
VARPA takes text and logical values into consideration and includes them in its variance calculation. If a cell contains text, VARPA sees that cell as containing a zero. If a cell contains the logical value FALSE, that's also zero. If the cell contains TRUE, that's treated as 1.
It's important to note that sample variance is a little different. If your set of numbers is a sample drawn from a large population, you're probably interested in using the variance of the sample to estimate the variance of the population. The formula above for variance doesn't quite work as an estimate of the population variance. Although the sample mean works just fine as an estimate of the population mean, this doesn't hold true for variance.
The difference in calculating the sample variance is that instead of averaging the squared deviations (Step 3 above), you add them all up and divide by the number of numbers minus 1. So in the example above it would be (4+1+16+4+9)/4 = 34/4 = 8.5.
So, if these numbers 50, 47, 52, 46, and 45 are an entire population, their variance is 6.8. If they're a sample drawn from a larger population, the best estimate of that population's variance is 8.5.
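The three steps, plus the N versus N-1 difference, can be written out in a short Python sketch. The variable names are my own; the numbers are the second group of heights from the text.

```python
heights = [50, 47, 52, 46, 45]
n = len(heights)
mean = sum(heights) / n  # 48.0

# Step 1: find all the deviations from the mean.
deviations = [x - mean for x in heights]

# Step 2: square the deviations.
squared = [d ** 2 for d in deviations]

# Step 3: add them up and average (population) or divide by N-1 (sample).
population_variance = sum(squared) / n
sample_variance = sum(squared) / (n - 1)

print(sum(deviations))      # 0.0, which is why plain averaging doesn't help
print(population_variance)  # 6.8
print(sample_variance)      # 8.5
```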
Statistical Analysis With Excel For Dummies (Notes: Mode)
One more measure of central tendency is important. This one is the score that occurs most frequently in a group of scores. It's called the mode.
If two scores are tied for occurring most frequently, your set of data is bimodal.
If you have a set of scores that all occur equally, there's no mode.
Sometimes the mode is the most representative measure of central tendency. Imagine a small company that consists of 30 consultants and two high ranking officers. Each consultant has an annual salary of 40K. Each officer has an annual salary of 250K. The mean salary of this company is $53,125.
Does the mean give you a clear picture of the company's salaries? Not really. You're better off if you consider the mode, which is $40,000.
Use Excel's MODE.SNGL to find a single mode. Use MODE.MULT if there's more than one mode. If you don't know how many modes there are, use MODE.MULT. The worst that could happen is you wind up with an error message in some of the cells of the results array.
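The salary example is easy to verify in Python. This sketch uses multimode from the standard statistics module (Python 3.8 or later) as a loose analogue of MODE.MULT; the bimodal list at the end is made up for illustration.

```python
from statistics import mean, multimode

# 30 consultants at $40,000 and two officers at $250,000.
salaries = [40_000] * 30 + [250_000] * 2

print(mean(salaries))       # 53125
print(multimode(salaries))  # [40000]

# A made-up bimodal set: 1 and 2 are tied for most frequent.
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))   # [1, 2]
```

One difference worth knowing: when every value occurs equally often, multimode returns all of the values rather than reporting that there is no mode.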
Monday, September 16, 2013
Statistical Analysis with Excel Notes (Trimmean & Geometric Mean, Harmonic Mean)
TRIMMEAN: Trimming the Mean. This is Excel's way to eliminate statistical outliers from data.
How to use TRIMMEAN:
1. Type the scores into a worksheet and select a cell for the result.
For this example, I put these numbers into cells B2 through B11:
500,280,550,540,525,595,620,1052,591,618
These scores might result from a psychology experiment that measures reaction time in milliseconds. I selected B12 for the result.
2. From the Statistical Functions menu, select TRIMMEAN.
3. In the Function Arguments box, type the values for the arguments.
The data array, B2:B11, goes into the data array box.
Next I have to identify the percent of scores I want to trim. In the percent box I enter .2. This tells TRIMMEAN to eliminate the extreme 20 percent of the scores before calculating the mean. The extreme 20 percent means the highest 10 percent and the lowest 10 percent of scores.
The result is 567.375
Note: The label percent is a little misleading in the function arguments box. You have to express the percent as a decimal. So you enter .2 rather than 20 in the percent box if you want to trim the extreme 20 percent.
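For comparison, here is a minimal Python sketch of a trimmed mean applied to the same reaction times. The function name trim_mean is my own, and the rounding rule (drop percent times N points in total, rounded down to an even number, half from each end) is my understanding of how TRIMMEAN behaves rather than something stated in these notes.

```python
def trim_mean(values, percent):
    """Mean after dropping the most extreme `percent` of points, split evenly
    between the low end and the high end. `percent` is a decimal such as 0.2."""
    if not 0 <= percent < 1:
        raise ValueError("percent must be a decimal between 0 and 1")
    n = len(values)
    per_side = int(n * percent) // 2  # points removed from each end
    kept = sorted(values)[per_side:n - per_side]
    return sum(kept) / len(kept)

reaction_times = [500, 280, 550, 540, 525, 595, 620, 1052, 591, 618]

# Trimming 20 percent of 10 scores removes the lowest (280) and highest (1052).
print(trim_mean(reaction_times, 0.2))  # 567.375
```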
Geometric Mean:
Suppose you have a two-year investment that yields 25 percent the first year and 75 percent the second year. What's the average annual rate of return?
To answer that question, you might be tempted to find the mean of 25 and 75. But that misses an important point: At the end of the first year, you multiply your investment by 1.25 - you don't add 1.25 to it. At the end of the second year, you multiply the first year result by 1.75. The kind of average that fits this multiplying process is called a geometric mean.
In the example, the geometric mean is the square root of the product of the two numbers: SQRT(1.25*1.75), which is about 1.479, or an average annual return of about 47.9 percent. For three numbers, the geometric mean is the cube root of the product of the three.
The Excel worksheet function GEOMEAN calculates the geometric mean of a group of numbers.
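A quick check of the investment example in Python, using geometric_mean from the standard statistics module (Python 3.8 or later) as a stand-in for GEOMEAN. The variable names are illustrative.

```python
from statistics import geometric_mean

# Growth factors for the two years: +25 percent, then +75 percent.
growth_factors = [1.25, 1.75]

average_factor = geometric_mean(growth_factors)
average_rate = (average_factor - 1) * 100

print(round(average_factor, 3))  # 1.479
print(round(average_rate, 1))    # 47.9 (percent per year)

# Growing at the average factor for two years matches the actual two-year growth.
print(round(average_factor ** 2, 4), 1.25 * 1.75)
```

The ordinary mean of the two rates would give 50 percent, which overstates what the investment actually earned per year.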
Harmonic Mean:
Suppose you drive from your house to your job at the rate of 40 miles an hour. On the way home you drive from your job to your home at 60 miles an hour. What's the average speed for the total time you're on the road?
It's not 50 MPH, because you're on the road a different amount of time on each leg of the trip. The formula for figuring this out is 1/average = (1/40 + 1/60)/2 = 1/48
The average is 48 MPH. This kind of average is called a harmonic mean. You can calculate it for any number of values. Just put each number in the denominator of a fraction with 1 as the numerator (the reciprocal). Average the reciprocals, and then take the reciprocal of that average.
The Excel function for this is HARMEAN.
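The commute example works out the same way in Python. This sketch computes the harmonic mean by hand (my own helper, harmonic_mean_by_hand) and compares it with harmonic_mean from the standard statistics module, which I'm treating as a rough counterpart of HARMEAN.

```python
from statistics import harmonic_mean

def harmonic_mean_by_hand(values):
    """Reciprocal of the average of the reciprocals."""
    reciprocals = [1 / v for v in values]
    return 1 / (sum(reciprocals) / len(reciprocals))

speeds = [40, 60]  # MPH on each leg of an equal-distance round trip

by_hand = harmonic_mean_by_hand(speeds)
from_library = harmonic_mean(speeds)

print(round(by_hand, 6))       # 48.0
print(round(from_library, 6))  # 48.0
```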