Wednesday, September 18, 2013

SA With Excel For Dummies (Notes: Standardizing Scores STANDARDIZE)

A number in isolation doesn't really tell a story.  In order to fully understand what a number means, you have to consider the process that produced it.  In order to compare one number to another, they both have to be on the same scale.

In some cases, like currency conversion, it's easy to figure out a standard.  In others, like temperature conversion or conversion into the metric system, a formula guides you.

When it's not all laid out for you, you can use the mean and the standard deviation to standardize scores that come from different processes.  The idea is to take a set of scores and use its mean a a zero-point and its standard deviation as a unit of measure.  Then you compare the deviation of each score from the mean to the standard deviation.  You're asking, "How big is a particular deviation relative to (something like) an average of all the deviations?"

To do this, you divide the score's deviation by the standard deviation.  In effect, you transform the score inotoanother kind of score.  The transformed score is called a standard score or a Z-Score.

Characteristics of Z-Scores

A z-score can be positive, negative or zero.  A negative z-score represents a score that's less than the mean and a positive z-score represents a score that's greater than the mean.  When the score is equal to the mean, the z-score is zero.

When you calculate the z-score for every score in the set, the mean of the z-scores is 0, and the standard deviation is 1.

After you do this for several sets of scores, you can legitimately compare a score from one set to a score from another. If the two sets have different means and different standard deviations, comparing without standardizing is like comparing apples with oranges.

Here's an example of using a z-score:

Who is the greatest home run hitter of all-time, Barry Bonds or Babe Ruth?  One way to do this is look at each player's best season and compare the two.  Bonds hit 73 home runs in 2001, and Ruth hit 60 in 1927.  On the surface Bonds appears to be the more productive hitter.

The year 1927 was very different from 2001, however.  Baseball went through huge changes in the intervening years, and player statistics reflect those changes.  A home run was harder to hit in the 1920s than in the 2000s.

Standard scores can help decide which season was better.  To standardize I took the top 50 home run hitters of 1927 and the top 50 from 2001.  I calculated the mean and standard deviation of each group, and then turned Ruth's 60 and Bonds 73 into z-scores.

The average is 1927 is 12.68 homers with a standard deviation of 10.49.  The average from 2001 is 37.02 homers with a standard deviation of 9.64.  Although the means differ greatly, the standard deviations are pretty close.

And the z-scores?

Ruth's is (60-12.68)/10.49 = 4.51

Bond's is (73-37.02)/9.64 - 3.73

Just to show how times have changed, Gherig hit 47 home runs in 1927 finishing 2nd for a z-score of 3.27.  In 2001, 47 home runs amounted ot a z-score of 1.04.

Another example:

One practical application of z-scores is the assignment of grades to exam scores.  Based on percentage scoring, instructors traditionally evaluate a score of 90 points or higher as an A, 80-89 as a B and so on.  Then they average scores from several exams together to assign a course grade.

Is that fair?  Just as a peso from Argentina is worth more than a peso from Chile, and a home run was harder to hit in 1927 than in 2001, is a point on one exam worth the same as a point on another?  Like Peso, isn't that just word magic?

Indeed it is.  A point on a difficult exam is, by definition, harder to come by than a point on an easy exam.  Because points might not mean the same thing from one exam to another, the fairest thing to do is convert scores from each exam into z-scores before averaging them.  That way, you're averaging numbers on a level playing field.  For example, on an exam where the mean is 65 and the standard deviation is 12, a score of 71 results in a z-score of .5.  On another exam, with a mean of 69, a score of 75 is equivalent to a z-score of .429.  Moral of the story: Numbers in isolation tell you very little.  You have to understand the process that produces them.

Excel's STANDARDIZE function calculates z-scores.

Pictured below is a set of exam scores along with their mean and standard devation.  I used AVERAGE and STDEVP to calculate the statistics.



































Here are the steps for calculating the Z-scores:

1.  The data are in C2:C22.  I selected D2 to hold the z-score from the score in C2.  Ultimately I'll autofill column D and line up all the z-scores next to the corresponding exam scores.

2.  From the Statistical Functions Menu, select STANDARDIZE.

3.  In the Functions Arguments dialog box, enter the appropriate values for the arguments.

First I entered the cell that holds the first exam score in to the X box (D2).  In the mean box, I entered the cell that holds the mean (C24).  It has to be in absolute reference format, so the entry is $C$24.  You can type it that way, or you select C24 and press F4.  In the Standard_Dev box, I entered the cell that holds the standard deviation (C24).  This also has to be in absolute reference format $C$24.

SA With Excel for Dummies (Notes: Other variation related functions DEVSQ and AVGDEV)

DEVSQ: Calculates the sum of the squared deviations from the mean (without dividing by N or N-1).

For the numbers 50, 47, 52, 46, and 45 that's 34.

Average Deviation: One more Excel function deals with deviations in a way other than squaring them.

The variance and standard deviation deal with negative deviations by squaring all the deviations before averaging them.  How about if you just ignore the minus signs.  This is called the absolute value of each deviation.

AVEDEV calculates the average absolute deviation of a group of numbers.  For the numbers above it's 2.4

This statistic is less complicated than variance or standard deviation but is rarely used.  Why?  Statisticians can't use it as the foundation for additional statistics you meet later.


Tuesday, September 17, 2013

SA W/ Excel For Dummies (Notes) : The missing functions STDEVIF and STDEVIFS

Here's a rule of thumb:  Whenever you present a mean, provide a standard deviation.  Use AVERAGE and STDEV.S or STDEV.P in tandem.

Remember that Excel offers two functions, AVERAGEIF and AVERAGEIFS for calculating means conditionally.  Two additional functions would have been helpful STDEVIF and STDEVIFS for calculating standard deviations conditionally.

Excel however doesn't provide these functions.  Shown below are a couple workarounds that enable you to calculate standard deviations conditionally.  The workarounds filter out data that meet a set of conditions, and then calculate the standard deviation of the filtered data.  The data below is from a fictional psychology experiment:
























Cell F19 presents the standard devation (STDEV.S) for the trials that displayed a circle and is equivalent to: =STDEVIF (Shape, "Circle", RT_msec if this function existed.

Cell H 19 (not pictured but 66.44) shows the standard deviation for a green square and is equivalent to:
=STDEVIFS (RT_msec, Color "Green", Shape, "Square" if that function existed.

So how to filter the data:

First a little logic

In order to proceed you have to know about two of Excel's logic functions: IF and AND.  You access them by clicking Formulas | Logical Functions

IF takes three arguments:

A logical condition to be satisfied
The action to take if the logical condition is satisfied (that is, if the value of the logical condition is TRUE)
An optional argument that specifies the action to take if the logical condition is not satisfied (that is, if the value of the logical condition is FALSE)

AND can take up to 255 arguments.  AND checks to see if all of its arguments meet each specified condition - that is, each condition is TRUE.  If they all do, AND returns the value TRIE.  If not, AND returns the value FALSE.

In this example, I use IF to set the value of a cell in Column F to the corresponding value in Column D if the value in the corresponding cell in Column C is a circle.  The formula is F2 is = IF (C2="Circle, D2, " ").  If this were a phrase it would be, "If the value in C2 is "Circle", then set the value of this cell to the value in D2.  If not, leave the cell blank.  Autofilling the next 15 cells of Column C yields the filtered data pictured above.

I use And along with IF for the cells in column H.  Each one holds the value from the corresponding cell in Column D if two conditions are true:

The value in the corresponding cell in Column B is "Green"
The value in the corresponding cell in Column C is "Square"

The formula for column H is = IF(AND(B2="Green", C2="Square"),D2," ")


SA with Excel for Dummies (Notes on Variation VAR.S and VARA, Standard Deviation)

The worksheet functions VAR.S and VARA calculate the sample variance.  The relationship between VAR.S and VARA is the same as the relationship between VAR.P and VARPA.  VAR.S ignores cells that contain logical values and text whereas VARA doesn't.

Standard Deviation:

After you calculate the variance of a set of numbers, you have a value whose units are different from your original measurements.  For example, if your original measurements are in inches, their variance is in squared inches.  This is because you square the deviations before you average them.

Often it's more intuitive if you have a variation statistic that's in the same units as the original measurements.  It's easy to turn variance into that kind of statistic, all you have to do is take the square root of variance.  This is called standard deviation.

Population Standard Deviation: The standard deviation of a population is the square root of the population variance.  For these measurements in inches: 50, 47, 52, 46, and 45, the population variance is 6.8 square inches.  The population standard deviation is 2.61 inches.

The Excel worksheet functions STDEV.P and STDEVPA calculate the population standard deviation.  STDEV.P ignores blank cells and text whereas STDEVPA takes it into account (blank cells as 0, FALSE as 0, and TRUE as 1).

Sample Standard Deviation:

The standard deviation of a sample, an estimate of the standard deviation of a population, is the square root of the sample variance.

For these measurements 50, 47, 52, 46, and 45 the sample population variance is 8.5 square inches (VAR.S).  The sample population standard deviation is 2.92 inches

Excel Functions STDEV.S and STDEVA calculate sample standard deviation.  STDEVA uses text and logical values in it's calculations.


Statistical Analysis with Excel For Dummies (Notes: Variation and Standard Deviation VAR,P and VARPA)

Variation: It's a kind of average of how much each number in a group differs from the group mean.

Several statistics are available for measuring variation.  All of them work the same way:  The larger the value of the statistic, the more the numbers differ from the mean and vice versa.

Suppose you measure the heights of a group of children and their heights are: 48, 48, 48, 48, and 48

Then you measure another group and find their heights are 50, 47, 52, 46, and 45

If you calculate the mean of each group, you'll find they're the same, 48 inches.  Just looking at the numbers tells you the two groups of heights are different though.

One way to show the dissimilarity between the two groups is to examine the deviations in each one.  Think of a "deviation" as the difference between a score and the mean of all the scores in a group.

Here's what I mean:





















One way to proceed is to average the deviations.  The average of the deviations is 0 in both sets of data though.

Averaging the deviations doesn't help you see a difference between the two groups because the average of deviations from the mean in any group of numbers is always zero.

The joker in the deck is the negative numbers.  The trick is to use something from Algebra: A minus times a minus is a plus.

You multiply each deviation times itself and then average the results, this gives you the squared deviation.

The Variance (The average of the squared deviation) in the second group is (4+1+16+4+9) = 34/5 = 6.8.

The variance of the first group is 0.

So to summarize, to calculate variance:

1.  Find all the deviations from the mean
2.  Square the deviations
3.  Add them all up and find the average

Excel's two worksheet functions, VAR.P and VARPA calculate population variance.

Start with VAR.P using the second set of data from above.

Using VAR.P function with data in cells 8-12, you get the result 6.8.  If you include blank cells this function ignores them unlike VARPA.

VARPA takes text and logical values into consideration and includes them in its variance calculation.  If a cell contains text, VARPA sees that cell as containing a zero.  If a cell contains the logical value FALSE, that's also zero.  If the cell contains TRUE, that's considered a logical value of 1.

It's important to note that sample variance is a little different.  If your set of numbers is a sample drawn from a large population, you're probably interested in using the variance of the sample to estimate the variance of the population.  The formula above for variance doesn't quite work as an estimate of the population variance. Although the sample mean works just fine as an estimate of the population mean, this doesn't hold true for variance.

The difference in calculating the sample variance is that instead of averaging the numbers (Step 3 above), you add them all up and divide by the number of numbers minus 1.  So in the example above it would be (4+1+16+4+9)/4 = 34/4 = 8.5

So, if these numbers 50, 47, 52, 46, and 45 are an entire population, their variance is 6.4.  If they're a sample drawn from a larger population, the best estimate of that population's variance is 8.5






Statistical Analysis With Excel For Dummies (Notes: Mode)

One more measure of central tendency is important.  This one is the score that occurs most frequently in a group of scores it's called the mode.

If two scores are tied for occurring most frequently, your set of data is bimodal.

If you have a set of scores that all occur equally, there's no mode.

Sometimes the mode is the most representative measure of central tendency.  Imagine a small company that consists of 30 consultants and two high ranking officers.  Each consultant has an annual salary of 40K.  Each officer has an annual salary of 250K.  The mean salary of this company is $53,125.

Does the mean give you a clear picture of the company's salary?  No really, you're better off if you consider the mode which is $40,000.

Use Excel's MODE.SNGL to find a single mode.  Use MODE.MULT if there's more than one mode.  If you don't know how many modes there are, use MODE.MULT.  The worst that could happen is you wind up with error message in some of the cells of the results array.


Monday, September 16, 2013

Statistical Analysis with Excel Notes (Trimmean & Geometric Mean, Harmonic Mean)

Trimmean:Trimming the Mean.  This is Excels way to eliminate statistical outliers from data.

How to use TRIMMEAN:

1.  Type the scores into a worksheet and select a cell for the result.

For this example, I put these numbers into cells B2 through B11:

500,280,550,540,525,595,620,1052,591,618

These scores might result from a psychology experiment that measures reaction time in milliseconds.  I selected B12 for the result.

2.  From the Statistical Functions menu, select TRIMMEAN.

3.  In the Function Arguments box, type the values for the arguments.

The data array goes into the data array box.  B2:B11
Next I have to identify the percent of scores I want to trim.  In the percent box I enter .2.  This tells TRIMMEAN to eliminate the extreme 20 percent of the scores before calculating the mean.  The extreme 20 percent means the highest 10 percent and the lowest 10 percent of scores.

The result is 567.375

**The label percent is a little misleading in the function arguments box.  You have to express the percent as a decimal.  So you enter .2 rather than 20 in the percent box if you want to trim the extreme 20 percent.

Geometric Mean:

Suppose you have a 2 year investment that yields 25% the first year and 75 percent the second year.  What's the average annual rate of return?

To answer that question, you might be tempted to find the mean of 25 and 75.  But that misses an important point: At the end of the first year, you multiply your investment by 1.25 - you don't add 1.25 to it.  At the end of the second year, you multiply the first year result by 1.75.  This kind of average is called a geometric mean.

In the example, the geometric mean is the square root of the product of the two numbers SQRT 1.25*1.75  For three numbers, the geometric mean is the cube root of the product of the three.

The Excel worksheet function GEOMEAN calculates the geometric mean of a group of numbers.

Harmonic Mean:

Supposed you drive from your house to your job at the rate of 40 miles an hour.  On the way home you drive from your job to your home at 60 miles an hour.  What's the average speed for the total time you're on the road?

It's not 50 MPH because you're on the road a different amount of time on each leg of the trip.  The formula for figuring this out is 1/average = Average of {1/40 + 1/60} = 1/48

The average is 48.  This kind of average is called a harmonic mean.  You can calculate it for any amount of numbers.  Just put each number in the denominator of a fraction with 1 as the numerator (reciprocal).  Add all the reciprocals together and take their average.

The Excel function for this is HARMEAN



Sunday, September 15, 2013

Statistical Analysis with Excel for Dummies (Notes: Finding Your Center- AverageIF)

If you look at a group of numbers and try to find one that's somewhere in the middle, you're dealing with that group's central tendency.  Central tendency comes in several flavors.

That statistical term for an average is mean.  A mean is the sum of a bunch of numbers divided by the amount of numbers you added up.  For example the mean of reading scores in WPM that are 56, 78, 45, 49, 55, and 62.  Are the sum of those numbers divided by 6 which is 57.5.

The commonly used abbreviation for "Number" is x.  And a typical abbreviation for "Amount of numbers added up" is N.  Another abbreviation stands for Sum of.  It's the uppercase Greek Letter for S pronounced Sigma.  The abbreviation for mean is "X bar", an X with a bar over it.  Here's what the formula looks like:

















The Excel function Average calculates mean.

AverageIF includes numbers in the average if a particular condition is met.
AverageIFS includes numbers in the average if more than one condition is met.

Here's the data we'll be working with for these formulas:



































I also defined names for each of the cells.

To calculate the overall average, it's = AVG (RT_msec)

To calculate the average of trials that displayed a circle,we use the following AverageIF Formula:

=AVERAGEIF(Shape, "Circle", RT_msec)

To calculate the average of the first 8 trials, the formula is =AVERAGEIF(Trial,"<9",RT_msec)

To calculate the average of recation times fast than 400 msec:

=AVERAGEIF(RT_msec, ">400", RT_msec)

To calculate the average of trials on which Color = Green and Shape = Square is:

AVERAGEIFS(RT_msec, Color, "Green", Shape, "Square")

**Note that RT_msec is the first argument in the AVERAGEIFS but the last argument in AVERAGEIF.


Friday, September 13, 2013

Statistical Analysis With Excel For Dummies (Notes: Bar Charts & Scatter Plots)

Bar Charts:  Excel's bar chart is a column chart laid on it's side.  This is the one that reverses the horizontal-vertical convention.  Here the vertical axis hold the independent variable, and it's referred to as the x-axis.  The horizontal axis is the y-axis and tracks the dependent variable.

Bar charts are used when you want to make a point about reaching a goal or about the inequities in attaining one.

The table below shows data on children's use of the internet at home for the year 2000. Percent means the percentage of children in each income group.



























To put this data into a bar chart, follow these steps:

1.  Select the data that go into the sheet (A1 through B8)
2.  Select Insert | Recommended Charts and choose Clustered Bar.
3.  Modify the Chart.  Add chart title, axis titles, and bold the font of the axis titles and axis numbers.  The easiest way to bold the titles is to select an element and type Ctrl-B.

This is what the completed chart looks like:






















ScatterPlots:  An important statistical technique called linear regression helps to determine the relationship between on variable X, and another variable Y.

The basis of the technique is a graph that shows individuals measured on both x and y.  The graph represents each individual as a point.  Because the points seem to scatter around the graph, the graph is called a scatterplot.

Suppose you're trying to find out how well a test of aptitude for sales predicts salespeople's productivity.  You administer the test to a sample of salespersons and tabulate how much money they make in commissions over a two month period.  Each person's pair or scores (test score and commissions) locates him or her within the scatter plot.  Here's the salesperson data:
































To create the scatterplot graph, follow these steps:

1.  Select the data that go into the chart.  Select B14 through C33.  Including B13 creates the same chart but with an incorrect title.  The cells in column A are just placeholders that organize the data.

2,  /Select Insert | Recommended Charts.  Select the 1st option, Scatter chart.

3.  Modify the chart.  Add a new title and axis titles and embolden the titles.

Pictured below is the resulting graph:

























Now right-click on any point in the scatterplot and choose "Add Trendline"  Select the Linear Radio button and click the bottom two check boxes (Display Equation on Chart and Display R-Squared Value on Chart).

The resulting trendline is really called a regression line.  The meaning of the equations will be discussed later.


Another application of a scatter chart is to create something like a line chart.  The conventional line chart works when the values on the x-axis are equally spaced.  Suppose your data looks like the values below (note the x-values are not equally spaced)





















Select the data and then Insert | Recommended Charts and Excel automatically chooses "Scatter with Straight Lines and Markers"

This is the resulting chart:


Thursday, September 12, 2013

Statistical Analysis with Excel for Dummies (Notes - Pie Charts & Line Charts & Sparklines)

Pie Charts: To show the percentages that make up one total, a pie chart gets the job done efficiently.

Here's how to create a pie chart using the data below:













2.  Select the data that go into the chart.  I want the names in column A and the data in column F.  The trick is to select column A (Cells A2 through A7) in the usual way and then press and hold the Ctrl key.  While holding this key, drag the cursor from F2 through F7.  Two non adjoining columns are selected.

3.  Select Insert | Recommended Charts and pick the Pie Chart.

4.  Modify the chart.  To add data labels to the slices, select the chart (not just one slice) and right-click.  On the pop-up menu, select Add Data Labels | Add Data Labels.  Select the other option, Add Data Callouts, for an eye-catching effect.  To change the data label color from black to white, select the data labels and work with the Font Color button in the Font area of the Home tab.  Finally you can pull the number out of the thinnest slice to make things easier to see, this is done by dragging it to it's new location.

This is what the pie chart with modifications looks like:
















Pie chart's are more digestible if they have fewer slices.  If you cut a pie chart too fine, you're likely to leave your audience with information overload.

Line Charts:

The preceding example focused on one column of data.  This example will focus on one row.  The idea is to trace the progress of one space-related industry across the years 1990-1994.  In this example, I graph the revenues from Satellite Services.

To do this:

1.  Select the data that go into the chart.  For this example, that's cells A3 through F3.  Yes I include the label.  You can't hold down the Ctrl key and select the additional cells this time (the top row for the X axis).  If you do that Excel thinks 1990, 1991, 1992, 1993, and 1994 are just another series of data points to plot on the graph.

2.  Click Insert | Recommended Chart Types.  This time, choose Line with Markers from the options.

3.  Modify the chart.

The line on the chart is hard to see.  Clicking the line and then selecting Design | Change Colors gives a set of colors for the line.  Choose black.

Next, add the titles for the chart and for the axis.  One way to add the axis titles is to use the set of tools that appears when you select the chart.  Labeled with a plus sign, the first tool enables you to add elements to the chart.  Clicking it and checking the Axis labels check box puts axis titles on the chart.  Then click an axis title, highlight the text, and type the new title.

To put the years on the x-axis, right click inside the chart to get a pop-up menu and click on Select Data.  In the box labeled Horizontal (Category) Axis Labels, clicking the Edit button opens the Axis Labels dialog box.  A blinking cursor in the Axis label range box shows it's ready for business.  Selecting cells B1 through F1 and clicking OK closes the dialog box.  Clicking OK closes the Select Data Source box and puts the years on the x-axis.

This is what the line chart looks like:




















Sparklines:  A sparkline is a tiny chart you can integrate into text or a table to quickly illustrate a trend.  It's designed to be the size of a word.

Three types of sparklines are available.  One is a line chart, another is a column chart.  The third is a special type of column chart that sports fans will like.  It shows wins and losses.

To show what they look like, I apply the first two to the following table:













To do this:

1.  Insert two columns between Column A and Column B.
2.  In the new blank column B, select cell B2.
3.  Select Insert | Sparklines |Line to open the Create Sparklines dialog box.
4.  In the data range box, enter D2:H2 and click ok.  Then autofill the column.

Repeat these steps for Column C but with the Sparklines Column button.  The results are below:










The Wins Losses sparkline nicely summarizes a sports team's progress throughout a season.  Created with the Wins Losses button in the Sparklines area, the sparklines shown below show the week-to-week progress of the teams in the NFL's NFC division in 2011.  In the data, 1 represents a win and -1 represents a loss.  A win appears as a marker above the line, a loss below the line.







*If you want to delete a sparkline, you don't delete in the usual way.  Instead, you right-click it and select Sparklines from the pop-up menu.  This presents a choice that allows you to clear the sparkline.

Wednesday, September 11, 2013

Statistical Analysis with Excel for Dummies (Notes) Charts (Clustered Column & Stacked Column Chart)

Some Fundamentals:

Excel uses the word "chart" instead of graph.  Most charts (pie chart being an exception) have a horizontal and a vertical axis.  The horizontal axis is called the x-axis and the vertical axis is called the y-axis.

What goes on the horizontal axis (x) is called the independent variable.  And what goes on the vertical axis (y) is called the dependent variable.  One of Excel's chart formats reverses that convention.  Excel calls that reversed-axis format a bar chart.

The terms independent and dependent imply that changes in the vertical direction depend (at least partly) on changes in the horizontal direction.

Inserting a Chart:

When you create a chart, you insert it into a spreadsheet  This immediately clues you that the chart creation tools are in the Charts area of the Insert tab.

Note:  In Excel a chart is dynamic.  This means that after you create a chart, changing its worksheet data results in an immediate change in the chart.

Creating a Clustered Column Chart:

1.  Enter your data into a worksheet (see image below)















2.  Select the data that go into the chart/  (A1:F7 above)

3.  In the Charts area of the Insert tab, select recommended charts.  The Clustered Column is the 5th chart down.

The image below is the resulting chart:





















To relocate the legend (the part below the x-axis that shows what the colors mean), right-click on the legend.  From the pop-up menu, select Format Legend.  This opens the Format Legend panel.  Select the Top Right Radio button to re-position the legend.

To format the axes, click on the y-axis and select the desired color.  Do the same for the x-axis.

To add the axis titles, select Design | Add Chart Element and Select Axis Titles

To add the chart title, click on the tile in the chart and type the new title.  When you add a title (whether axis or chart), you can either just select the title and start typing or you can highlight the title before you start typing.  If you type without highlighting, the new title appears in the formula bar and then in the title area after you press Enter.  If you highlight before you type, the title appears in the title area as you type.

This is what the chart looks like after the modifications:






















Stacked column chart:

If I had selected Excel's seventh recommended chart, I would have created a set of columns that presents the same information in a slightly different way.  Each column represents the total of all the data series at a  point on the x-axis.  Each column is divided into segments.  Each segment's size is proportional to how much it contributes to the total.

You can move a chart to a separate page in the workbook.  Select Design | Move Chart to open the Move Chart dialog box.  Select the New Sheet radio button to add a worksheet and move the chart there.  This is how the chart looks in it's new page:




























This is a nice way of showing percentage changes over the course of time.  If you just want to focus on percentages in one year, another type of graph is more effective (more later)/

Statisticians often use column charts to show how frequently something occurs.  For example, in a thousand tosses of a pair of dice, how many times does a 6 come up?  How many tosses result in a 7?

The x-axis shows each possible outcome of the dice tosses, and the heights of the columns show the frequencies.  Whenever the heights represent frequencies, your column chart is a histogram.

Excel provides a data analysis tool that does everything you need to create a histogram.  It's called, believe it or not, Histogram.  You provide an array of cells that hold all the data, like the outcome of many dice tosses, and an array that holds a list of intervals, like the possible outcomes of the tosses (2-12).  Histogram goes through the data array, counts the frequencies within each interval, and then draws the column chart.


Statiscal Analysis With Excel for Dummies (Notes) - Creating Array Formulas & Descriptive Statistics Tool

Creating your own Array Formulas:

In addtion to Excel's built-in array formulas, you can create your own.  The help things along you can incorporate named arrays.

The image below shows two named arrays, A and B.  If you want each cell in column K to hold the product of the corresponding  cell in column H and column I, follow the steps below the image:


















1.  Select the array that will hold the answers to the array formula.

2  Into the selected array, type the formula A*B

3.  Press Ctrl-Shift-Enter (Not Enter)

Note the formula in the formula bar has curly brackets which indicate an Array Formula.

Descriptive Statistics Tool:

The image below shows an array of numbers in cells B2 through B9 with a column header in B1.

























2.  Click Data | Data Analysis (all the way to the right) to open the Data Analysis dialog box.

3.  Click descriptive statistics and click ok.

4.  Identify the data array.  In the Input Range box, enter the cells that hold the data.  For this example, that's B1 through B9.

5.  Select the Columns radio button to indicate that the data are organized by columns.

6.  Select the Labels in First Row check box, because the Input Range includes the column heading

7.  Select the New Worksheet Ply radio button.  This tells Excel to create a new tabbed sheet within the current worksheet, and to send the results there.

8.  Click the Summary Statistics checkbox and leave others unchecked.  Click OK





































Tuesday, September 10, 2013

Statistical Analysis with Excel For Dummies (Notes) Array Functions, SUMIF, SUMIFS

Array Functions:

Most of Excel's built in functions are formulas that calculate a single value (like a sum) and put that value into a worksheet cell.  Excel has another type of function.  It's called an array function because it calculates multiple values and puts those values into an array of cells, rather than a single cell.

FREQUENCY is a good example of an array function.  It's job is to summarize a group of scores by showing how the scores fall into a set of intervals that you specify.  For example, given these scores: 77, 45, 44, 61, 52, 53, 68, 55 and these intervals: 50, 60, 70, 80.  Frequency shows how many are less than or equal to 50 (2 in this example), how many are greater than 50 and less than or equal to 60 (that would be 3), and so on.  The number of scores in each interval is called a frequency.  A table of the intervals and frequencies is called a frequency distribution.

**It's very important to press Ctrl+Shift+Enter to close the Function Arguments dialog box and put the values in the selected array.















Defining names for arrays of cells:

As you get more into Excel's statistical features, you work increasingly with formulas that have multiple arguments.  Oftentimes, these arguments refer to arrays of cells. If you apply meaningful names to these arrays, it helps you keep straight what your'e doing.

Excel gives an easy way to attach a name to a group of cells:

1.  Put a descriptive name at the top of a column (or to the left of a row) you want to name.
2.  Select the range of cells you want to name.
3.  Right-click on the selected range.
4.  From the pop-up menu, select Define Name
5.  Click OK

Four rules to follow from defining names are:

1.  Begin a name with an alphabetic character
2.  Make sure that the name contains no spaces or symbols.  Use an underscore to denote a space.
3.  Be sure the name is unique within the worksheet
4.  Be sure the name doesn't duplicate any cell reference in the worksheet.

After defining a name for an array of cells, you have the convenience of using the name in a formula.

To keep track of names in a worksheet, select Formulas | Name Manager

A couple of formulas that show just how convenient this naming capability is are SUMIF and SUMIFS.  These formulas add a set of numbers if specific conditions in one cell range (SUMIF) or in more than one cell range (SUMIFS) are met.

For example:



















To sum the data in Column C, but only for the North region.  That is, to only consider a cell in column C if the corresponding cell in column B contains "North".

1.  Select a cell for the formula result

2.  Select Formulas | Math & Trig | SUMIF,  SUMIF has three arguments.  The first, RANGE, is the range of cells to evaluate for the condition to include in the sum (North, South, East, or West in this example).  The second, Criteria, is the specific value in the Range (North, for this example).  The third, Sum_range, holds the value I sum.

3.  In the Function Arguments dialog box, enter the appropriate values for the arguments.

Here's where the another Defined Names button comes in handy.  In that Ribbon are (Formula tab), click the down arrow next to "Use in Formula" to open the drop-down list.  Selecting from that list fills in the Function Arguments dialog box.
















I had to type "North" into the Criteria box, Excel adds the double quotes.

4.  Click OK

In the formula bar, "=SUMIF(Regions, "North", Revenue_Million) appears.  It can be typed exactly that way into the formula bar, without the dialog box or the drop-down list.

Incidentally, the same cell range can be both the Range and the Sum_range.  For example, to sum just the cells for which Revenue_Million is less than 25, that's: =SUMIF(Revenue_Million, "<25", Revenue_Million)

The second argument (Criteria) is always in double quotes.

What about SUMIFs?  That one is useful if you want to find the sum of revenues for North but only for the years 2006 and 2007.  To find this follow these steps:

1.  Select a cell for the formula result
2.  Select Formulas | Math & Trig menu, SUMIFS
3.  Enter data as pictured below:




Statistics for Dummies Notes (Graphing Categorical and Numericall Data)

The most common types of data displays for categorical data are pie charts and bar graphs.

Pie Charts:  Takes categorical data and breaks them down by group, showing the percentage of individuals that fall into each group.  Because a pie chart takes on the shape of a circle, the "slices" that represent each group can easily be compared and contrasted.



Bar Chart: Like a pie chart, a bar graph breaks categorical data down by group.  Unlike a pie chart, it represents these amounts by using bars of different lengths; whereas a pie chart most often reports the amount in each group as percentages, a bar graph uses either the number of individuals in the group (also called the frequency) or the percentage in each group (called the relative frequency).



The most common types of data displays for numerical data are charts, histograms, and boxplots.

Histograms: A Special graph applied to data broken down into numerically ordered groups; for example, age groups such as 10-20, 21-30, 31-40, and so on.  The bars connect to each other in a histogram - as opposed to a bar graph for categorical data, where the bars represent categories that don't have a particular order and are separated.  The height of each bar of a histogram represents either the number of individuals (called frequency) in each group or the percentage of individuals (relative frequency) in each group.  Each individual in the data set falls into exactly one bar.

A histogram provides a snapshot of all the data broken down into numerically ordered groups, making it a quick way to get the big picture of the data, in particular, its general shape.

A histogram tells you three main features of numerical data:

- How the data are distributed among the groups (shape of the data)
- The amount of variability in the data (spread)
- Where the center of the data is (statisticians use different measures)



Boxplot: A one-dimensional graph of numerical data based on the five-number summary, which includes the minimum value, the 25th percentile (Q1), the median, the 75th percentile (Q3), and the maximum value.  In essence, these five descriptive statistics divide the data set into four parts; each part contains 25% of the data.

To make a boxplot:

1.  Find the five-number summary of your data set.
2.  Create a vertical (or horizontal) number line whose scale includes the numbers in the five-number summary and uses appropriate units of equal distance from each other.
3.  Mark the location of each number in the five-number summary just above the number line (for a horizontal boxplot) or just to the right of the number line (for a vertical boxplot).
4.  Draw a box around the marks for the 25th and 75th percentile.
5.  Draw a line in the box where the median is located.
6.  Determine whether or not outliers are present.
To make this determination, calculate the IQR (by subtracting Q3-Q1);then multiply by 1.5.  Add this amount to the value of Q3 and subtract this amount from Q1.  This gives you a wider boundary around the median than the box does. Any data points that fall outside this boundary are determined to be outliers.

7.  If there are no outliers, draw lines from the upper and lower edges of the box out to the minimum and maximum values in the data set.
8.  If there are outliers, indicate their location on the boxplot with *signs.  Instead of drawing a line from the edge of the box all the way to the most extreme outlier, stop the line at the last data value that isn't an outlier.




Monday, September 9, 2013

Statistics for Dummies (Notes on Number Crunching Basics) - Five Number Summary/Interquartile Range

Five Number Summary:

Beyond reporting a single measure of center and/or a single measure of spread, you can create a group of statistics and put them together to get a more detailed description of a data set.

The empirical rule uses the mean and standard deviation in tandem to describe a bell-shaped data set.  In the case where your data are not bell-shaped, you use a different set of statistics (based on percentiles) to describe the big picture of data.  This method involves cutting the data into four pieces (with an equal amount of data in each piece) and reporting the resulting five cutoff points that separate these pieces.  These cutoff points are represented by a set of five statistics that describe how the data are laid out.

The five numbers in a five-number summary are:

1.  The minimum (smallest) number int the data set.
2.  The 25th percentile (also known as the first quartile, or Q1)
3.  The median (50th percentile)
4.  The 75th percentile (also known as the third quartile, or Q3)
5.  The maximum (largest) number int the data set.

For example, suppose you want to find the five-number summary of the following 25 (ordered) exam scores:
43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99.

1.  The minimum is 43.
2.  Multiply .25 *25 = 6.25 and round up to 7.  The 7th number from the right is 68.
3.  Multiply .5*25 = 12.5 and round up to 13.  The 13th number from the right is 77
4.  Multiply .75*25 = 18.75 and round up to 19.  The 19th number from the right is 89
5.  The maximum is 99

To best interpret a five-number summary you can use a boxlot.

Interquartile Range:

The purpose of the five-number summary is to give descriptive statistics for center, variation, and relative standing all in one shot.  The measure of center is the median, and the first quartile, median, and third quartiles are measures of relative standing.

To obtain a measure of variation based on the five-number summary, you can find what's call the interquartile range (or IQR).  The IQR equals Q3 minus Q1 (that is the 75th percentile minus the 25th percentile) and reflects the distance taken up by the innermost 50% of the data.  If the IQR is small, you know a lot of data are close to the median.  If the IQR is large, you know the data are more spread out from the median.  The IQR from the test scores data set is 96-68 = 21, which is fairly large seeing as how test scores only go from 1 to 100.

The interquartile range is a much better measure of variation than the regular range (maximum value minus minimum value because the IQR doesn't take outliers into account.  It focuses ont he distance within the middle 50% of the data.


Statistics for Dummies (Notes on Number Crunching Basics) - Median/Standard Deviation/Percentiles

Descriptive Statistics (Statistic for short) is a number that summarizes or describes some characteristic about a set of data.

How to find the Median:

1.  Order the numbers from smallest to largest
2.  If the data set contains an odd number of numbers, choose the one that is exactly in the middle.
3.  If the data set contains an even number of numbers, take the two numbers that appear in the middle and average them to find the median.

Standard Deviation: Measures how concentrated the data are around the mean.  The more concentrated, the smaller the standard deviation.  Often reported in parentheses: (s=2.68)

The steps for calculating Standard Deviation are:

1.  Find the average of the data set
2.  Take each number in the data set and subtract the mean from it
3.  Square each of the differences
4.  Add up all the results from 3 to get the sum of squares
5.  Divide the sum of squares by the number of numbers in the data set minus 1
6.  Take the square root

Example:

To find the standard deviation of 1, 3, 5, 7:

1.  The mean is 16/4 = 4
2.  -3, -1, 1, 3
3.  9, 1, 1, 9
4.  20
5.  20/3 = 6.67
6. 2.58

In statistics, the 68-95.99.7 rule, also knows as the three-sigma rule or empirical rule, states that nearly all values lie within 3 standard deviations of the mean in a normal distribution.

About 68.27% of the values like within 1 standard deviation of the mean.  Similarly, about 95.45% of the values lie within 2 standard deviations of the mean.

Percentiles: Measures where you stand compared to the rest of the herd.  The kth percentile is a number in the data set that splits the data into two pieces.  The lower piece contains k percent of the data, and the upper pieces contains 100-k percent.  The median is the 50th percentile.

To calculate the kth percentile (where k is any number between one and one hundred, do the following:

1.  Order all the numbers in the data set from smallest to largest
2.  Multiply k percent times the total number of numbers, n.
3a.  If your result from Step 2 is a whole number go to Step 4.  If the result from Step 2 is not a whole number, round it up to the nearest whole number and go to Step 3b.
3b.  Count the numbers in your data set from left to right (from the smallest to the largest number) until you reach the value indicated in Step 3a.  The corresponding value in your data set is the kth percentile.
4.  Count the numbers in your data set from left to right until you reach the one indicated by Step 2.  The kth percentile is the average of that corresponding value in your data set and the value that directly follows it.

For example, suppose you have 25 test scores, and in order from lowest to highest they look like this:
43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99

To find the 90th percentile for these ordered scores:
1.  Multiply 90% times the total number of scores = .9*25 = 22.5.  Round up to 23
2.  Counting from left to right, you go until you find the 23rd number in the data set which is 98.

* A percentile is not a percent; a percentile is a number (or the average of two numbers) in the data set that marks a certain percentage of the way through the data.  Suppose your score on the GRE was reported to be the 80th percentile.  This doesn't mean you answered 80% of the questions correctly.  It means that 80% of the students' scores were lower than yours and 20% higher.









Thursday, September 5, 2013

Statistics for Dummies (Notes on Statistical Terms)

My quest to master statistics naturally starts with the book "Statistics for Dummies".  My next several blog posts will be notes taken from this book. I'm also reading Statistical Analysis with Excel for Dummies and will be taking notes on that as well.

Notes:

Mean: Also known as average.  The mean is the sum of all the numbers divided by the total number of numbers.

Median: Another way to measure the center of a numerical data set. The median is the point at which there are an equal number of data points whose values lie above and below the median value.

Standard Deviation: The amount of variability (or spread) among the numbers in a data set.  As the term implies, a standard deviation is a standard (or typical) amount of deviation (or distance) from the average (or mean).  In very rough terms it's the average distance from the mean.

Standard Score: Represents the number of standard deviations above or below the mean (without caring what the standard deviation or mean actually are)

Distribution and Normal Distribution:  The distribution of a data set (or population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.  When a distribution of categorical data is organized, you see the number or percentage of individuals in each group.  When a distribution of numerical data is organized, they're often ordered from smallest to largest, broken into reasonably sized groups (if appropriate), and then put into graphs and charts to examine the shape, center, and amount of variability in the data.

One of the most well-known distributions is called the normal distribution, also known as the bell-shaped curve.  The normal distribution is based on numerical data that is continuous;its possible values lie on the entire real number line.  68% of the data are centered around the mean (giving you the middle part of the bell).  A graph of a normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution or Z-distribution.  The standard normal distribution is useful for examining the data and determining statistics like percentiles, or the percentage of the data falling between two values.

Central Limit Theorem (CLT): Basically says that for non-normal data, your sample mean has an approximate normal distribution, no matter what the distribution of the original data looks like (as long as your sample size was large enough).  And it doesn't just apply to the sample mean, CLT is also true for other sample statistics, such as the sample proportion.

Z Values: If a data set has a normal distribution, and you standardize all the data to obtain standard scores, those standard scores are called Z-values.

Confidence Interval: When you take a sample statistic (such as the sample mean or sample percentage) and add/subtract a margin of error, you come up with a confidence interval.  A confidence interval represents a range of likely values for the population parameter, based on your sample statistic.  For example, suppose the average time it takes you to drive to work each day is 35 minutes, with a margin of error of plus or minus 5 minutes.  You estimate the average time to drive to work would be anywhere from 30 to 40 minutes.  This estimate is a confidence interval.

Hypothesis Test; A statistical procedure in which data are collected from a sample and measured against a claim about a population parameter.  For example, if a pizza delivery chain claims to deliver all pizzas within 30 minutes of placing the order, on average, you could test whether this claim is true by collecting a random sample of delivery times over a certain period and looking at the average delivery time for that sample.

The claim that's on trial in a hypothesis test is call the null hypothesis. Ho
If the null hypothesis is concluded to be untrue you would believe the alternative hypothesis. Ha

P-Values: All hypothesis tests ultimately use a p-value to weigh the strength of evidence (what the data are telling you about the population).  The p-value is a number between 0 and 1 and interpreted in the following way:

-A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis, so you reject it

-A large p-value (>.05) indicates weak evidence against the null hypothesis, so you fail to reject it.

-P-values very close to the cutoff (.05) are considered to be marginal (could go either way)

Correlation: The extent to which two numerical variables have a linear relationship (that is, a relationship that increases or decreases at a constant rate).

Examples:

The number of times a cricket chirps per second is strongly related to temperature;when it's cold outside, they chirp less frequently, and as the temperature warms up, they chirp at a steadily increasing rate.  The number of cricket chirps and temperature have a strong positive correlation.

The number of crimes (per capita) has often been found to be related to the number of police officers in a given area.  When more police officers patrol the area, crime tends to be lower, and when fewer police officers are present in the same area, crime tends to be higher.  the number of police officers and the number of crimes have a strong negative correlation.

The consumption of ice cream (pints per person) and the number of murders in New York are positively correlated.  That is, as the amount of ice cream sold per person increases, the number of murders increases.

**Correlation isn't able to explain why or how the relationship between two variables , x and y, exists; only that it does exist.

Causation: States that a change in the value of the x variable will cause a change in the value of the y variable.






Wednesday, September 4, 2013

Starting From Scratch

This might be the most ambitious and potentially fool hearty undertaking of my life.  My goal is to become the world's best sports handicapper starting from scratch.  I have no statistics background.  I have no programming knowledge.  And I have little idea of how I"m going to get from knowing next to nothing about sports handicapping to being the best sports handicapper in the world.  But it won't stop me from trying.

I love sports and I love sports betting.  And quite honestly I'm tired of losing.  I'm willing to do whatever it takes to turn myself into a profitable handicapper.  These are my ideas right now and I realize I'll likely look back at this and find them laughable but nevertheless:

-Self educate myself on everything there is to know about statistics.  Essentially I want a degree in statistics through self education and without stepping foot in a classroom.

-Master Excel.  I need to figure out how to translate statistical sports data into betting ideas through Excel.  I realize I may want to use more sophisticated programs as my knowledge advances but for now I'd be happy with using Excel as my medium for translating sports statistics into profitable betting opportunities  Essentially I want to create sports models that predict the outcome of games or contests.

I must admit the primary reason for writing this blog is to store all my information in one place.  Initially I was going to use a pen and paper but that's much more time consuming and not as "pretty" as having everything written in a blog.  It should be easy to come back and reference things I've learned or forgotten this way.

A secondary reason for creating this blog is if in the extremely unlikely chance I'm successful in this insanely monstrous undertaking, that the framework for others to do the same will be there to see.  One thing about sports handicapping/modeling is that there's very little useful information out there.  And for good reason.  People don't want to share their secrets otherwise people will copy their methods and cut into their profits.  This may become a consideration for me at some point as well, but for now I'm not even remotely worried about it.

Enough talking, time to get to it!
-