My quest to master statistics naturally starts with the book "Statistics for Dummies". My next several blog posts will be notes taken from this book. I'm also reading "Statistical Analysis with Excel for Dummies" and will be taking notes on that as well.
Notes:
Mean: Also known as average. The mean is the sum of all the numbers divided by the total number of numbers.
Median: Another way to measure the center of a numerical data set. The median is the point at which there are an equal number of data points whose values lie above and below the median value.
Standard Deviation: The amount of variability (or spread) among the numbers in a data set. As the term implies, a standard deviation is a standard (or typical) amount of deviation (or distance) from the average (or mean). In very rough terms it's the average distance from the mean.
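To make the three definitions above concrete, here is a minimal sketch using Python's standard `statistics` module. The data set is made up purely for illustration. It also computes the plain average distance from the mean, to show why "average distance" is only a rough description of the standard deviation.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # sum of the values divided by how many there are
median = statistics.median(data)  # middle value once the data are sorted
spread = statistics.pstdev(data)  # standard deviation, treating data as the whole population

# The literal "average distance from the mean", for comparison.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

print("mean:", mean)                       # 5
print("median:", median)                   # 4.5
print("standard deviation:", spread)       # 2.0
print("mean absolute deviation:", mean_abs_dev)  # 1.5
```

The standard deviation (2.0) and the literal average distance (1.5) are close but not identical, because the standard deviation squares each distance before averaging and then takes a square root.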
Standard Score: Represents the number of standard deviations a value lies above or below the mean (without caring what the standard deviation or mean actually are).
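A standard score is just a subtraction followed by a division. The sketch below uses a hypothetical helper, `standard_score`, and made-up exam numbers (mean 70, standard deviation 10) to show the idea.

```python
def standard_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Hypothetical exam: mean 70, standard deviation 10.
print(standard_score(85, 70, 10))  # 1.5  (one and a half standard deviations above)
print(standard_score(55, 70, 10))  # -1.5 (one and a half standard deviations below)
```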
Distribution and Normal Distribution: The distribution of a data set (or population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group. When a distribution of numerical data is organized, they're often ordered from smallest to largest, broken into reasonably sized groups (if appropriate), and then put into graphs and charts to examine the shape, center, and amount of variability in the data.
One of the most well-known distributions is called the normal distribution, also known as the bell-shaped curve. The normal distribution is based on numerical data that is continuous; its possible values lie on the entire real number line. About 68% of the data lie within one standard deviation of the mean (giving you the middle part of the bell). A normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution or Z-distribution. The standard normal distribution is useful for examining the data and determining statistics like percentiles, or the percentage of the data falling between two values.
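The percentages tied to the standard normal distribution can be checked with `statistics.NormalDist` from the Python standard library. This is only a sketch; the variable names are my own.

```python
from statistics import NormalDist

# Standard normal (Z) distribution: mean 0, standard deviation 1.
z_dist = NormalDist(mu=0, sigma=1)

# Share of the data falling between two values = difference of the cumulative probabilities.
within_one_sd = z_dist.cdf(1) - z_dist.cdf(-1)   # roughly 0.68
within_two_sd = z_dist.cdf(2) - z_dist.cdf(-2)   # roughly 0.95

# Percentile of a standard score: the share of the data at or below it.
percentile_of_1_5 = z_dist.cdf(1.5)              # roughly 0.93

print(round(within_one_sd, 3), round(within_two_sd, 3), round(percentile_of_1_5, 3))
```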
Central Limit Theorem (CLT): Basically says that your sample mean has an approximately normal distribution, no matter what the distribution of the original data looks like (as long as your sample size is large enough). And it doesn't just apply to the sample mean; the CLT also holds for other sample statistics, such as the sample proportion.
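A small simulation can make the CLT easier to believe. The sketch below draws many samples from a strongly skewed (exponential) distribution and looks at how the sample means behave. The sample size of 50, the 2,000 repetitions, and the fixed seed are arbitrary choices of mine, not anything from the book.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Draw n values from an exponential distribution (skewed, true mean 1) and average them.
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

# Repeat the sampling many times and collect the sample means.
sample_means = [sample_mean(50) for _ in range(2000)]

center = statistics.mean(sample_means)    # should be near the true mean, 1
spread = statistics.stdev(sample_means)   # should be near 1 / sqrt(50), about 0.14

# For a bell-shaped curve, roughly 68% of values fall within one standard deviation.
share_within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)

print(round(center, 3), round(spread, 3), round(share_within_one_sd, 3))
```

Even though the original data are far from bell-shaped, the collection of sample means clusters symmetrically around the true mean, which is what the theorem predicts.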
Z Values: If a data set has a normal distribution, and you standardize all the data to obtain standard scores, those standard scores are called Z-values.
Confidence Interval: When you take a sample statistic (such as the sample mean or sample percentage) and add/subtract a margin of error, you come up with a confidence interval. A confidence interval represents a range of likely values for the population parameter, based on your sample statistic. For example, suppose the average time it takes you to drive to work each day is 35 minutes, with a margin of error of plus or minus 5 minutes. You estimate the average time to drive to work would be anywhere from 30 to 40 minutes. This estimate is a confidence interval.
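Here is a rough sketch of how such an interval could be computed from raw data. The commute times are invented, the helper name `confidence_interval` is hypothetical, and the calculation assumes the sample is large enough for a normal-based (z) margin of error to be reasonable.

```python
import math
import statistics
from statistics import NormalDist

def confidence_interval(sample, confidence=0.95):
    """Sample mean plus/minus a z-based margin of error (assumes a reasonably large sample)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Number of standard errors to add and subtract (about 1.96 for 95% confidence).
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z_star * statistics.stdev(sample) / math.sqrt(n)
    return mean - margin_of_error, mean + margin_of_error

# Hypothetical commute times in minutes.
commute_times = [28, 41, 35, 33, 38, 30, 36, 44, 31, 35,
                 39, 29, 34, 37, 32, 40, 35, 33, 36, 42,
                 27, 35, 38, 31, 34, 39, 36, 30, 37, 35]

low, high = confidence_interval(commute_times)
print(f"Estimated average commute: {low:.1f} to {high:.1f} minutes")
```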
Hypothesis Test: A statistical procedure in which data are collected from a sample and measured against a claim about a population parameter. For example, if a pizza delivery chain claims to deliver all pizzas within 30 minutes of placing the order, on average, you could test whether this claim is true by collecting a random sample of delivery times over a certain period and looking at the average delivery time for that sample.
The claim that's on trial in a hypothesis test is called the null hypothesis, written H0.
If you conclude that the null hypothesis is untrue, you accept the alternative hypothesis, written Ha.
P-Values: All hypothesis tests ultimately use a p-value to weigh the strength of evidence (what the data are telling you about the population). The p-value is a number between 0 and 1 and interpreted in the following way:
-A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it.
-A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
-P-values very close to the cutoff (0.05) are considered to be marginal (could go either way).
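As a sketch of how a p-value might be obtained for the pizza delivery example, the code below runs a one-sided z-test. The summary numbers (40 deliveries, a sample mean of 31.5 minutes, a sample standard deviation of 4 minutes) are invented for illustration, and a z-test is my assumption here; the notes don't say which test the book would use.

```python
import math
from statistics import NormalDist

# H0: the average delivery time is 30 minutes (the chain's claim).
# Ha: the average delivery time is more than 30 minutes.
claimed_mean = 30.0

# Hypothetical sample summary.
n = 40
sample_mean = 31.5
sample_sd = 4.0

# Standard score of the sample mean, assuming H0 is true.
standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - claimed_mean) / standard_error   # roughly 2.37

# p-value: chance of a sample mean at least this large if H0 were true.
p_value = 1 - NormalDist().cdf(z)

alpha = 0.05  # the usual cutoff
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 4), decision)
```

With these made-up numbers the p-value comes out well below 0.05, so the claim of a 30-minute average would be rejected.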
Correlation: The extent to which two numerical variables have a linear relationship (that is, a relationship that increases or decreases at a constant rate).
Examples:
The number of times a cricket chirps per second is strongly related to temperature; when it's cold outside, crickets chirp less frequently, and as the temperature warms up, they chirp at a steadily increasing rate. The number of cricket chirps and temperature have a strong positive correlation.
The number of crimes (per capita) has often been found to be related to the number of police officers in a given area. When more police officers patrol the area, crime tends to be lower, and when fewer police officers are present in the same area, crime tends to be higher. The number of police officers and the number of crimes have a strong negative correlation.
The consumption of ice cream (pints per person) and the number of murders in New York are positively correlated. That is, as the amount of ice cream sold per person increases, the number of murders increases.
Note: Correlation isn't able to explain why or how the relationship between two variables, x and y, exists; only that it does exist.
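Correlation is usually summarized as a single number between -1 and 1 (the Pearson correlation coefficient). The sketch below computes it by hand with a hypothetical helper, `pearson_r`. The chirp and crime figures are invented to mimic the first two examples above; they are not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

# Hypothetical data: temperature (degrees C) vs. cricket chirps per second.
temperatures = [10, 15, 20, 25, 30]
chirps = [1.0, 1.6, 2.1, 2.4, 3.1]

# Hypothetical data: police officers in an area vs. crimes reported.
officers = [10, 20, 30, 40, 50]
crimes = [95, 80, 72, 55, 40]

r_chirps = pearson_r(temperatures, chirps)   # close to +1: strong positive correlation
r_crimes = pearson_r(officers, crimes)       # close to -1: strong negative correlation
print(round(r_chirps, 3), round(r_crimes, 3))
```

Nothing in this calculation says anything about cause; it only measures how closely the points follow a straight line.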
Causation: States that a change in the value of the x variable will cause a change in the value of the y variable.