In relation to Excel, the famous expression “garbage in, garbage out” means that even if you are completing the right calculation you will not get the correct results if the data you are using in your calculation are incorrect.
The same is true of the models you build in Excel. Even if you have good models with correct calculations, you cannot make good decisions using those models if the data you are analyzing are incorrect. In this lesson, we will discuss what it means to gather sample data and use that data to approximate appropriate values for your spreadsheet models.
Difference between a sample and a population
Sample
A sample is part of the population. For example, if I were interested the height of all of the people in the United States, the population would be all people in the United States. A sample would be some subset of the population, or a smaller group of people that live in the United States.
Building our sample correctly can give us an accurate approximation of our population. Key to effective sampling is creating a sample that gives every element of the population the same likelihood of being included in the sample. It is also important to have a large enough sample. Ex. We may want to know the average height of everyone in the United States. We could measure the height of every person in the United States and calculate the average of all of those measurements. But it would be extremely time-consuming and costly to travel everywhere in the United States and measure the height of each person.
More realistically, we could measure a sample of the population (for example, 100 people) and use the average height of the people in our sample as an estimate of the average height of the population of the United States.
Remember! When you’re calculating the average or standard deviation on a sample of data, we are creating an estimate of the average or standard deviation of the population. Since we do not have all of the data from the entire population, it is impossible to calculate the “true” average or standard deviation for the population. Sample Average
Calculating the sample average is the first step in computing statistics.
The regular =AVERAGE(num1,num2...) is the function used to calculate the average of the sample. Standard Deviation
Calculating the standard deviation is the second step.
The standard deviation is the measure of how diverse the elements in a data set are. We calculate the standard deviation of a sample using the STDEV.S function and the standard deviation of a population using the STDEV.P function. A large standardization implies that there is a lot of diversity in our data, while a small standard deviation means that the individual pieces of data are relatively similar. =STDEV.S(number1, [number2],…) is the standard deviation of a sample. =STDEV.P(number1, [number2],…) is the standard deviation of an entire population. Confidence Intervals
To examine the quality of our estimates, we construct confidence intervals. A confidence interval is the range of values within which we believe the “true” value for the population falls. We calculate a confidence interval using the CONFIDENCE.NORM function in Excel.
=CONFIDENCE.NORM(alpha, standard_dev, size) alpha is the level of confidence that we want to have with our estimate. Alpha is actually calculated as 1—our confidence level. So, if we want to be 95% confident in our estimate, we would set alpha to be .05 (or 1-95%). This would mean that we are willing to accept a 5% chance that the “true” value will fall outside our confidence interval. standard_dev is the sample standard deviation for our analysis and the "size" represents the size of our sample. Using the CONFIDENCE.NORM function, we are able to calculate the range of values within which the “true” value can be found. Estimating Inputs with Correlations and Forecasts
Forecasting
Forecasting allows you to make predictions about one set of values in your spreadsheet using information you know about another set of values.
There needs to be an underlying relationship between the two sets of values before we can forecast any more values. To determine the strength of a relationship between two sets of values in Excel, we will calculate a correlation coefficient. The FORECAST function predicts a future value for a set of numbers based on a known relationship and trend in another set of values. The forecast value is our best guess given the information that we have. It makes sense, then to construct a confidence interval around the forecast. This way you can get a feel for the range of likely values for your estimate. Requires three arguments: X, known_Y’s, and known_X’s. Y is the dependent variable and the variable we’re trying to predict. X is the independent variable and the variable being used to predict Y. known_X’s argument refers to the set of values that corresponds to all of the values for X. known_Y’s argument refers to the set of values that corresponds to all of the values for Y.
Correlation Coefficient
A correlation is a statistical relationship between two things. Correlation coefficients range between -1 and 1.
The CORREL function in Excel calculates a correlation coefficient. The CORREL function requires two “arrays” as arguments. An “array” is a set of values. There have to be at least two numbers in the set for it to qualify as an array. Positive Correlation indicates that when values for one thing increase, values for the other also increase. Ex. An example of two things that are positively correlated would be ice cream sales and outside temperature. When things have a strong positive correlation, the correlation coefficient will be close to 1. Negative Correlation indicates that when values for one thing decreases, values for the other decreases. Ex. As temperatures go down, hot chocolate sales go up. Strong negative correlations are close to -1.
Weak Correlation Coefficient is when the coefficient is close to 0 (between -.3 and about .3) Moderate Correlation Coefficient is when the coefficient is between -.3 and -.7 or between .3 and .7 Strong Correlation Coefficient is between .7 and 1 or between -.7 and -1.