(Note: This post is based on content from a Microsoft Excel 2003 Assistance page that does a good job of going through the step-by-step regression process in Excel, but a poor one of explaining the finance behind the example. To do this yourself, you’ll need to follow instructions to load the Analysis Tool-pak Add-In, which should already come installed with Excel.)
Imagine that you run a factory producing cool little widgets. You have the monthly data on the number of widgets you produced last year and how much it cost to produce them. How would you find out your variable and fixed costs for producing each widget? Management wants to know how much it will cost to double last year’s production numbers. What would you tell them?
|Month||Production cost||# of Widgets|
In finance, costs are generally separated into “variable” and “fixed”. Fixed costs are costs you pay for whether you’re producing 0 or 1 million widgets. Examples might be the portion of rents for the factory space, costs for setup of equipment, utility costs, etc. Variable costs, on the other hand, are things that do change with production numbers. For example, you might need twice as much material to make 1,000 widgets than 500 of them.
Together, fixed costs and variable costs add up to the total cost of production. But in this case, we’re given the total instead, and have to determine its parts. This is sort of like saying, “I’m thinking of two positive numbers that add up to the number 5. What numbers am I thinking of?” The answers are infinite: 0 and 5, 2 and 3, 4.1 and 0.9, etc. But luckily, if we’re given good data, using linear regression, we can figure out which pair best “fits” the data.
So, we load this data into an Excel table and make sure we’ve added in the Analysis Tool Pak. Then we choose Tools >> Data Analysis >> Regression and hit “OK”. For “Input Y Range” we select the 12 production costs we have, and for “Input X Range” we select the 12 numbers we have under # of widgets produced. To get a better idea of what we’re doing, also check the box next to “Line Fit Plots” under the “Residuals” section. Note that Excel’s Regression message box defaulted to a 95% confidence level, which we won’t adjust. (If you’re confused, this is a good place to refer back to the original Microsoft assistance page for a step-by-step guide with pictures.)
Hit “OK” and the output is displayed on a new worksheet.
The fit (or equation) of the data can be easily found. In our example, we had one variable, cost, depending on another variable, # of units. We believe that the data are best described by a linear line, and if you can remember back to your basic algebra class, the general equation for a line looks like this:
The same situation applies here: “m” is also known as “X Variable 1” in our output, and “b” is the “Intercept”. Our equation is therefore:
But hold on. How good is our fit? In no other place is analysis is the saying “Garbage in, garbage out” more true than in statistics.
Honestly, to fully understand the meaning of all the numbers on this page, you’ll need to learn more about statistics than I can provide. But I’ll highlight some of the more important statistics here:
- R-square: This number explains how well the fit (or equation) explains the variance in the data. R-square can vary between 0 and 1, 1 being a theoretically “perfect” fit. In reality, if you get an R-square of 1.0, consider whether your data might have been doctored :) In our case, we have a very respectable value of 0.954, meaning 95.4% of our data can be explained by this fit. With real-world data, you often can’t get a fit that explains 95% of the data.
- Standard error: These figures represent the uncertainty around the coefficients of your fit. In this case, we see uncertainties of 3388.39 and 1.447. This means that when we use the equation to predict our costs, we shouldn’t just give one figure for the variable, fixed, and total costs but an upper and lower estimation as well. Everything has uncertainty. Suppose I tell you to take out a ruler and measure the width of your hand. You might measure 5 times and get 5 different readouts, partly because the ruler isn’t precise, partly due to human error, maybe the ruler wasn’t straight a few times, etc. So the idea that you can give me one number about the width of your hand with 100% certainty is erroneous.
- P-value: These numbers are read as percentages as well and represent the probability that random chance would explain the data sample better than our coefficient. In our case, our p-values for each coefficient are 0.45% and 0.0000049%. In general, the lower the p-value, the better, since it says that that variable is significant in explaining your data.
Back to finance. Now that we have our equation, we can see that our production costs are mostly due to fixed costs of $12,320. Picture this to mean that even if we stopped producing widgets tomorrow, we’d still have to be paying the lease for the factory, utilities, management salaries, etc. Each widget also requires a variable cost of $20.30 to make.
Last year, we had a total production of 27,900 widgets. If we double production to 55,800, then we could expect an average cost of:
Notice that this is a lot less than just taking the total production cost from last year for all units and multiplying by two. That gets you $1.465M and ignores the principle of economies of scale and fixed costs. Assuming you were presenting to management, would you just give them this $1.18M figure? Probably not; most likely, you’d give them a range. Remember from above that we had set left the confidence level at 95% and discussed the Standard Errors from our output, which explained the uncertainty we had in our equation? Here’s how you would use this bit of information:
Maximum cost = (20.956+1.447) * 55,800 + (12319.71+3388.39) = $1.27M
In your presentation, you say, “I am 95% confident that doubling our production next year would cost between $1.10M to $1.27M.”
Despite being such a long writeup, regression really is useful in business. It can be used for all sorts of cost modeling in finance and even competitive analysis in marketing. But managers and those who work on the “business” side of things tend to underutilize statistics in their work, in my experience. About the only people you’ll find who regularly use statistical analysis in their work are engineers and some operations and plant managers (who tend to have engineering backgrounds). Part of this is because many managers and analysts not have been taught how to apply statistics to real-life situations. Part of it is due to the difficulty in communicating results from statistical analysis in a way that others (particularly higher-ups) will understand: in plain English.
Put simply, if you’re looking for a skill that will give you an edge in your job (and wouldn’t hurt with getting promotions), I highly recommend taking courses or learning about how to apply statistical analysis to business.