Understanding Linear Regression: From Marketing to Mathematics
Imagine you’re a marketing manager trying to predict next month’s sales based on your advertising budget. How much should you spend to hit your target? This is where linear regression comes in—one of the most fundamental techniques in machine learning and statistics. Let’s explore how we can find mathematical relationships in data to make accurate predictions.
The Marketing Problem
What if we could predict future sales based on our advertising expenses? Linear regression makes this possible by finding patterns in historical data.
Marketing Dataset
Suppose we have the following historical data from our company:
| Monthly TV Advertisement Expense | Monthly Sales |
|---|---|
| 120 | 5 |
| 125 | 7 |
| 140 | 8 |
| 110 | 6 |
Our goal: find a function that represents the relationship between the input variable $x$ (advertising expense) and the output variable $y$ (sales), so that we can predict the output for new (future) data. We assume the relationship is linear.
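Before formalizing anything, we can sketch this fit with NumPy. This is an illustrative sketch: the variable names and the planned expense of 130 are our own choices, not part of the dataset.

```python
import numpy as np

# Historical data from the table above
expense = np.array([120, 125, 140, 110])  # monthly TV ad expense
sales = np.array([5, 7, 8, 6])            # monthly sales

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(expense, sales, 1)
print(f"sales ≈ {slope:.2f} * expense + {intercept:.2f}")

# Predict sales for a hypothetical planned expense of 130
predicted = slope * 130 + intercept
print(f"Predicted sales at expense 130: {predicted:.2f}")
```

`np.polyfit` performs exactly the least-squares fit this post derives by hand in the sections below.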
What is a Linear Function?
A linear function is a function whose graph forms:
- A straight line in 2-dimensional space
- A flat plane in 3-dimensional space
- A hyperplane in higher dimensions
In mathematical terms, a simple linear function can be written as:

$$f(x) = w_0 + w_1 x$$

where $w_0$ is the intercept and $w_1$ is the slope.
The Stochastic Nature of Real Data
In reality, we don’t know exactly all the variables that affect our monthly sales. There are many factors beyond TV advertising—seasonality, competitor actions, economic conditions, and random chance. This uncertainty means our function needs to account for randomness.
Real-world relationships are never perfect. By adding an error term, we acknowledge that our model won’t capture everything perfectly—and that’s okay! We’re looking for useful approximations, not perfect predictions.
We modify our function to include a random variable:

$$y = w_0 + w_1 x + \epsilon$$

where $\epsilon$ (epsilon) is the error term or residual.
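To see what the error term does, we can simulate data from this model. The parameter values, noise level, and seed below are illustrative assumptions, not estimates from the dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" relationship: y = w0 + w1 * x + epsilon
w0, w1 = -3.4, 0.08                      # illustrative parameter values
x = rng.uniform(100, 150, size=200)      # simulated advertising expenses
epsilon = rng.normal(0, 0.5, size=200)   # random factors we cannot observe
y = w0 + w1 * x + epsilon

# The noise pulls individual points off the line, but not systematically:
print("mean of epsilon:", epsilon.mean())
```

Each simulated point deviates from the line by a random amount, yet the deviations average out near zero, which is exactly the "useful approximation" the text describes.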
Understanding the Error Function
The error for a data point $i$ is the difference between the actual value $y_i$ and our predicted value $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i$$

This represents how far our prediction is from the true value.
Our goal is to minimize this error function. Our linear function may not fit the data exactly, but we can tolerate some error—we just want to make it as small as possible across all our data points.
Finding the Best Line
There are infinite possible lines (linear functions) we could draw through our data. How do we find the best one?
We want to search for the line that makes the smallest total error across all observations. But how do we measure “total error”?
First Attempt: Sum of Errors
Our first instinct might be to simply add up all the errors:

$$E = \sum_{i=1}^{m} e_i = \sum_{i=1}^{m} (y_i - \hat{y}_i)$$
This approach has a critical flaw! Consider a line positioned above all our data points: every error would be negative. Moving the line even further up makes the errors still more negative, so the sum keeps decreasing; minimizing it would push the line away from the data entirely.
The problem: Positive and negative errors cancel each other out, so this metric doesn’t properly capture the magnitude of our mistakes.
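We can check this cancellation numerically on the table's data. The slope and intercept below are just one candidate line, chosen for illustration:

```python
import numpy as np

expense = np.array([120, 125, 140, 110])
sales = np.array([5, 7, 8, 6])

# A candidate line (illustrative slope and intercept)
predicted = 0.08 * expense - 3.4

errors = sales - predicted
print("individual errors:", errors)             # mix of positive and negative
print("sum of errors:    ", errors.sum())       # cancels to (near) zero
print("sum of squares:   ", (errors ** 2).sum())  # clearly nonzero
```

The signed errors sum to essentially zero even though none of the predictions is exact, so the plain sum tells us nothing about how good the fit actually is.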
Better Approaches: Non-Negative Metrics
We need a metric where all errors are non-negative. Two popular choices are:
Sum of Absolute Errors (SAE)

$$\text{SAE} = \sum_{i=1}^{m} |y_i - \hat{y}_i|$$
- Treats all errors equally
- Same weight for small and large errors
- More robust to outliers
Sum of Squared Errors (SSE)

$$\text{SSE} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$
- Penalizes larger errors more heavily
- Small errors get small weights
- Large errors get large weights
- Mathematically convenient
We typically use SSE (Sum of Squared Errors) because:
- It heavily penalizes outliers, making our model more sensitive to large mistakes
- It’s mathematically differentiable, making optimization easier
- It has nice statistical properties (related to maximum likelihood estimation)
Our optimization metric becomes:

$$\text{SSE} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{m} \left(y_i - w_0 - w_1 x_i\right)^2$$
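The different weighting of small and large errors is easy to see on a made-up set of residuals (the values below are illustrative, not from the dataset):

```python
import numpy as np

errors = np.array([0.5, -0.5, 0.5, 4.0])  # three small errors and one outlier

sae = np.abs(errors).sum()   # each error contributes its magnitude
sse = (errors ** 2).sum()    # each error contributes its square

print(f"SAE = {sae}")   # the outlier contributes 4.0 out of 5.5
print(f"SSE = {sse}")   # the outlier contributes 16.0 out of 16.75
```

Under SAE the outlier accounts for under three quarters of the total; under SSE it dominates almost completely, which is exactly the "quadratic penalty" the bullets above describe.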
Extending to Multiple Variables
So far, we’ve looked at linear relationships with one input variable. But what if our sales depend on multiple factors—TV advertising, radio advertising, and social media spending?
Two Variables: A Flat Plane
When we have two input variables, our linear relationship becomes a flat plane in 3D space.
Two-Variable Linear Function
With two input variables $x_1$ (TV advertising) and $x_2$ (radio advertising):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$$

Our error metric becomes:

$$\text{SSE} = \sum_{i=1}^{m} \left(y_i - w_0 - w_1 x_{i1} - w_2 x_{i2}\right)^2$$
General Case: n-Dimensional Space
For $n$ input variables $x_1, \dots, x_n$, we can write the linear function as:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w_0 + \sum_{j=1}^{n} w_j x_j$$

And our error metric becomes:

$$\text{SSE} = \sum_{i=1}^{m} \Big(y_i - w_0 - \sum_{j=1}^{n} w_j x_{ij}\Big)^2$$
Matrix Formulation: The Elegant Approach
Writing out all those terms becomes cumbersome. We can simplify everything using matrix notation!
Matrix notation isn’t just about making equations look prettier—it allows us to handle thousands of variables with the same simple formula!
We can express our entire problem compactly:

$$\text{SSE}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$

where:
- $\mathbf{X}$ is our $m \times (n+1)$ input data matrix ($m$ observations, $n$ features, plus a leading column of ones for the intercept)
- $\mathbf{y}$ is our $m \times 1$ output data vector
- $\mathbf{w}$ is our $(n+1) \times 1$ weight vector (parameters to find)
- $\|\cdot\|_2^2$ is the squared L2 norm (sum of squared elements)
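A quick sketch of the matrix form, using the table's data with a column of ones prepended for the intercept. The weight values are an illustrative guess, not the optimum:

```python
import numpy as np

# m = 4 observations; intercept column of ones plus one feature column
X = np.column_stack([np.ones(4), [120, 125, 140, 110]])
y = np.array([5, 7, 8, 6])
w = np.array([-3.4, 0.08])  # [intercept, slope], an illustrative guess

residual = y - X @ w
sse = residual @ residual   # same as np.linalg.norm(residual) ** 2
print(f"SSE(w) = {sse}")
```

The single expression `y - X @ w` replaces the nested sums of the previous section, and it would look identical with a thousand feature columns.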
Solving for the Optimal Weights
Now comes the mathematical derivation to find the weights that minimize our error function.
We’re about to use calculus to find the minimum of our error function. The key insight: at a minimum, the derivative (gradient) equals zero!
Let’s expand the matrix expression:

$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X}\mathbf{w} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X}\mathbf{w}$$

Since $\mathbf{y}^\top \mathbf{X}\mathbf{w}$ and $\mathbf{w}^\top \mathbf{X}^\top \mathbf{y}$ are scalars and equal to each other:

$$\text{SSE}(\mathbf{w}) = \mathbf{y}^\top \mathbf{y} - 2\,\mathbf{w}^\top \mathbf{X}^\top \mathbf{y} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X}\mathbf{w}$$

To minimize $\text{SSE}(\mathbf{w})$, we take the gradient with respect to $\mathbf{w}$ and set it equal to zero:

$$\nabla_{\mathbf{w}} \text{SSE}(\mathbf{w}) = -2\,\mathbf{X}^\top \mathbf{y} + 2\,\mathbf{X}^\top \mathbf{X}\mathbf{w} = \mathbf{0}$$

Solving for $\mathbf{w}$:

$$\mathbf{X}^\top \mathbf{X}\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}$$

The optimal weight vector is given by the normal equation:

$$\mathbf{w}^{*} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
This is the closed-form solution for linear regression!
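The normal equation translates directly into NumPy. Note that in practice one solves the linear system rather than forming an explicit inverse, which is both faster and numerically safer:

```python
import numpy as np

X = np.column_stack([np.ones(4), [120, 125, 140, 110]])
y = np.array([5, 7, 8, 6])

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X
w = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept, slope:", w)
```

`np.linalg.lstsq(X, y, rcond=None)` computes the same solution via a more robust factorization and is the usual choice when `X.T @ X` may be ill-conditioned.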
The Computational Challenge
We’ve found a beautiful formula, but there’s a practical problem lurking beneath.
Computing the inverse of a matrix is computationally expensive. The normal equation requires inverting $\mathbf{X}^\top \mathbf{X}$, which is an $(n+1) \times (n+1)$ matrix:
- For 1,000 features: we need to invert a roughly $1{,}000 \times 1{,}000$ matrix
- For 100,000 features: we need to invert a roughly $100{,}000 \times 100{,}000$ matrix

Matrix inversion has complexity $O(n^3)$, and even forming $\mathbf{X}^\top \mathbf{X}$ costs $O(mn^2)$, making this solution impractical for large problems.
Instead of computing the inverse directly, we use iterative optimization methods like Gradient Descent. These methods:
- Take small steps toward the minimum
- Don’t require matrix inversion
- Scale much better to large datasets
- Are the foundation of modern machine learning
Gradient Descent will be explained in detail in our next post!
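As a tiny preview, here is a minimal gradient-descent sketch on the same marketing data. The feature standardization, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

X = np.column_stack([np.ones(4), [120, 125, 140, 110]])
y = np.array([5, 7, 8, 6], dtype=float)

# Standardize the feature column so a single learning rate works well
mean, std = X[:, 1].mean(), X[:, 1].std()
Xs = X.copy()
Xs[:, 1] = (X[:, 1] - mean) / std

w = np.zeros(2)   # start from an arbitrary point
lr = 0.1          # learning rate, an illustrative choice
for _ in range(500):
    gradient = -2 * Xs.T @ (y - Xs @ w)  # gradient of SSE(w) from the derivation
    w -= lr * gradient / len(y)          # take a small step downhill

print("SSE after descent:", np.sum((y - Xs @ w) ** 2))
```

No matrix is ever inverted: each iteration only needs matrix-vector products, which is why this approach scales to problems where the normal equation is infeasible.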
Key Takeaways
What We’ve Learned:
- Linear regression finds the best-fitting linear relationship between input and output variables
- We measure “best” using Sum of Squared Errors (SSE)
- The normal equation $\mathbf{w}^{*} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$ gives the optimal solution in closed form
- For large datasets, we need iterative methods like Gradient Descent
- Linear regression extends naturally from one variable to many dimensions
Practice Problems
Given three data points $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$, calculate the errors for each point if our line is $\hat{y} = w_0 + w_1 x$.
Click for hint
Use the formula $e_i = y_i - \hat{y}_i$, where $y_i$ is the actual value and $\hat{y}_i = w_0 + w_1 x_i$ is the predicted value.
Click for solution
For each point:
- Point 1: $e_1 = y_1 - (w_0 + w_1 x_1)$
- Point 2: $e_2 = y_2 - (w_0 + w_1 x_2)$
- Point 3: $e_3 = y_3 - (w_0 + w_1 x_3)$

Sum of Squared Errors: $\text{SSE} = e_1^2 + e_2^2 + e_3^2$
Explain why we square the errors in SSE instead of just taking the absolute value. What are the trade-offs?
Click for hint
Think about how different error magnitudes are weighted, and consider the mathematical properties needed for optimization.
Click for solution
Advantages of SSE (squared errors):
- Heavily penalizes large errors (outliers have quadratic impact)
- Differentiable everywhere (smooth optimization)
- Related to Gaussian assumptions in statistics
- Unique global minimum (convex function)
Advantages of SAE (absolute errors):
- More robust to outliers
- Treats all errors equally
- Better for data with heavy-tailed distributions
Trade-off: SSE is mathematically convenient and commonly used, but SAE is more robust when you have outliers or don’t want to over-penalize large errors.
References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. - An Introduction to Statistical Learning
- Posik, P. - Linear Methods for Regression and Classification
- Boyd, S., & Vandenberghe, L. - Convex Optimization
What’s Next?
In our next post, we’ll dive deep into Gradient Descent—the iterative optimization method that makes linear regression practical for massive datasets and forms the foundation for training neural networks!