Understanding Linear Regression: From Marketing to Mathematics


Imagine you’re a marketing manager trying to predict next month’s sales based on your advertising budget. How much should you spend to hit your target? This is where linear regression comes in—one of the most fundamental techniques in machine learning and statistics. Let’s explore how we can find mathematical relationships in data to make accurate predictions.

The Marketing Problem

💭

What if we could predict future sales based on our advertising expenses? Linear regression makes this possible by finding patterns in historical data.

Marketing Dataset

Suppose we have the following historical data from our company:

| Monthly TV Advertisement Expense | Monthly Sales |
|---|---|
| 120 | 5 |
| 125 | 7 |
| 140 | 8 |
| 110 | 6 |

Our goal: Find a function that represents the relationship between advertising expense (input) and sales (output) to predict future sales.

We want to find a function that represents the relationship between the input variable ($x$, advertising expense) and the output variable ($y$, sales) so that we can predict the output for new (future) data. We assume the relationship is linear.

What is a Linear Function?

Linear Function

A linear function is a function that forms:

  • A straight line in 2-dimensional space
  • A flat plane in 3-dimensional space
  • A hyperplane in higher dimensions

In mathematical terms, a simple linear function can be written as:

$$\hat{y} = f(x) = w_0 + w_1 x$$

where $w_0$ is the intercept and $w_1$ is the slope.
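As a quick sketch, this function is easy to express in code; the weights below ($w_0 = 2.0$, $w_1 = 0.5$) are made-up values for illustration, not fitted ones:

```python
# A simple linear function y_hat = w0 + w1 * x.
# The weights (w0 = 2.0, w1 = 0.5) are made up for illustration.
def predict(x, w0=2.0, w1=0.5):
    """Predicted output for input x."""
    return w0 + w1 * x

print(predict(120))  # 2.0 + 0.5 * 120 = 62.0
```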

The Stochastic Nature of Real Data

In reality, we don’t know exactly all the variables that affect our monthly sales. There are many factors beyond TV advertising—seasonality, competitor actions, economic conditions, and random chance. This uncertainty means our function needs to account for randomness.

💡 Why Add an Error Term?

Real-world relationships are never perfect. By adding an error term, we acknowledge that our model won’t capture everything perfectly—and that’s okay! We’re looking for useful approximations, not perfect predictions.

We modify our function to include a random variable:

$$y = w_0 + w_1 x + \varepsilon$$

where $\varepsilon$ (epsilon) is the error term (residual). Note that $\varepsilon$ attaches to the observed value $y$, not to the prediction $\hat{y} = w_0 + w_1 x$; it captures the part of $y$ the model cannot explain.

Understanding the Error Function

Error (Residual)

The error for a data point $(x_i, y_i)$ is the difference between the actual value and our predicted value:

$$e_i = y_i - (w_0 + w_1 x_i)$$

This represents how far our prediction is from the true value.

Our goal is to minimize this error function. Our linear function may not fit the data exactly, but we can tolerate some error—we just want to make it as small as possible across all our data points.
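To make the residual concrete, here is a small sketch that computes $e_i$ for the marketing data; the line ($w_0 = 1.0$, $w_1 = 0.05$) is an arbitrary guess, not the fitted solution:

```python
# Residuals e_i = y_i - (w0 + w1 * x_i) for the marketing data.
# The line (w0 = 1.0, w1 = 0.05) is an arbitrary guess, not the fit.
xs = [120, 125, 140, 110]   # monthly TV advertisement expense
ys = [5, 7, 8, 6]           # monthly sales

w0, w1 = 1.0, 0.05
errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
print(errors)
```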

Finding the Best Line

There are infinitely many possible lines (linear functions) we could draw through our data. How do we find the best one?

🎯

We want to search for the line that makes the smallest total error across all observations. But how do we measure “total error”?

First Attempt: Sum of Errors

Our first instinct might be to simply add up all the errors:

$$\sum_i e_i$$

⚠️ Problem with Simple Sum

This approach has a critical flaw! Consider a line positioned above all our data points: every error would be negative. Moving the line even further up makes the errors more negative, so the sum keeps decreasing — minimizing it would push the line infinitely far from the data.

The problem: Positive and negative errors cancel each other out, so this metric doesn’t properly capture the magnitude of our mistakes.

Better Approaches: Non-Negative Metrics

We need a metric where all errors are non-negative. Two popular choices are:

Sum of Absolute Errors (SAE)

$$\sum_i |e_i|$$

  • Treats all errors equally
  • Same weight for small and large errors
  • More robust to outliers

Sum of Squared Errors (SSE)

$$\sum_i e_i^2 = \sum_i \bigl(y_i - (w_0 + w_1 x_i)\bigr)^2$$

  • Penalizes larger errors more heavily
  • Small errors get small weights
  • Large errors get large weights
  • Mathematically convenient
💡 Why Choose SSE?

We typically use SSE (Sum of Squared Errors) because:

  1. It heavily penalizes large errors (which also makes the model more sensitive to outliers)
  2. It’s mathematically differentiable, making optimization easier
  3. It has nice statistical properties (related to maximum likelihood estimation)

Our optimization metric becomes:

$$\text{SSE} = \sum_i e_i^2 = \sum_i \bigl(y_i - (w_0 + w_1 x_i)\bigr)^2$$
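A tiny sketch makes the difference between the two metrics concrete — note how a single large error dominates SSE while contributing only linearly to SAE:

```python
# Compare the two metrics on the same residuals: the single large
# error (3.0) dominates SSE but contributes only linearly to SAE.
errors = [0.5, -0.5, 3.0]

sae = sum(abs(e) for e in errors)  # 0.5 + 0.5 + 3.0 = 4.0
sse = sum(e ** 2 for e in errors)  # 0.25 + 0.25 + 9.0 = 9.5
print(sae, sse)
```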

Extending to Multiple Variables

So far, we’ve looked at linear relationships with one input variable. But what if our sales depend on multiple factors—TV advertising, radio advertising, and social media spending?

Two Variables: A Flat Plane

When we have two input variables, our linear relationship becomes a flat plane in 3D space.

Two-Variable Linear Function

With two input variables $x_1$ (TV advertising) and $x_2$ (radio advertising):

$$\hat{y} = f(x) = w_0 + w_1 x_1 + w_2 x_2$$

Our error metric becomes:

$$\sum_i e_i^2 = \sum_i \bigl(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2})\bigr)^2$$
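As a sketch, prediction with two inputs is a one-liner; the weights here are arbitrary illustrative values:

```python
# Prediction with two inputs (e.g. TV and radio spend).
# The weights are arbitrary illustrative values, not fitted ones.
def predict2(x1, x2, w0=1.0, w1=0.25, w2=0.5):
    return w0 + w1 * x1 + w2 * x2

print(predict2(100, 40))  # 1.0 + 25.0 + 20.0 = 46.0
```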

General Case: n-Dimensional Space

General Linear Function

With $n-1$ input variables (so $n$ parameters, counting the intercept $w_0$), we can write the linear function as:

$$\hat{y} = f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_{n-1} x_{n-1}$$

And our error metric becomes:

$$\sum_i e_i^2 = \sum_i \bigl(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{n-1} x_{i,n-1})\bigr)^2$$

Matrix Formulation: The Elegant Approach

Writing out all those terms becomes cumbersome. We can simplify everything using matrix notation!

💡 Power of Matrices

Matrix notation isn’t just about making equations look prettier—it allows us to handle thousands of variables with the same simple formula!

We can express our entire problem compactly:

$$\text{SSE} = \|y - XW\|_2^2$$

where:

  • $X \in \mathbb{R}^{m \times n}$ is our input data matrix ($m$ observations, $n$ columns; the first column is all ones, so that $w_0$ acts as the intercept)
  • $y \in \mathbb{R}^m$ is our output data vector
  • $W \in \mathbb{R}^n$ is our weight vector (the parameters to find)
  • $\|\cdot\|_2^2$ is the squared L2 norm (sum of squared elements)
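The matrix form maps directly to code. A minimal sketch with the marketing data, using an arbitrary guess for $W$ and a first column of ones for the intercept:

```python
import numpy as np

# SSE = ||y - XW||_2^2 in matrix form for the marketing data.
# The first column of X is all ones, so w0 acts as the intercept.
# W below is an arbitrary guess, not the optimal weight vector.
X = np.array([[1.0, 120.0],
              [1.0, 125.0],
              [1.0, 140.0],
              [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])
W = np.array([1.0, 0.05])

residuals = y - X @ W
sse = float(residuals @ residuals)  # same as np.linalg.norm(residuals)**2
print(sse)
```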

Solving for the Optimal Weights

Now comes the mathematical derivation to find the weights WW that minimize our error function.

🔍

We’re about to use calculus to find the minimum of our error function. The key insight: at a minimum, the derivative (gradient) equals zero!

Let’s expand the matrix expression:

$$\begin{aligned} f(W) = \|y - XW\|_2^2 &= (y - XW)^T(y - XW) \\ &= (y^T - W^TX^T)(y - XW) \\ &= y^Ty - W^TX^Ty - y^TXW + W^TX^TXW \end{aligned}$$

Since $W^TX^Ty$ and $y^TXW$ are scalars (each is the transpose of the other), they are equal:

$$f(W) = \|y - XW\|_2^2 = y^Ty - 2y^TXW + W^TX^TXW$$

To minimize $f(W)$, we take the gradient with respect to $W$ and set it equal to zero:

$$\nabla f(W) = -2X^Ty + 2X^TXW = 0$$

Solving for $W$:

$$X^TXW = X^Ty$$

Normal Equation

The optimal weight vector is given by the normal equation:

$$W = (X^TX)^{-1}X^Ty$$

This is the closed-form solution for linear regression!
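A minimal sketch of the normal equation on the marketing data; in practice we solve the linear system $X^TXW = X^Ty$ rather than forming the inverse explicitly:

```python
import numpy as np

# Solve the normal equation X^T X W = X^T y for the marketing data.
# Solving the linear system is preferred over computing the inverse.
X = np.array([[1.0, 120.0],    # first column of ones -> intercept w0
              [1.0, 125.0],
              [1.0, 140.0],
              [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])

W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)  # [intercept, slope]
```

For this dataset the optimum works out to an intercept of $-3.4$ and a slope of $0.08$, i.e. each extra unit of advertising expense adds about 0.08 units of sales.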

The Computational Challenge

We’ve found a beautiful formula, but there’s a practical problem lurking beneath.

⚠️ Computational Complexity

Computing the matrix inverse is expensive. The matrix we must invert is $X^TX \in \mathbb{R}^{n \times n}$, so its size grows with the number of features:

  • For 1,000 features: we must invert a $1{,}000 \times 1{,}000$ matrix
  • For 100,000 features: we must invert a $100{,}000 \times 100{,}000$ matrix

Matrix inversion has complexity $O(n^3)$, and forming $X^TX$ costs $O(mn^2)$ for $m$ observations, making this solution impractical for very large problems.

💡 The Solution: Iterative Methods

Instead of computing the inverse directly, we use iterative optimization methods like Gradient Descent. These methods:

  • Take small steps toward the minimum
  • Don’t require matrix inversion
  • Scale much better to large datasets
  • Are the foundation of modern machine learning

Gradient Descent will be explained in detail in our next post!
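As a preview, here is a minimal gradient-descent sketch for the SSE objective, using the gradient $\nabla f(W) = -2X^T(y - XW)$ derived above; the toy data, learning rate, and step count are illustrative choices:

```python
import numpy as np

# Minimal gradient-descent sketch for f(W) = ||y - XW||_2^2.
# Gradient from the derivation above: grad f(W) = -2 X^T (y - XW).
# The data, learning rate, and step count are illustrative choices.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of ones for the intercept
y = np.array([2.0, 4.0, 5.0])

W = np.zeros(2)
lr = 0.02
for _ in range(20000):
    grad = -2.0 * X.T @ (y - X @ W)
    W -= lr * grad           # step against the gradient

print(W)  # converges to the least-squares solution [2/3, 1.5]
```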

Key Takeaways

🎓

What We’ve Learned:

  1. Linear regression finds the best-fitting linear relationship between input and output variables
  2. We measure “best” using Sum of Squared Errors (SSE)
  3. The normal equation $W = (X^TX)^{-1}X^Ty$ gives the optimal solution
  4. For large datasets, we need iterative methods like Gradient Descent
  5. Linear regression extends naturally from one variable to many dimensions

Practice Problems

Level 1:

Given three data points $(1, 2)$, $(2, 4)$, and $(3, 5)$, calculate the errors $e_i$ for each point if our line is $y = 1 + 1.5x$.

Click for hint: Use the formula $e_i = y_i - (w_0 + w_1 x_i)$ where $w_0 = 1$ and $w_1 = 1.5$.

Click for solution

For each point:

  • Point 1: $e_1 = 2 - (1 + 1.5 \cdot 1) = 2 - 2.5 = -0.5$
  • Point 2: $e_2 = 4 - (1 + 1.5 \cdot 2) = 4 - 4 = 0$
  • Point 3: $e_3 = 5 - (1 + 1.5 \cdot 3) = 5 - 5.5 = -0.5$

Sum of Squared Errors: $\text{SSE} = (-0.5)^2 + 0^2 + (-0.5)^2 = 0.5$
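The arithmetic above is easy to check numerically:

```python
# Numerical check of the residuals and SSE in the Level 1 solution.
points = [(1, 2), (2, 4), (3, 5)]
w0, w1 = 1.0, 1.5

errors = [y - (w0 + w1 * x) for x, y in points]
sse = sum(e ** 2 for e in errors)
print(errors, sse)  # [-0.5, 0.0, -0.5] 0.5
```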

Level 2:

Explain why we square the errors in SSE instead of just taking the absolute value. What are the trade-offs?

Click for hint: Think about how different error magnitudes are weighted, and consider the mathematical properties needed for optimization.

Click for solution

Advantages of SSE (squared errors):

  • Heavily penalizes large errors (outliers have quadratic impact)
  • Differentiable everywhere (smooth optimization)
  • Related to Gaussian assumptions in statistics
  • Unique global minimum (convex function)

Advantages of SAE (absolute errors):

  • More robust to outliers
  • Treats all errors equally
  • Better for data with heavy-tailed distributions

Trade-off: SSE is mathematically convenient and commonly used, but SAE is more robust when you have outliers or don’t want to over-penalize large errors.


What’s Next?

In our next post, we’ll dive deep into Gradient Descent—the iterative optimization method that makes linear regression practical for massive datasets and forms the foundation for training neural networks!