You run marketing for a growing company. Every month, you decide how much to spend on TV ads.
But here's the problem...
💰 The Big Question:
"How much should I spend on advertising next month to hit my $10 million sales target?"
Data from the past
Predict the future
Make decisions
"If we can find a mathematical relationship between our inputs (ad spending) and outputs (sales), we can predict the future."
This is the foundation of linear regression.
| Month | TV Ad Budget ($1000s) | Sales ($ millions) |
|---|---|---|
| January | 120 | 5 |
| February | 125 | 7 |
| March | 140 | 8 |
| April | 110 | 6 |
Pattern emerging? More advertising → More sales
But can we quantify this relationship? 🤔
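To make the pattern concrete, we can load the four months from the table above into NumPy and measure the correlation (a quick sketch; the arrays are just the table's columns):

```python
import numpy as np

# The four months from the table above
ad_budget = np.array([120.0, 125.0, 140.0, 110.0])  # TV ad budget ($1000s)
sales = np.array([5.0, 7.0, 8.0, 6.0])              # sales ($ millions)

# Pearson correlation quantifies the "more ads -> more sales" pattern
r = np.corrcoef(ad_budget, sales)[0, 1]
print(round(r, 3))  # 0.775
```

A correlation of about 0.77 confirms a strong positive relationship, but correlation alone doesn't give us a prediction rule. For that, we need the line itself.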
A linear model of this relationship takes the form:
$\hat{y} = f(x) = w_0 + w_1x$
$w_0$ = intercept
Where line crosses y-axis
$w_1$ = slope
How steep the line is
Real-world data is messy. Sales don't perfectly follow a line because of random fluctuations, unmeasured factors (competitors, seasonality, the economy), and measurement noise.
Our model won't be perfect, and that's okay! 🎯
We need to account for uncertainty...
We add a random variable to capture uncertainty:
$\varepsilon$ (epsilon) = the error term (its observed, per-point estimate is called a residual)
Captures everything our model doesn't explain
💡 Key Insight:
We're looking for useful approximations, not perfect predictions. By acknowledging error, we build more realistic models!
For each data point $(x_i, y_i)$, the error is:
$e_i = y_i - (w_0 + w_1x_i)$
$y_i$ = actual value (true sales)
$(w_0 + w_1x_i)$ = predicted value
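These per-point errors are easy to compute for any candidate line. A sketch using the table's data, with arbitrarily chosen (not fitted) values for $w_0$ and $w_1$:

```python
import numpy as np

# Table data plus a candidate line; w0 and w1 are illustrative, not fitted
x = np.array([120.0, 125.0, 140.0, 110.0])
y = np.array([5.0, 7.0, 8.0, 6.0])
w0, w1 = 0.0, 0.05

y_hat = w0 + w1 * x  # predicted sales for each month
errors = y - y_hat   # e_i = y_i - (w0 + w1 * x_i)
print(errors)        # one error per month: -1, 0.75, 1, 0.5
```

Some errors are positive (we underestimated), some negative (we overestimated). That sign difference is exactly what causes trouble next.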
There are infinite possible lines through our data. Which one is best?
Find the line that makes the smallest total error across all data points.
But how do we measure "total error"? 🤔
⚠️ The Challenge:
We need a metric that:
✓ Captures error magnitude
✓ Doesn't let errors cancel out
✓ Can be minimized mathematically
Sum all the errors:
$\sum e_i = \sum (y_i - (w_0 + w_1x_i))$
Problem: Positive and negative errors cancel out!
Example:
• Point 1: error = +5 (underestimated)
• Point 2: error = -5 (overestimated)
• Total: +5 + (-5) = 0 → "Perfect"?
But both predictions were wrong by 5 units!
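The cancellation is easy to see in a few lines of Python (the +5/-5 errors are the example above):

```python
# One underestimate and one overestimate, both off by 5 units
errors = [5.0, -5.0]

print(sum(errors))                  # 0.0  -> raw sum hides the problem
print(sum(abs(e) for e in errors))  # 10.0 -> absolute errors don't cancel
print(sum(e * e for e in errors))   # 50.0 -> squared errors don't either
```

Both the absolute and the squared versions keep every error non-negative, which leads directly to the two candidate metrics below.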
The Takeaway:
We need a metric where all errors are non-negative so they can't cancel out.
$\sum |e_i|$
Pros:
✓ All errors positive
✓ Equal weight to all errors
✓ Robust to outliers
Cons:
✗ Not differentiable at zero
✗ Harder to optimize
$\sum e_i^2$
Pros:
✓ All errors positive
✓ Heavily penalizes large errors
✓ Differentiable everywhere
✓ Nice statistical properties
Winner!
Used in linear regression
$\text{SSE} = \sum e_i^2 = \sum (y_i - (w_0 + w_1x_i))^2$
Square each error term, then sum them all!
Small errors
Get small penalties
$1^2 = 1$
Large errors
Get huge penalties
$10^2 = 100$
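Putting this together, SSE gives us a single number for comparing candidate lines. A sketch with the table's data and two hypothetical weight pairs:

```python
import numpy as np

x = np.array([120.0, 125.0, 140.0, 110.0])
y = np.array([5.0, 7.0, 8.0, 6.0])

def sse(w0, w1):
    """Sum of squared errors for the line y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Compare two hypothetical lines: lower SSE = better fit
print(round(sse(0.0, 0.05), 4), round(sse(-3.0, 0.08), 4))  # 2.8125 2.64
```

The second line has a lower SSE (2.64 vs 2.8125), so it fits these four points better. Finding the *best* line means minimizing this function over all $(w_0, w_1)$.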
Sales might depend on multiple factors:
TV Ads
$x_1$
Radio Ads
$x_2$
Social Media
$x_3$
$\hat{y} = w_0 + w_1x_1 + w_2x_2$
Forms a flat plane in 3D space
For $n$ input variables:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \cdots + w_{n-1}x_{n-1}$
$\sum e_i^2 = \sum (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$
This notation gets messy fast! 😰
Solution: Use matrices! ➡️
$\sum_{i=1}^{m} (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$
Ugly! Hard to work with! 😫
$\text{SSE} = \|y - XW\|_2^2$
Beautiful! Works for any dimension! ✨
$\text{SSE} = \|y - XW\|_2^2$
$X \in \mathbb{R}^{m \times n}$
Input data matrix
$m$ = observations
$n$ = features
$y \in \mathbb{R}^m$
Output vector
$m$ = observations
Actual values
$W \in \mathbb{R}^n$
Weight vector
$n$ = features
Parameters to find
$\|\cdot\|_2^2$ = Squared L2 Norm
Fancy name for: "square each element, then sum"
$\|v\|_2^2 = v_1^2 + v_2^2 + \cdots + v_n^2$
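We can verify that the matrix form is exactly the same SSE as the element-wise sum. A sketch assuming the intercept is folded into $X$ as a leading column of ones (which is how the $W \in \mathbb{R}^n$ convention above works):

```python
import numpy as np

# Design matrix: a leading column of ones absorbs the intercept w0
X = np.array([[1.0, 120.0],
              [1.0, 125.0],
              [1.0, 140.0],
              [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])
W = np.array([-3.0, 0.08])  # [w0, w1], illustrative values

sse_matrix = np.linalg.norm(y - X @ W) ** 2                    # ||y - XW||_2^2
sse_loop = sum((y[i] - X[i] @ W) ** 2 for i in range(len(y)))  # element-wise
print(np.isclose(sse_matrix, sse_loop))  # True
```

Same number, but the matrix form works unchanged for any number of features.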
Use calculus to minimize the error function!
Key insight from calculus:
"At a minimum, the derivative (gradient) equals zero"
Just like finding the bottom of a valley! 🏔️
Start with:
$f(W) = \|y - XW\|_2^2$
Rewrite as dot product:
$f(W) = (y - XW)^T(y - XW)$
Distribute transpose:
$= (y^T - W^TX^T)(y - XW)$
FOIL (multiply out):
$= y^Ty - W^TX^Ty - y^TXW + W^TX^TXW$
Since $W^TX^Ty$ and $y^TXW$ are scalars and equal:
$f(W) = y^Ty - 2y^TXW + W^TX^TXW$
$f(W) = y^Ty - 2y^TXW + W^TX^TXW$
Apply derivative rules:
โข $\nabla_W(y^Ty) = 0$ (constant)
โข $\nabla_W(-2y^TXW) = -2X^Ty$ (linear)
โข $\nabla_W(W^TX^TXW) = 2X^TXW$ (quadratic)
Result:
$\nabla f(W) = -2X^Ty + 2X^TXW$
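Before trusting the algebra, we can check the gradient formula numerically with central finite differences (a sketch; $X$, $y$, and $W$ are small illustrative values):

```python
import numpy as np

X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])
W = np.array([-3.0, 0.08])

def f(W):
    r = y - X @ W
    return r @ r  # ||y - XW||_2^2

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ W

# Independent check: central finite differences along each coordinate
eps = 1e-6
grad_numeric = np.array([(f(W + eps * e) - f(W - eps * e)) / (2 * eps)
                         for e in np.eye(len(W))])
print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```

The two gradients agree, so we can safely set the analytic one to zero and solve.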
$\nabla f(W) = -2X^Ty + 2X^TXW = 0$
↓ Divide by 2
$-X^Ty + X^TXW = 0$
↓ Add $X^Ty$ to both sides
$X^TXW = X^Ty$
$W = (X^TX)^{-1}X^Ty$
Closed-form solution for linear regression!
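In code, the normal equation is a few lines. A sketch fitting the table's four months; note that solving the linear system $X^TXW = X^Ty$ with `np.linalg.solve` is more numerically stable than computing the inverse explicitly:

```python
import numpy as np

# Design matrix with a column of ones for the intercept
X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])

# Solve X^T X W = X^T y instead of forming (X^T X)^{-1} directly
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)  # [w0, w1] = [-3.4, 0.08]
```

For this data the fitted line is $\hat{y} = -3.4 + 0.08x$: each extra \$1000 of TV ads is associated with about \$80,000 more sales, and hitting the \$10M target would suggest $x = (10 + 3.4)/0.08 = 167.5$, i.e., roughly \$167,500 of ad spend.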
$W = (X^TX)^{-1}X^Ty$
Looks elegant, but...
1,000 features → invert the $1000 \times 1000$ matrix $X^TX$
Time complexity: $O(n^3)$ in the number of features, plus $O(mn^2)$ just to form $X^TX$ 😱
100,000 features → invert a $100{,}000 \times 100{,}000$ matrix
Takes hours or crashes! 💥
Instead of computing the inverse directly, take small steps toward the minimum!
🎯 Advantages:
✓ No matrix inversion needed
✓ Scales to millions of data points
✓ Foundation of modern ML
✓ Works for non-linear models too!
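A minimal gradient-descent sketch on the same four data points (the learning rate and iteration count are illustrative; the feature is standardized so a fixed learning rate converges):

```python
import numpy as np

X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])

# Standardize the feature column so a fixed learning rate converges
mu, sigma = X[:, 1].mean(), X[:, 1].std()
X_s = X.copy()
X_s[:, 1] = (X[:, 1] - mu) / sigma

W = np.zeros(2)
lr = 0.01                    # learning rate (illustrative)
for _ in range(5000):        # iteration count (illustrative)
    grad = -2 * X_s.T @ (y - X_s @ W)  # gradient from the derivation
    W -= lr * grad                     # small step downhill

# Undo the standardization to recover weights on the original scale
w1 = W[1] / sigma
w0 = W[0] - w1 * mu
print(round(w0, 3), round(w1, 3))  # matches the normal equation
```

After enough steps the iterates converge to the same $w_0 = -3.4$, $w_1 = 0.08$ that the closed-form solution gives, with no matrix inversion anywhere.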
$W = (X^TX)^{-1}X^Ty$
Pros:
✓ Direct solution
✓ No hyperparameters
✓ Exact answer
Cons:
✗ $O(n^3)$ complexity
✗ Doesn't scale
✗ Memory intensive
Best for:
Small problems (modest data, few features)
Iterative updates
Pros:
✓ Scales to huge data
✓ Low memory
✓ Generalizes to non-linear models
Cons:
✗ Needs tuning
✗ Approximate solution
✗ Takes many iterations
Best for:
Large datasets, deep learning
Find a linear relationship between inputs and outputs to predict future values
Add $\varepsilon$ to account for randomness and unmeasured factors
Use Sum of Squared Errors (SSE) to measure fit quality
Normal Equation (small data) or Gradient Descent (large data)
Linear Model
$y = w_0 + w_1x + \varepsilon$
Matrix Form
$\text{SSE} = \|y - XW\|_2^2$
Normal Equation
$W = (X^TX)^{-1}X^Ty$
Next Topic
Gradient Descent! 🚀
Now that we understand linear regression, we need to learn how to actually solve it for large datasets.
📚 We'll explore:
"The journey from theory to practice begins with optimization!"
- QuiverLearn