Understanding Linear Regression: From Marketing to Mathematics

Opening: The Marketing Challenge

The Marketing Manager's Question

You run marketing for a growing company. Every month, you decide how much to spend on TV ads.

But here's the problem...

💰 The Big Question:

"How much should I spend on advertising next month to hit my $10 million sales target?"

📊

Data from the past

🔮

Predict the future

✅

Make decisions

Theme: Finding Patterns in Data

"If we can find a mathematical relationship between our inputs (ad spending) and outputs (sales), we can predict the future."

This is the foundation of linear regression.

Setup: The Historical Data

Let's Look at Our Data

Marketing Dataset: TV Ads vs. Sales
Month      TV Ad Budget ($1000s)   Sales ($ millions)
January    120                     5
February   125                     7
March      140                     8
April      110                     6

Pattern emerging? More advertising → More sales

But can we quantify this relationship? 🤔
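As a quick sanity check before fitting anything, we can compute the Pearson correlation between budget and sales for the four months above. This is a plain-Python sketch; the `pearson` helper is our own, not part of any source code:

```python
# Toy data from the table above
ad_budget = [120, 125, 140, 110]   # TV ad budget ($1000s)
sales = [5, 7, 8, 6]               # Sales ($ millions)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(round(pearson(ad_budget, sales), 2))  # 0.77 -> fairly strong positive relationship
```

A correlation near +1 suggests a linear model is a reasonable thing to try.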

Setup: What is a Linear Function?

The Foundation: Understanding Linear Functions

Definition

A linear function creates:

  • A straight line in 2D space
  • A flat plane in 3D space
  • A hyperplane in higher dimensions
The Mathematical Form

$\hat{y} = f(x) = w_0 + w_1x$

$w_0$ = intercept

Where line crosses y-axis

$w_1$ = slope

How steep the line is
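Concretely, the model above is one multiply and one add per prediction. A minimal sketch, with made-up weights for illustration (not fitted values):

```python
def predict(x, w0, w1):
    """Linear prediction: y_hat = w0 + w1 * x."""
    return w0 + w1 * x

# Illustrative weights: intercept 0.5, slope 0.05 per $1000 of ad budget
print(predict(120, 0.5, 0.05))  # 6.5
```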

Catalyst: The Reality Check

Wait... There's a Problem!

Real-world data is messy. Sales don't perfectly follow a line because:

  • 🌦️ Seasonality (holidays, weather)
  • 🏢 Competitor actions
  • 📉 Economic conditions
  • 🎲 Random chance
  • 📱 Other marketing channels
  • 👥 Word-of-mouth effects

Our model won't be perfect, and that's okay! 🎯

We need to account for uncertainty...

Adding Randomness: The Error Term

The Stochastic Nature of Data

Modified Linear Function

We add a random variable to capture uncertainty:

$y = w_0 + w_1x + \varepsilon$

(The prediction itself is still $\hat{y} = w_0 + w_1x$; $\varepsilon$ is the gap between the actual $y$ and that prediction.)

$\varepsilon$ (epsilon) = error term or residual

Captures everything our model doesn't explain

💡 Key Insight:

We're looking for useful approximations, not perfect predictions. By acknowledging error, we build more realistic models!

Understanding Error

What is Error (Residual)?

Definition

For each data point $(x_i, y_i)$, the error is:

$e_i = y_i - (w_0 + w_1x_i)$

$y_i$

Actual value (true sales)

$(w_0 + w_1x_i)$

Predicted value

Interpretation
  • Positive error: We underestimated (predicted too low)
  • Negative error: We overestimated (predicted too high)
  • Zero error: Perfect prediction! (rare)
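For a concrete instance of the definition above, take January's data point and the same illustrative (not fitted) weights $w_0 = 0.5$, $w_1 = 0.05$:

```python
# Residual e_i = y_i - (w0 + w1 * x_i) for one data point
x_i, y_i = 120, 5            # January: budget 120 ($1000s), actual sales 5 ($M)
y_pred = 0.5 + 0.05 * x_i    # predicted sales: 6.5
e_i = y_i - y_pred
print(e_i)  # -1.5 -> negative error: we overestimated
```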

Break into Two: The Optimization Problem

Finding the Best Line

There are infinitely many possible lines through our data. Which one is best?

Our Goal

Find the line that makes the smallest total error across all data points.

But how do we measure "total error"? 🤔

⚠️ The Challenge:

We need a metric that:

✓ Captures error magnitude
✓ Doesn't let errors cancel out
✓ Can be minimized mathematically

First Attempt: Sum of Errors (FAILS!)

Naive Approach: Just Add Them Up

First Idea

Sum all the errors:

$\sum e_i = \sum (y_i - (w_0 + w_1x_i))$

โŒ Fatal Flaw!

Problem: Positive and negative errors cancel out!

Example:
โ€ข Point 1: error = +5 (underestimated)
โ€ข Point 2: error = -5 (overestimated)
โ€ข Total: +5 + (-5) = 0 โœ“ "Perfect"?

But both predictions were wrong by 5 units!

The Takeaway:

We need a metric where all errors are non-negative so they can't cancel out.

Two Better Metrics

Non-Negative Error Metrics

Sum of Absolute Errors

$\sum |e_i|$

Pros:

✓ All errors positive
✓ Equal weight to all errors
✓ Robust to outliers

Cons:

✗ Not differentiable at zero
✗ Harder to optimize

Sum of Squared Errors ⭐

$\sum e_i^2$

Pros:

✓ All errors positive
✓ Heavily penalizes large errors
✓ Differentiable everywhere
✓ Nice statistical properties

Winner!

Used in linear regression

SSE: Our Optimization Metric

Sum of Squared Errors (SSE)

The Formula

$\text{SSE} = \sum e_i^2 = \sum (y_i - (w_0 + w_1x_i))^2$

Take each residual, square it, then sum over all data points!

Why Squaring Works

Small errors

Get small penalties

$1^2 = 1$

Large errors

Get huge penalties

$10^2 = 100$
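Putting the formula to work on the four months of data, here is a minimal SSE helper (the candidate weights are still illustrative, not optimal):

```python
data = [(120, 5), (125, 7), (140, 8), (110, 6)]  # (budget in $1000s, sales in $M)

def sse(w0, w1, data):
    """Sum of squared errors for a candidate line y = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in data)

print(sse(0.5, 0.05, data))                          # 2.5625 for this candidate
print(sse(0.5, 0.05, data) < sse(0.0, 0.05, data))   # True: the first line fits better
```

Comparing SSE across candidate lines is exactly how "best line" gets an operational meaning.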

Extending to Multiple Variables

What About Multiple Inputs?

Real-World Scenario

Sales might depend on multiple factors:

📺

TV Ads

$x_1$

📻

Radio Ads

$x_2$

📱

Social Media

$x_3$

Two Variables: A Plane

$\hat{y} = w_0 + w_1x_1 + w_2x_2$

Forms a flat plane in 3D space
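A sketch of the two-variable case, again with made-up (not fitted) weights:

```python
def predict2(x1, x2, w0=1.0, w1=0.04, w2=0.02):
    """Plane: sales from TV budget x1 and radio budget x2 (weights illustrative)."""
    return w0 + w1 * x1 + w2 * x2

print(predict2(120, 50))  # 1.0 + 4.8 + 1.0 = 6.8
```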

General Form: n Variables

Generalizing to n Dimensions

General Linear Function

For $n$ input variables:

$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \cdots + w_{n-1}x_{n-1}$

Error Metric (SSE)

$\sum e_i^2 = \sum (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$

This notation gets messy fast! 😰

Solution: Use matrices! ➡️

Midpoint: The Matrix Insight

The Power of Matrix Notation

Before: Messy Summations

$\sum_{i=1}^{m} (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$

Ugly! Hard to work with! 😫

After: Elegant Matrix Form

$\text{SSE} = \|y - XW\|_2^2$

Beautiful! Works for any dimension! ✨

Matrix Formulation

Matrix Notation Explained

The Setup

$\text{SSE} = \|y - XW\|_2^2$

$X \in \mathbb{R}^{m \times n}$

Input data matrix (a leading column of 1s carries the intercept $w_0$)
$m$ = observations
$n$ = features (including that constant column)

$y \in \mathbb{R}^m$

Output vector
$m$ = observations
Actual values

$W \in \mathbb{R}^n$

Weight vector
$n$ = features
Parameters to find

$\|\cdot\|_2^2$ = Squared L2 Norm

Fancy name for: "square each element, then sum"
$\|v\|_2^2 = v_1^2 + v_2^2 + \cdots + v_n^2$
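The matrix form and the summation compute the same number. A NumPy sketch on the four data points from earlier, with a column of ones for the intercept and illustrative weights:

```python
import numpy as np

X = np.array([[1, 120], [1, 125], [1, 140], [1, 110]], dtype=float)  # 1s = intercept column
y = np.array([5, 7, 8, 6], dtype=float)
W = np.array([0.5, 0.05])  # illustrative weights [w0, w1]

residual = y - X @ W
sse = np.sum(residual ** 2)   # squared L2 norm ||y - XW||_2^2
print(sse)                    # 2.5625, same as summing (y_i - (w0 + w1*x_i))^2
```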

Solving: The Calculus Approach

Finding the Optimal Weights

The Strategy

Use calculus to minimize the error function!

Key insight from calculus:

"At a minimum, the derivative (gradient) equals zero"

Just like finding the bottom of a valley! 🏔️

The Plan
1️⃣ Expand $f(W) = \|y - XW\|_2^2$
2️⃣ Take gradient: $\nabla f(W)$
3️⃣ Set equal to zero: $\nabla f(W) = 0$
4️⃣ Solve for $W$

Derivation Step 1: Expand

Expanding the Matrix Expression

Step 1: Use Matrix Properties

Start with:
$f(W) = \|y - XW\|_2^2$

Rewrite as dot product:
$f(W) = (y - XW)^T(y - XW)$

Distribute transpose:
$= (y^T - W^TX^T)(y - XW)$

FOIL (multiply out):
$= y^Ty - W^TX^Ty - y^TXW + W^TX^TXW$

Step 2: Simplify

Since $W^TX^Ty$ and $y^TXW$ are scalars and equal:

$f(W) = y^Ty - 2y^TXW + W^TX^TXW$

Derivation Step 2: Gradient

Taking the Gradient

Recall Our Function

$f(W) = y^Ty - 2y^TXW + W^TX^TXW$

Take Gradient w.r.t. $W$

Apply derivative rules:

• $\nabla_W(y^Ty) = 0$ (constant)

• $\nabla_W(-2y^TXW) = -2X^Ty$ (linear)

• $\nabla_W(W^TX^TXW) = 2X^TXW$ (quadratic)

Result:

$\nabla f(W) = -2X^Ty + 2X^TXW$
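We can sanity-check the analytic gradient numerically: central finite differences on $f(W) = \|y - XW\|_2^2$ should agree with $-2X^Ty + 2X^TXW$. A NumPy sketch on the toy data (the test point $W$ is arbitrary):

```python
import numpy as np

X = np.array([[1, 120], [1, 125], [1, 140], [1, 110]], dtype=float)
y = np.array([5, 7, 8, 6], dtype=float)
W = np.array([0.5, 0.05])   # arbitrary point to check the gradient at

f = lambda W: np.sum((y - X @ W) ** 2)
analytic = -2 * X.T @ y + 2 * X.T @ X @ W

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(f(W + eps * e) - f(W - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```

Since $f$ is quadratic, the central difference has no truncation error here, only floating-point noise.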

Derivation Step 3: Solve

Solving for Optimal $W$

Set Gradient to Zero

$\nabla f(W) = -2X^Ty + 2X^TXW = 0$

↓ Divide by 2

$-X^Ty + X^TXW = 0$

↓ Add $X^Ty$ to both sides

$X^TXW = X^Ty$

The Normal Equation ⭐

$W = (X^TX)^{-1}X^Ty$

Closed-form solution for linear regression!
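A sketch of the normal equation on the toy data. Note it solves the linear system $X^TX\,W = X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, which is numerically safer and gives the same $W$:

```python
import numpy as np

X = np.array([[1, 120], [1, 125], [1, 140], [1, 110]], dtype=float)  # 1s = intercept
y = np.array([5, 7, 8, 6], dtype=float)

# Normal equation: (X^T X) W = X^T y
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)  # [-3.4, 0.08]: sales ≈ -3.4 + 0.08 * budget
```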

All Is Lost: Computational Challenge

Wait... There's a Problem!

The Normal Equation

$W = (X^TX)^{-1}X^Ty$

Looks elegant, but...

โš ๏ธ Matrix Inversion is EXPENSIVE

1,000 observations โ†’ Invert $1000 \times 1000$ matrix

Time complexity: $O(n^3)$ ๐ŸŒ

100,000 observations โ†’ Invert $100,000 \times 100,000$ matrix

Takes hours or crashes! ๐Ÿ’ฅ

Break into Three: The Solution

💡 Iterative Optimization!

The Alternative: Gradient Descent

Instead of computing the inverse directly, take small steps toward the minimum!

How It Works
1️⃣ Start with random weights
2️⃣ Calculate gradient (direction of steepest increase)
3️⃣ Move in opposite direction (downhill)
4️⃣ Repeat until we reach the minimum

🎯 Advantages:

✓ No matrix inversion needed
✓ Scales to millions of data points
✓ Foundation of modern ML
✓ Works for non-linear models too!
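The four steps above can be sketched in a few lines of NumPy on the toy data. One practical wrinkle worth flagging: gradient descent is sensitive to feature scale, so the budget is rescaled here from $1000s to $100,000s so the intercept and slope converge at similar rates (the learning rate and iteration count are illustrative, not tuned values from the source):

```python
import numpy as np

# Budgets rescaled to $100,000s so both weights converge together
X = np.array([[1, 1.20], [1, 1.25], [1, 1.40], [1, 1.10]])
y = np.array([5.0, 7.0, 8.0, 6.0])

W = np.zeros(2)                       # 1. start from arbitrary weights
lr = 0.05                             # learning rate (illustrative; needs tuning)
for _ in range(10000):
    grad = 2 * X.T @ (X @ W - y)      # 2. gradient of ||y - XW||^2
    W -= lr * grad                    # 3. step downhill; 4. repeat
print(np.round(W, 2))                 # close to the exact solution [-3.4, 8.0]
```

In the rescaled units the exact normal-equation answer is $w_0 = -3.4$, $w_1 = 8.0$, and the loop converges to it without ever inverting a matrix.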

Finale: Comparison

Two Ways to Solve Linear Regression

Normal Equation

$W = (X^TX)^{-1}X^Ty$

Pros:
✓ Direct solution
✓ No hyperparameters
✓ Exact answer

Cons:
✗ $O(n^3)$ complexity
✗ Doesn't scale
✗ Memory intensive

Best for:

Small datasets with few features

Gradient Descent ⭐

Iterative updates

Pros:
✓ Scales to huge data
✓ Low memory
✓ Generalizes to non-linear

Cons:
✗ Needs tuning
✗ Approximate solution
✗ Takes iterations

Best for:

Large datasets, deep learning

Final Image: Key Takeaways

Linear Regression: The Complete Picture

The Problem 🎯

Find a linear relationship between inputs and outputs to predict future values

The Error Term 🎲

Add $\varepsilon$ to account for randomness and unmeasured factors

The Metric 📊

Use Sum of Squared Errors (SSE) to measure fit quality

The Solution 🔧

Normal Equation (small data) or Gradient Descent (large data)

Key Formulas

Linear Model

$y = w_0 + w_1x + \varepsilon$

Matrix Form

$\text{SSE} = \|y - XW\|_2^2$

Normal Equation

$W = (X^TX)^{-1}X^Ty$

Next Topic

Gradient Descent! 🚀

What's Next?

Coming Up Next: Gradient Descent!

Now that we understand linear regression, we need to learn how to actually solve it for large datasets.

๐Ÿ” We'll explore:

  • How gradient descent takes steps toward the minimum
  • Learning rates and convergence
  • Batch vs. stochastic gradient descent
  • Why it's the backbone of deep learning

"The journey from theory to practice begins with optimization!"

- QuiverLearn