You run marketing for a growing company. Every month, you decide how much to spend on TV ads.
But here's the problem...
💰 The Big Question:
"How much should I spend on advertising next month to hit my $10 million sales target?"
Data from the past
Predict the future
Make decisions
"If we can find a mathematical relationship between our inputs (ad spending) and outputs (sales), we can predict the future."
This is the foundation of linear regression.
| Month | TV Ad Budget ($1000s) | Sales ($ millions) |
|---|---|---|
| January | 120 | 5 |
| February | 125 | 7 |
| March | 140 | 8 |
| April | 110 | 6 |
Pattern emerging? More advertising → More sales
But can we quantify this relationship? 🤔
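To make the pattern concrete, we can load the four months from the table above into NumPy and measure the correlation (a quick sketch; the arrays are just the table's columns):

```python
import numpy as np

# The four months from the table above
ad_budget = np.array([120.0, 125.0, 140.0, 110.0])  # TV ad budget ($1000s)
sales = np.array([5.0, 7.0, 8.0, 6.0])              # sales ($ millions)

# Pearson correlation quantifies the "more ads -> more sales" pattern
r = np.corrcoef(ad_budget, sales)[0, 1]
print(round(r, 3))  # 0.775
```

A correlation of about 0.77 confirms a strong positive relationship, but correlation alone doesn't give us a prediction rule. For that, we need the line itself.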
A linear model of this relationship takes the form:
$\hat{y} = f(x) = w_0 + w_1x$
$w_0$ = intercept
Where line crosses y-axis
$w_1$ = slope
How steep the line is
Real-world data is messy. Sales don't perfectly follow a line because of random fluctuations, unmeasured factors (competitors, seasonality, the economy), and measurement noise.
Our model won't be perfect, and that's okay! 🎯
We need to account for uncertainty...
We add a random variable to capture uncertainty:
$\varepsilon$ (epsilon) = the error term (its observed, per-point estimate is called a residual)
Captures everything our model doesn't explain
💡 Key Insight:
We're looking for useful approximations, not perfect predictions. By acknowledging error, we build more realistic models!
For each data point $(x_i, y_i)$, the error is:
$e_i = y_i - (w_0 + w_1x_i)$
$y_i$ = actual value (true sales)
$(w_0 + w_1x_i)$ = predicted value
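These per-point errors are easy to compute for any candidate line. A sketch using the table's data, with arbitrarily chosen (not fitted) values for $w_0$ and $w_1$:

```python
import numpy as np

# Table data plus a candidate line; w0 and w1 are illustrative, not fitted
x = np.array([120.0, 125.0, 140.0, 110.0])
y = np.array([5.0, 7.0, 8.0, 6.0])
w0, w1 = 0.0, 0.05

y_hat = w0 + w1 * x  # predicted sales for each month
errors = y - y_hat   # e_i = y_i - (w0 + w1 * x_i)
print(errors)        # one error per month: -1, 0.75, 1, 0.5
```

Some errors are positive (we underestimated), some negative (we overestimated). That sign difference is exactly what causes trouble next.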
There are infinite possible lines through our data. Which one is best?
Find the line that makes the smallest total error across all data points.
But how do we measure "total error"? 🤔
⚠️ The Challenge:
We need a metric that:
✓ Captures error magnitude
✓ Doesn't let errors cancel out
✓ Can be minimized mathematically
Sum all the errors:
$\sum e_i = \sum (y_i - (w_0 + w_1x_i))$
Problem: Positive and negative errors cancel out!
Example:
• Point 1: error = +5 (underestimated)
• Point 2: error = -5 (overestimated)
• Total: +5 + (-5) = 0 → "Perfect"?
But both predictions were wrong by 5 units!
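The cancellation is easy to see in a few lines of Python (the +5/-5 errors are the example above):

```python
# One underestimate and one overestimate, both off by 5 units
errors = [5.0, -5.0]

print(sum(errors))                  # 0.0  -> raw sum hides the problem
print(sum(abs(e) for e in errors))  # 10.0 -> absolute errors don't cancel
print(sum(e * e for e in errors))   # 50.0 -> squared errors don't either
```

Both the absolute and the squared versions keep every error non-negative, which leads directly to the two candidate metrics below.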
The Takeaway:
We need a metric where all errors are non-negative so they can't cancel out.
$\sum |e_i|$
Pros:
✓ All errors positive
✓ Equal weight to all errors
✓ Robust to outliers
Cons:
✗ Not differentiable at zero
✗ Harder to optimize
$\sum e_i^2$
Pros:
✓ All errors positive
✓ Heavily penalizes large errors
✓ Differentiable everywhere
✓ Nice statistical properties
Winner!
Used in linear regression
$\text{SSE} = \sum e_i^2 = \sum (y_i - (w_0 + w_1x_i))^2$
Square each error term, then sum them all!
Small errors
Get small penalties
$1^2 = 1$
Large errors
Get huge penalties
$10^2 = 100$
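Putting this together, SSE gives us a single number for comparing candidate lines. A sketch with the table's data and two hypothetical weight pairs:

```python
import numpy as np

x = np.array([120.0, 125.0, 140.0, 110.0])
y = np.array([5.0, 7.0, 8.0, 6.0])

def sse(w0, w1):
    """Sum of squared errors for the line y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Compare two hypothetical lines: lower SSE = better fit
print(round(sse(0.0, 0.05), 4), round(sse(-3.0, 0.08), 4))  # 2.8125 2.64
```

The second line has a lower SSE (2.64 vs 2.8125), so it fits these four points better. Finding the *best* line means minimizing this function over all $(w_0, w_1)$.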
Sales might depend on multiple factors:
TV Ads
$x_1$
Radio Ads
$x_2$
Social Media
$x_3$
$\hat{y} = w_0 + w_1x_1 + w_2x_2$
Forms a flat plane in 3D space
For $n$ input variables:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \cdots + w_{n-1}x_{n-1}$
$\sum e_i^2 = \sum (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$
This notation gets messy fast! 😰
Solution: Use matrices! ➡️
$\sum_{i=1}^{m} (y_i - (w_0 + w_1x_{i1} + w_2x_{i2} + \cdots + w_{n-1}x_{i,n-1}))^2$
Ugly! Hard to work with! 😫
$\text{SSE} = \|y - XW\|_2^2$
Beautiful! Works for any dimension! ✨
$\text{SSE} = \|y - XW\|_2^2$
$X \in \mathbb{R}^{m \times n}$
Input data matrix
$m$ = observations
$n$ = features
$y \in \mathbb{R}^m$
Output vector
$m$ = observations
Actual values
$W \in \mathbb{R}^n$
Weight vector
$n$ = features
Parameters to find
$\|\cdot\|_2^2$ = Squared L2 Norm
Fancy name for: "square each element, then sum"
$\|v\|_2^2 = v_1^2 + v_2^2 + \cdots + v_n^2$
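We can verify that the matrix form is exactly the same SSE as the element-wise sum. A sketch assuming the intercept is folded into $X$ as a leading column of ones (which is how the $W \in \mathbb{R}^n$ convention above works):

```python
import numpy as np

# Design matrix: a leading column of ones absorbs the intercept w0
X = np.array([[1.0, 120.0],
              [1.0, 125.0],
              [1.0, 140.0],
              [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])
W = np.array([-3.0, 0.08])  # [w0, w1], illustrative values

sse_matrix = np.linalg.norm(y - X @ W) ** 2                    # ||y - XW||_2^2
sse_loop = sum((y[i] - X[i] @ W) ** 2 for i in range(len(y)))  # element-wise
print(np.isclose(sse_matrix, sse_loop))  # True
```

Same number, but the matrix form works unchanged for any number of features.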
Use calculus to minimize the error function!
Key insight from calculus:
"At a minimum, the derivative (gradient) equals zero"
Just like finding the bottom of a valley! 🏔️
Start with:
$f(W) = \|y - XW\|_2^2$
Rewrite as dot product:
$f(W) = (y - XW)^T(y - XW)$
Distribute transpose:
$= (y^T - W^TX^T)(y - XW)$
FOIL (multiply out):
$= y^Ty - W^TX^Ty - y^TXW + W^TX^TXW$
Since $W^TX^Ty$ and $y^TXW$ are scalars and equal:
$f(W) = y^Ty - 2y^TXW + W^TX^TXW$
$f(W) = y^Ty - 2y^TXW + W^TX^TXW$
Apply derivative rules:
โข $\nabla_W(y^Ty) = 0$ (constant)
โข $\nabla_W(-2y^TXW) = -2X^Ty$ (linear)
โข $\nabla_W(W^TX^TXW) = 2X^TXW$ (quadratic)
Result:
$\nabla f(W) = -2X^Ty + 2X^TXW$
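Before trusting the algebra, we can check the gradient formula numerically with central finite differences (a sketch; $X$, $y$, and $W$ are small illustrative values):

```python
import numpy as np

X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])
W = np.array([-3.0, 0.08])

def f(W):
    r = y - X @ W
    return r @ r  # ||y - XW||_2^2

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ W

# Independent check: central finite differences along each coordinate
eps = 1e-6
grad_numeric = np.array([(f(W + eps * e) - f(W - eps * e)) / (2 * eps)
                         for e in np.eye(len(W))])
print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```

The two gradients agree, so we can safely set the analytic one to zero and solve.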
$\nabla f(W) = -2X^Ty + 2X^TXW = 0$
↓ Divide by 2
$-X^Ty + X^TXW = 0$
↓ Add $X^Ty$ to both sides
$X^TXW = X^Ty$
$W = (X^TX)^{-1}X^Ty$
Closed-form solution for linear regression!
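In code, the normal equation is a few lines. A sketch fitting the table's four months; note that solving the linear system $X^TXW = X^Ty$ with `np.linalg.solve` is more numerically stable than computing the inverse explicitly:

```python
import numpy as np

# Design matrix with a column of ones for the intercept
X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])

# Solve X^T X W = X^T y instead of forming (X^T X)^{-1} directly
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)  # [w0, w1] = [-3.4, 0.08]
```

For this data the fitted line is $\hat{y} = -3.4 + 0.08x$: each extra \$1000 of TV ads is associated with about \$80,000 more sales, and hitting the \$10M target would suggest $x = (10 + 3.4)/0.08 = 167.5$, i.e., roughly \$167,500 of ad spend.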
$W = (X^TX)^{-1}X^Ty$
Looks elegant, but...
1,000 features → invert the $1000 \times 1000$ matrix $X^TX$
Time complexity: $O(n^3)$ in the number of features, plus $O(mn^2)$ just to form $X^TX$ 😱
100,000 features → invert a $100{,}000 \times 100{,}000$ matrix
Takes hours or crashes! 💥
Instead of computing the inverse directly, take small steps toward the minimum!
🎯 Advantages:
✓ No matrix inversion needed
✓ Scales to millions of data points
✓ Foundation of modern ML
✓ Works for non-linear models too!
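A minimal gradient-descent sketch on the same four data points (the learning rate and iteration count are illustrative; the feature is standardized so a fixed learning rate converges):

```python
import numpy as np

X = np.array([[1.0, 120.0], [1.0, 125.0], [1.0, 140.0], [1.0, 110.0]])
y = np.array([5.0, 7.0, 8.0, 6.0])

# Standardize the feature column so a fixed learning rate converges
mu, sigma = X[:, 1].mean(), X[:, 1].std()
X_s = X.copy()
X_s[:, 1] = (X[:, 1] - mu) / sigma

W = np.zeros(2)
lr = 0.01                    # learning rate (illustrative)
for _ in range(5000):        # iteration count (illustrative)
    grad = -2 * X_s.T @ (y - X_s @ W)  # gradient from the derivation
    W -= lr * grad                     # small step downhill

# Undo the standardization to recover weights on the original scale
w1 = W[1] / sigma
w0 = W[0] - w1 * mu
print(round(w0, 3), round(w1, 3))  # matches the normal equation
```

After enough steps the iterates converge to the same $w_0 = -3.4$, $w_1 = 0.08$ that the closed-form solution gives, with no matrix inversion anywhere.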
$W = (X^TX)^{-1}X^Ty$
Pros:
✓ Direct solution
✓ No hyperparameters
✓ Exact answer
Cons:
✗ $O(n^3)$ complexity
✗ Doesn't scale
✗ Memory intensive
Best for:
Small problems (modest data, few features)
Iterative updates
Pros:
✓ Scales to huge data
✓ Low memory
✓ Generalizes to non-linear models
Cons:
✗ Needs tuning
✗ Approximate solution
✗ Takes many iterations
Best for:
Large datasets, deep learning
Find a linear relationship between inputs and outputs to predict future values
Add $\varepsilon$ to account for randomness and unmeasured factors
Use Sum of Squared Errors (SSE) to measure fit quality
Normal Equation (small data) or Gradient Descent (large data)
Linear Model
$y = w_0 + w_1x + \varepsilon$
Matrix Form
$\text{SSE} = \|y - XW\|_2^2$
Normal Equation
$W = (X^TX)^{-1}X^Ty$
Next Topic
Gradient Descent! 🚀
Now that we understand linear regression, we need to learn how to actually solve it for large datasets.
📚 We'll explore:
"The journey from theory to practice begins with optimization!"
- QuiverLearn