We learned the normal equation gives us exact solutions for linear regression...
Normal Equation:
$W = (X^TX)^{-1}X^Ty$
✓ Exact solution in one step
⚠️ But there's a catch:
Matrix inversion has $O(n^3)$ complexity
10,000 features ≈ 1 trillion operations!
This becomes impractical VERY quickly.
💡 The Alternative: Gradient Descent
Take many small, cheap steps instead of one expensive leap!
"Instead of computing the exact solution in one expensive step, we can take many small steps toward the solution by following the gradient."
This is the foundation of modern machine learning optimization.
❓ 1. What is a gradient?
We need to understand this mathematical concept
❓ 2. How do we compute gradients?
Especially with multiple variables
❓ 3. Why use gradients for optimization?
What makes them so powerful?
Let's answer each systematically! 🚀
A gradient is the slope of the tangent line to a function at a specific point.
For any point $x$ on function $f(x)$:
• We draw a tangent line at that point
• The slope of this line is the derivative $f'(x)$
• This tells us how steeply the function increases/decreases
For $f(x) = x^2$:
1. Function value: $f(2) = 4$
2. Derivative: $f'(x) = 2x$
3. Gradient at $x=2$: $f'(2) = 4$
At $x = 2$, the function increases with slope 4!
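A minimal sketch checking these numbers (the slide's values correspond to $f(x) = x^2$, assumed here), with a finite-difference estimate as a cross-check on the analytic slope:

```python
# The slide's example numbers correspond to f(x) = x^2 (assumed here).
def f(x):
    return x ** 2

def f_prime(x):          # analytic derivative: f'(x) = 2x
    return 2 * x

x = 2.0
h = 1e-6
# central-difference estimate of the tangent slope at x
numeric_slope = (f(x + h) - f(x - h)) / (2 * h)

print(f(x))        # 4.0  (function value)
print(f_prime(x))  # 4.0  (exact slope of the tangent line)
```

The central difference agrees with $f'(2) = 4$ to several decimal places, which is exactly the "slope of the tangent line" the definition describes.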
💡 Key Insight:
Finding the slope requires calculus—specifically derivatives!
The derivative $f'(x)$ represents the slope of $f(x)$ at point $x$.
For linear regression with one variable:
$f(w) = \sum_{i=1}^m (y_i - (w_0 + w_1x_i))^2$
We compute derivatives with respect to each weight:
$\frac{\partial f}{\partial w_0}$ and $\frac{\partial f}{\partial w_1}$
🎯 Simple Case:
One variable = one derivative = slope at that point!
A partial derivative $\frac{\partial}{\partial x_i}f(\mathbf{x})$ measures how much $f$ changes when we vary only $x_i$ while keeping all other variables constant.
$f(W) = \sum_{i=1}^m (y_i - (w_0 + w_1x_{i1} + \cdots + w_{n-1}x_{i,n-1}))^2$
We need partial derivatives for each weight in $W$!
• $\frac{\partial f}{\partial x}$ = steepness walking east-west
• $\frac{\partial f}{\partial y}$ = steepness walking north-south
Together, they describe the full slope landscape!
The gradient is a vector containing all partial derivatives:
$\nabla_\mathbf{x}f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$
The $i$-th element is the partial derivative with respect to $x_i$.
💡 Critical Point Property
At a minimum (or maximum), all partial derivatives equal zero:
$\nabla_\mathbf{x}f(\mathbf{x}) = \mathbf{0}$
This is how we found the normal equation!
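A quick sketch of this property on synthetic data (shapes and seed are arbitrary choices, not from the text): the gradient of $f(W) = \|y - XW\|_2^2$ is $\nabla_W f = -2X^T(y - XW)$, and it vanishes at the normal-equation solution:

```python
import numpy as np

# Synthetic least-squares problem (shapes/seed arbitrary):
# f(W) = ||y - XW||_2^2, with gradient  grad f(W) = -2 X^T (y - XW).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# Normal-equation solution W* = (X^T X)^{-1} X^T y
W_star = np.linalg.inv(X.T @ X) @ X.T @ y

# At W*, every partial derivative is (numerically) zero
grad = -2 * X.T @ (y - X @ W_star)
print(np.allclose(grad, 0))  # True
```

Setting this gradient to zero and solving for $W$ is precisely the derivation of the normal equation.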
Our linear regression objective $f(W) = \|y - XW\|_2^2$ is convex.
Special Property:
✓ Only ONE minimum—the global minimum
✓ No local minima to get stuck in
✓ Any minimum we find is the BEST solution
✓ Local minimum = Global minimum
✓ Linear Regression
Bowl-shaped surface
Single minimum
Always converges
✗ Neural Networks
Multiple local minima
May get stuck
No guarantee
The directional derivative in direction $\mathbf{u}$ (a unit vector) tells us:
"How fast does $f$ change if we move in direction $\mathbf{u}$?"
Using the chain rule:
$\frac{\partial}{\partial \alpha} f(\mathbf{x} + \alpha \mathbf{u}) = \left(\frac{\partial (\mathbf{x} + \alpha \mathbf{u})}{\partial \alpha}\right)^T \nabla_\mathbf{x}f(\mathbf{x}) = \mathbf{u}^T \nabla_\mathbf{x}f(\mathbf{x})$
🔍 Beautiful Result:
Rate of change in direction $\mathbf{u}$ = dot product of direction and gradient!
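This result is easy to verify numerically. A sketch using an assumed toy function $f(x_0, x_1) = x_0^2 + 3x_1^2$ (any smooth function would do): the dot product $\mathbf{u}^T \nabla f$ matches a finite-difference estimate of the rate of change along $\mathbf{u}$:

```python
import numpy as np

# Assumed toy function: f(x) = x0^2 + 3*x1^2
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)          # make u a unit direction

analytic = u @ grad_f(x)           # u^T grad f(x)

alpha = 1e-6                       # small step along u
numeric = (f(x + alpha * u) - f(x - alpha * u)) / (2 * alpha)

print(abs(analytic - numeric) < 1e-4)  # True: the two rates agree
```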
Which direction $\mathbf{u}$ minimizes our function fastest?
$\min_{\mathbf{u}, \|\mathbf{u}\|_2=1} \mathbf{u}^T \nabla_\mathbf{x}f(\mathbf{x}) = \min_{\mathbf{u}, \|\mathbf{u}\|_2=1} \|\mathbf{u}\|_2 \|\nabla_\mathbf{x}f(\mathbf{x})\|_2 \cos \theta$
Since $\mathbf{u}$ is a unit vector ($\|\mathbf{u}\|_2 = 1$):
$\min_{\mathbf{u}} \|\nabla_\mathbf{x}f(\mathbf{x})\|_2 \cos \theta = \min_{\mathbf{u}} \cos \theta$
💡 The Optimal Direction
Minimum of $\cos \theta$ occurs when $\theta = 180°$
$\mathbf{u}$ should point OPPOSITE to the gradient!
$\mathbf{u} = -\frac{\nabla_\mathbf{x}f(\mathbf{x})}{\|\nabla_\mathbf{x}f(\mathbf{x})\|_2}$
Move in the OPPOSITE direction of the gradient!
This is why it's called gradient DESCENT! ⬇️
General form:
$\mathbf{x}_{new} = \mathbf{x}_{old} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}_{old})$
For linear regression weights:
$W_{new} = W_{old} - \varepsilon \nabla_W f(W_{old})$
where $\varepsilon$ is the learning rate
1. Start: $W = [1.0, 2.0]$, error $f(W) = 10.5$
2. Compute gradient: $\nabla_W f(W) = [2.5, -1.2]$
3. Choose learning rate: $\varepsilon = 0.1$
4. Update:
$W_{new} = [1.0, 2.0] - 0.1 \times [2.5, -1.2]$
$W_{new} = [0.75, 2.12]$
5. New error: $f(W_{new}) = 9.3$ ✓ Decreased!
6. Repeat until convergence!
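The arithmetic of step 4 can be checked directly (the gradient $[2.5, -1.2]$ is the walkthrough's illustrative value, not computed from any dataset):

```python
import numpy as np

# Step 4 of the walkthrough: one gradient-descent update.
# The gradient values are the example's illustrative numbers.
W = np.array([1.0, 2.0])
grad = np.array([2.5, -1.2])
eps = 0.1                      # learning rate

W_new = W - eps * grad
print(W_new)  # [0.75 2.12]
```

Note that $w_1$ *increases* because its partial derivative is negative; each weight moves opposite the sign of its own gradient component.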
The learning rate $\varepsilon$ controls how far we move in each iteration.
The Dilemma:
🐌 Too Small
Very slow convergence
Many iterations needed
Computationally expensive
🚀 Too Large
May overshoot minimum
Can diverge/oscillate
Less stable
😰 The Problem:
Finding the right balance is crucial for practical machine learning!
Choose $\varepsilon = 0.001$ or $0.01$
✓ Simple, stable | ✗ May be inefficient
Solve: $\min_\varepsilon f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
✓ Optimal step size | ✗ Computationally expensive
Algorithm:
1. Start with candidate $\varepsilon$ (e.g., 1.0)
2. Evaluate $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
3. If not decreasing enough:
$\varepsilon \leftarrow 0.5\varepsilon$
4. Repeat until acceptable decrease
Most popular in practice!
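A minimal 1-D sketch of this halving strategy, assuming a basic Armijo-style sufficient-decrease test (the constant `c = 1e-4` is a conventional choice, not from the text):

```python
def f(x):
    return x ** 2

def grad(x):
    return 2 * x

def backtracking_step(x, eps=1.0, c=1e-4):
    """Halve eps until the step decreases f enough (Armijo-style test)."""
    g = grad(x)
    # "not decreasing enough" means f after the step is above
    # f(x) minus a small fraction of the predicted decrease
    while f(x - eps * g) > f(x) - c * eps * g * g:
        eps *= 0.5
    return x - eps * g, eps

x_new, used_eps = backtracking_step(3.0)
print(x_new, used_eps)  # 0.0 0.5
```

Starting from $x = 3$ the candidate $\varepsilon = 1$ overshoots (it would jump to $x = -3$, with no decrease), so one halving to $\varepsilon = 0.5$ is accepted.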
Adaptive methods automatically adjust $\varepsilon$ during training!
Adam
Adapts learning rate per parameter based on gradient history
RMSprop
Uses moving average of squared gradients
AdaGrad
Adapts based on cumulative gradient info
🎯 Impact:
These methods handle learning rate automatically, making deep learning much more practical!
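As an illustration (a bare-bones sketch, not a production optimizer), a single Adam-style update with the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ looks like:

```python
import numpy as np

# One Adam-style update (standard defaults: beta1=0.9, beta2=0.999).
def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g ** 2     # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias corrections for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0, 2.0]), np.zeros(2), np.zeros(2)
g = np.array([2.5, -1.2])              # example gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # each parameter moves roughly lr in its own adapted direction
```

The division by $\sqrt{\hat{v}}$ is what makes the step size per-parameter: large, noisy gradients get damped while small, consistent ones keep moving.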
1️⃣ Problem: Normal equation has $O(n^3)$ complexity, too expensive!
2️⃣ Solution: Gradient descent, iterative optimization
3️⃣ How: Start random → Compute gradient → Go downhill → Repeat
4️⃣ Why: Convex function → Gradient opposite direction → Guaranteed convergence!
5️⃣ Key Param: Learning rate $\varepsilon$ controls step size
6️⃣ Modern: Adaptive methods (Adam, RMSprop) auto-tune!
$W = (X^TX)^{-1}X^Ty$
✓ Advantages:
Exact solution in one step
No parameters to tune
No iterations needed
✗ Disadvantages:
$O(n^3)$ complexity
Requires matrix inversion
Impractical for large $n$
Use: $n < 10,000$
$W_{new} = W_{old} - \varepsilon \nabla_W f(W)$
✓ Advantages:
Scales to large datasets
$O(mn)$ per iteration ($m$ samples, $n$ features)
Memory efficient
Foundation for deep learning
✗ Disadvantages:
Many iterations needed
Must tune learning rate
Use: $n > 10,000$
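The two columns can be compared on a tiny synthetic problem (sizes, seed, and iteration count are arbitrary choices): gradient descent run long enough lands on the same $W$ as the normal equation:

```python
import numpy as np

# Tiny synthetic least-squares problem (sizes/seed arbitrary).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.5]) + rng.normal(scale=0.1, size=50)

# Normal equation: exact answer in one expensive solve
W_exact = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: many small, cheap steps
W = np.zeros(2)
eps = 0.005
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ W)   # gradient of ||y - XW||^2
    W = W - eps * grad

print(np.allclose(W, W_exact))  # True: both reach the same minimum
```

Because the objective is convex, the iterative route is guaranteed to end up at the same (global) minimum the closed form computes directly.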
✓ Gradient = slope of tangent line; direction of steepest increase
✓ Partial derivatives measure change in one variable (others fixed)
✓ Gradient vector contains all partial derivatives
✓ Convex functions have one global minimum—perfect for GD!
✓ Directional derivative in opposite gradient = steepest descent
✓ Learning rate controls step size & convergence
✓ Gradient descent trades exact solution for efficiency
✓ Foundation of modern machine learning!
For $f(x) = x^2 - 4x + 5$, compute one iteration starting from $x = 0$ with learning rate $\varepsilon = 0.1$.
Hint:
1. Compute derivative $f'(x)$
2. Evaluate at $x = 0$
3. Apply update rule: $x_{new} = x_{old} - \varepsilon f'(x_{old})$
Try it yourself before checking the solution! 💪
Question:
Why is linear regression particularly well-suited for gradient descent compared to other machine learning models? What property makes it reliable?
Hint:
Think about the shape of the error function and the number of minima it has.
Key concept: Convexity! 🏔️
🎯 Stochastic Gradient Descent (SGD)
Computing gradients on mini-batches instead of entire dataset
🚀 Advanced Optimizers
Deep dive into Adam, RMSprop, and momentum methods
🛡️ Regularization
L1 and L2 penalties to prevent overfitting
🧠 Non-linear Models
Extending these concepts to neural networks!
Gradients point uphill
We go downhill
Convexity guarantees success
Foundation of deep learning
Powers modern AI
Scales to billions of parameters
Master gradient descent, unlock machine learning! 🔑
"Gradient descent is the workhorse of modern machine learning. Master it, and you've unlocked the foundation of deep learning!"
- QuiverLearn