We learned the normal equation gives us exact solutions for linear regression...
Normal Equation:
$W = (X^TX)^{-1}X^Ty$
✓ Exact solution in one step
⚠️ But there's a catch:
Matrix inversion has $O(n^3)$ complexity
10,000 features ≈ 1 trillion operations!
This becomes impractical VERY quickly.
💡 The Alternative: Gradient Descent
Take many small, cheap steps instead of one expensive leap!
"Instead of computing the exact solution in one expensive step, we can take many small steps toward the solution by following the gradient."
This is the foundation of modern machine learning optimization.
❓ 1. What is a gradient?
We need to understand this mathematical concept
❓ 2. How do we compute gradients?
Especially with multiple variables
❓ 3. Why use gradients for optimization?
What makes them so powerful?
Let's answer each systematically! 🚀
A gradient is the slope of the tangent line to a function at a specific point.
For any point $x$ on function $f(x)$:
• We draw a tangent line at that point
• The slope of this line is the derivative $f'(x)$
• This tells us how steeply the function increases/decreases
For $f(x) = x^2$:
1. Function value: $f(2) = 4$
2. Derivative: $f'(x) = 2x$
3. Gradient at $x=2$: $f'(2) = 4$
At $x = 2$, the function increases with slope 4!
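A minimal sketch checking these numbers (the slide's values correspond to $f(x) = x^2$, assumed here), with a finite-difference estimate as a cross-check on the analytic slope:

```python
# The slide's example numbers correspond to f(x) = x^2 (assumed here).
def f(x):
    return x ** 2

def f_prime(x):          # analytic derivative: f'(x) = 2x
    return 2 * x

x = 2.0
h = 1e-6
# central-difference estimate of the tangent slope at x
numeric_slope = (f(x + h) - f(x - h)) / (2 * h)

print(f(x))        # 4.0  (function value)
print(f_prime(x))  # 4.0  (exact slope of the tangent line)
```

The central difference agrees with $f'(2) = 4$ to several decimal places, which is exactly the "slope of the tangent line" the definition describes.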
💡 Key Insight:
Finding the slope requires calculus—specifically derivatives!
The derivative $f'(x)$ represents the slope of $f(x)$ at point $x$.
For linear regression with one variable:
$f(w) = \sum_{i=1}^m (y_i - (w_0 + w_1x_i))^2$
We compute derivatives with respect to each weight:
$\frac{\partial f}{\partial w_0}$ and $\frac{\partial f}{\partial w_1}$
🎯 Simple Case:
One variable = one derivative = slope at that point!
A partial derivative $\frac{\partial}{\partial x_i}f(\mathbf{x})$ measures how much $f$ changes when we vary only $x_i$ while keeping all other variables constant.
$f(W) = \sum_{i=1}^m (y_i - (w_0 + w_1x_{i1} + \cdots + w_{n-1}x_{i,n-1}))^2$
We need partial derivatives for each weight in $W$!
• $\frac{\partial f}{\partial x}$ = steepness walking east-west
• $\frac{\partial f}{\partial y}$ = steepness walking north-south
Together, they describe the full slope landscape!
The gradient is a vector containing all partial derivatives:
$\nabla_\mathbf{x}f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$
The $i$-th element is the partial derivative with respect to $x_i$.
💡 Critical Point Property
At a minimum (or maximum), all partial derivatives equal zero:
$\nabla_\mathbf{x}f(\mathbf{x}) = \mathbf{0}$
This is how we found the normal equation!
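A quick sketch of this property on synthetic data (shapes and seed are arbitrary choices, not from the text): the gradient of $f(W) = \|y - XW\|_2^2$ is $\nabla_W f = -2X^T(y - XW)$, and it vanishes at the normal-equation solution:

```python
import numpy as np

# Synthetic least-squares problem (shapes/seed arbitrary):
# f(W) = ||y - XW||_2^2, with gradient  grad f(W) = -2 X^T (y - XW).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# Normal-equation solution W* = (X^T X)^{-1} X^T y
W_star = np.linalg.inv(X.T @ X) @ X.T @ y

# At W*, every partial derivative is (numerically) zero
grad = -2 * X.T @ (y - X @ W_star)
print(np.allclose(grad, 0))  # True
```

Setting this gradient to zero and solving for $W$ is precisely the derivation of the normal equation.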
Our linear regression objective $f(W) = \|y - XW\|_2^2$ is convex.
Special Property:
✓ Only ONE minimum—the global minimum
✓ No local minima to get stuck in
✓ Any minimum we find is the BEST solution
✓ Local minimum = Global minimum
✓ Linear Regression
Bowl-shaped surface
Single minimum
Always converges
✗ Neural Networks
Multiple local minima
May get stuck
No guarantee
The directional derivative in direction $\mathbf{u}$ (a unit vector) tells us:
"How fast does $f$ change if we move in direction $\mathbf{u}$?"
Using the chain rule:
$\frac{\partial}{\partial \alpha} f(\mathbf{x} + \alpha \mathbf{u}) = \left(\frac{\partial (\mathbf{x} + \alpha \mathbf{u})}{\partial \alpha}\right)^T \nabla_\mathbf{x}f(\mathbf{x}) = \mathbf{u}^T \nabla_\mathbf{x}f(\mathbf{x})$
🔍 Beautiful Result:
Rate of change in direction $\mathbf{u}$ = dot product of direction and gradient!
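This result is easy to verify numerically. A sketch using an assumed toy function $f(x_0, x_1) = x_0^2 + 3x_1^2$ (any smooth function would do): the dot product $\mathbf{u}^T \nabla f$ matches a finite-difference estimate of the rate of change along $\mathbf{u}$:

```python
import numpy as np

# Assumed toy function: f(x) = x0^2 + 3*x1^2
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)          # make u a unit direction

analytic = u @ grad_f(x)           # u^T grad f(x)

alpha = 1e-6                       # small step along u
numeric = (f(x + alpha * u) - f(x - alpha * u)) / (2 * alpha)

print(abs(analytic - numeric) < 1e-4)  # True: the two rates agree
```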
Which direction $\mathbf{u}$ minimizes our function fastest?
$\min_{\mathbf{u}, \|\mathbf{u}\|_2=1} \mathbf{u}^T \nabla_\mathbf{x}f(\mathbf{x}) = \min_{\mathbf{u}, \|\mathbf{u}\|_2=1} \|\mathbf{u}\|_2 \|\nabla_\mathbf{x}f(\mathbf{x})\|_2 \cos \theta$
Since $\mathbf{u}$ is a unit vector ($\|\mathbf{u}\|_2 = 1$):
$\min_{\mathbf{u}} \|\nabla_\mathbf{x}f(\mathbf{x})\|_2 \cos \theta = \min_{\mathbf{u}} \cos \theta$
💡 The Optimal Direction
Minimum of $\cos \theta$ occurs when $\theta = 180°$
$\mathbf{u}$ should point OPPOSITE to the gradient!
$\mathbf{u} = -\frac{\nabla_\mathbf{x}f(\mathbf{x})}{\|\nabla_\mathbf{x}f(\mathbf{x})\|_2}$
Move in the OPPOSITE direction of the gradient!
This is why it's called gradient DESCENT! ⬇️
General form:
$\mathbf{x}_{new} = \mathbf{x}_{old} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}_{old})$
For linear regression weights:
$W_{new} = W_{old} - \varepsilon \nabla_W f(W_{old})$
where $\varepsilon$ is the learning rate
1. Start: $W = [1.0, 2.0]$, error $f(W) = 10.5$
2. Compute gradient: $\nabla_W f(W) = [2.5, -1.2]$
3. Choose learning rate: $\varepsilon = 0.1$
4. Update:
$W_{new} = [1.0, 2.0] - 0.1 \times [2.5, -1.2]$
$W_{new} = [0.75, 2.12]$
5. New error: $f(W_{new}) = 9.3$ ✓ Decreased!
6. Repeat until convergence!
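The arithmetic of step 4 can be checked directly (the gradient $[2.5, -1.2]$ is the walkthrough's illustrative value, not computed from any dataset):

```python
import numpy as np

# Step 4 of the walkthrough: one gradient-descent update.
# The gradient values are the example's illustrative numbers.
W = np.array([1.0, 2.0])
grad = np.array([2.5, -1.2])
eps = 0.1                      # learning rate

W_new = W - eps * grad
print(W_new)  # [0.75 2.12]
```

Note that $w_1$ *increases* because its partial derivative is negative; each weight moves opposite the sign of its own gradient component.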
The learning rate $\varepsilon$ controls how far we move in each iteration.
The Dilemma:
🐌 Too Small
Very slow convergence
Many iterations needed
Computationally expensive
🚀 Too Large
May overshoot minimum
Can diverge/oscillate
Less stable
😰 The Problem:
Finding the right balance is crucial for practical machine learning!
Choose $\varepsilon = 0.001$ or $0.01$
✓ Simple, stable | ✗ May be inefficient
Solve: $\min_\varepsilon f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
✓ Optimal step size | ✗ Computationally expensive
Algorithm:
1. Start with candidate $\varepsilon$ (e.g., 1.0)
2. Evaluate $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
3. If not decreasing enough:
$\varepsilon \leftarrow 0.5\varepsilon$
4. Repeat until acceptable decrease
Most popular in practice!
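A minimal 1-D sketch of this halving strategy, assuming a basic Armijo-style sufficient-decrease test (the constant `c = 1e-4` is a conventional choice, not from the text):

```python
def f(x):
    return x ** 2

def grad(x):
    return 2 * x

def backtracking_step(x, eps=1.0, c=1e-4):
    """Halve eps until the step decreases f enough (Armijo-style test)."""
    g = grad(x)
    # "not decreasing enough" means f after the step is above
    # f(x) minus a small fraction of the predicted decrease
    while f(x - eps * g) > f(x) - c * eps * g * g:
        eps *= 0.5
    return x - eps * g, eps

x_new, used_eps = backtracking_step(3.0)
print(x_new, used_eps)  # 0.0 0.5
```

Starting from $x = 3$ the candidate $\varepsilon = 1$ overshoots (it would jump to $x = -3$, with no decrease), so one halving to $\varepsilon = 0.5$ is accepted.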
Adaptive methods automatically adjust $\varepsilon$ during training!
Adam
Adapts learning rate per parameter based on gradient history
RMSprop
Uses moving average of squared gradients
AdaGrad
Adapts based on cumulative gradient info
🎯 Impact:
These methods handle learning rate automatically, making deep learning much more practical!
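As an illustration (a bare-bones sketch, not a production optimizer), a single Adam-style update with the standard defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ looks like:

```python
import numpy as np

# One Adam-style update (standard defaults: beta1=0.9, beta2=0.999).
def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g ** 2     # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias corrections for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0, 2.0]), np.zeros(2), np.zeros(2)
g = np.array([2.5, -1.2])              # example gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # each parameter moves roughly lr in its own adapted direction
```

The division by $\sqrt{\hat{v}}$ is what makes the step size per-parameter: large, noisy gradients get damped while small, consistent ones keep moving.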
1️⃣ Problem: Normal equation has $O(n^3)$ complexity, too expensive!
2️⃣ Solution: Gradient descent, iterative optimization
3️⃣ How: Start random → Compute gradient → Go downhill → Repeat
4️⃣ Why: Convex function → Gradient opposite direction → Guaranteed convergence!
5️⃣ Key Param: Learning rate $\varepsilon$ controls step size
6️⃣ Modern: Adaptive methods (Adam, RMSprop) auto-tune!
$W = (X^TX)^{-1}X^Ty$
✓ Advantages:
Exact solution in one step
No parameters to tune
No iterations needed
✗ Disadvantages:
$O(n^3)$ complexity
Requires matrix inversion
Impractical for large $n$
Use: $n < 10,000$
$W_{new} = W_{old} - \varepsilon \nabla_W f(W)$
✓ Advantages:
Scales to large datasets
$O(mn)$ per iteration ($m$ samples, $n$ features)
Memory efficient
Foundation for deep learning
✗ Disadvantages:
Many iterations needed
Must tune learning rate
Use: $n > 10,000$
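The two columns can be compared on a tiny synthetic problem (sizes, seed, and iteration count are arbitrary choices): gradient descent run long enough lands on the same $W$ as the normal equation:

```python
import numpy as np

# Tiny synthetic least-squares problem (sizes/seed arbitrary).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.5]) + rng.normal(scale=0.1, size=50)

# Normal equation: exact answer in one expensive solve
W_exact = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: many small, cheap steps
W = np.zeros(2)
eps = 0.005
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ W)   # gradient of ||y - XW||^2
    W = W - eps * grad

print(np.allclose(W, W_exact))  # True: both reach the same minimum
```

Because the objective is convex, the iterative route is guaranteed to end up at the same (global) minimum the closed form computes directly.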
✓ Gradient = slope of tangent line; direction of steepest increase
✓ Partial derivatives measure change in one variable (others fixed)
✓ Gradient vector contains all partial derivatives
✓ Convex functions have one global minimum—perfect for GD!
✓ Directional derivative in opposite gradient = steepest descent
✓ Learning rate controls step size & convergence
✓ Gradient descent trades exact solution for efficiency
✓ Foundation of modern machine learning!
For $f(x) = x^2 - 4x + 5$, compute one iteration starting from $x = 0$ with learning rate $\varepsilon = 0.1$.
Hint:
1. Compute derivative $f'(x)$
2. Evaluate at $x = 0$
3. Apply update rule: $x_{new} = x_{old} - \varepsilon f'(x_{old})$
Try it yourself before checking the solution! 💪
Question:
Why is linear regression particularly well-suited for gradient descent compared to other machine learning models? What property makes it reliable?
Hint:
Think about the shape of the error function and the number of minima it has.
Key concept: Convexity! 🏔️
🎯 Stochastic Gradient Descent (SGD)
Computing gradients on mini-batches instead of entire dataset
🚀 Advanced Optimizers
Deep dive into Adam, RMSprop, and momentum methods
🛡️ Regularization
L1 and L2 penalties to prevent overfitting
🧠 Non-linear Models
Extending these concepts to neural networks!
Gradients point uphill
We go downhill
Convexity guarantees success
Foundation of deep learning
Powers modern AI
Scales to billions of parameters
Master gradient descent, unlock machine learning! 🔑
"Gradient descent is the workhorse of modern machine learning. Master it, and you've unlocked the foundation of deep learning!"
- QuiverLearn