Gradient Descent for Linear Regression: The Iterative Solution


In our previous post on linear regression, we discovered the normal equation—a beautiful closed-form solution for finding optimal weights. But there was a catch: computing the matrix inverse is computationally expensive for large datasets. This is where gradient descent comes to the rescue! Let’s explore how this elegant iterative method optimizes our linear regression model efficiently.

The Problem with the Normal Equation


Remember the normal equation we derived? It gives us the exact solution, but at a computational cost that grows rapidly with dataset size.

Normal Equation (Exact Solution)

The optimal weight vector for linear regression is:

$$W = (X^TX)^{-1}X^Ty$$

This is called the exact solution or closed-form solution because it directly computes the answer.

⚠️ The Computational Bottleneck

The matrix inversion $(X^TX)^{-1}$ has computational complexity of $O(n^3)$, where $n$ is the number of features. For large datasets:

  • 10,000 features: $10{,}000^3 = 10^{12}$ (1 trillion) operations
  • 100,000 features: $100{,}000^3 = 10^{15}$ (1 quadrillion) operations

This becomes impractical very quickly!

💡 The Alternative: Iterative Methods

Instead of computing the exact solution in one expensive step, we can use iterative methods that:

  • Take many small, cheap steps toward the solution
  • Don’t require matrix inversion
  • Scale better to large datasets
  • Form the foundation of modern machine learning

The most popular iterative method is gradient descent.

Three Fundamental Questions

Before we dive into gradient descent, let’s address three essential questions:

  1. What is a gradient?
  2. How do we compute gradients?
  3. Why use gradients for iterative optimization?

Let’s answer each of these systematically.

What is a Gradient?

Gradient

A gradient is the slope of the tangent line to a function at a specific point.

For any point $x$ on a function $f(x)$:

  • We can draw a tangent line at that point
  • The slope of this tangent line is the derivative $f'(x)$
  • This slope tells us how steeply the function is increasing or decreasing at that point

Simple Example: One Variable

Consider the function $f(x) = x^2$ at the point $x = 2$:

  1. The function value is $f(2) = 4$
  2. The derivative is $f'(x) = 2x$
  3. At $x = 2$, the gradient is $f'(2) = 4$

This tells us that at $x = 2$, the function is increasing with a slope of 4. If we move slightly to the right (increasing $x$), the function value will increase approximately 4 times as fast as our step size.
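To make this concrete, here is a small Python sketch (illustrative, not from the original post) that approximates the slope with a central finite difference and recovers $f'(2) = 4$:

```python
# Numerically check that the slope of f(x) = x^2 at x = 2 is 4,
# using a central finite difference to approximate f'(x).
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

slope = numerical_derivative(f, 2.0)
print(round(slope, 4))  # → 4.0
```

The finite difference is only an approximation, but for smooth functions like $x^2$ it matches the analytic derivative to many decimal places.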

💡 Tangent Line and Slope

Finding the slope of a line given two points is easy. Finding the slope of a tangent line (which only touches the curve at one point) requires calculus—specifically, derivatives!

The derivative $f'(x)$ gives us exactly the slope we need at any point $x$.

Computing Gradients: From Derivatives to Partial Derivatives

Single Variable Case

For a function of one variable, computing the gradient is straightforward:

Derivative (Single Variable)

The derivative $f'(x)$ represents the slope of $f(x)$ at point $x$.

For linear regression with one variable, our error function is:

$$f(w) = \sum_{i=1}^m \left(y_i - (w_0 + w_1 x_i)\right)^2$$

We compute derivatives with respect to each weight: $\frac{\partial f}{\partial w_0}$ and $\frac{\partial f}{\partial w_1}$.

Multiple Variables: Partial Derivatives

In linear regression, we typically have multiple weights (one for each feature plus the intercept). How do we compute gradients when there are multiple variables?

Partial Derivative

A partial derivative $\frac{\partial}{\partial x_i} f(\mathbf{x})$ measures how much $f$ changes when we vary only $x_i$ while keeping all other variables constant.

For our linear regression metric:

$$f(W) = \sum_{i=1}^m \left(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{n-1} x_{i,n-1})\right)^2 = \|y - XW\|_2^2$$

We need to compute the partial derivative with respect to each weight in $W$.

Partial Derivative Intuition

Imagine you’re standing on a hillside:

  • The partial derivative with respect to x tells you how steep the hill is if you walk east-west
  • The partial derivative with respect to y tells you how steep the hill is if you walk north-south

Together, these partial derivatives tell you the full “slope landscape” at your position!
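The same idea can be checked numerically. The sketch below uses a hypothetical hill function $h(x, y) = x^2 + 3y^2$ (chosen purely for illustration) and approximates each partial derivative by nudging one variable while holding the other fixed:

```python
# Approximate partial derivatives of a two-variable "hill" function
# h(x, y) = x^2 + 3*y^2 by varying one variable at a time.
def h(x, y):
    return x ** 2 + 3 * y ** 2

def partial_x(f, x, y, eps=1e-6):
    # Vary x only; y is held constant
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

def partial_y(f, x, y, eps=1e-6):
    # Vary y only; x is held constant
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

# At (1, 2): analytically dh/dx = 2x = 2 and dh/dy = 6y = 12.
print(round(partial_x(h, 1.0, 2.0), 4), round(partial_y(h, 1.0, 2.0), 4))  # → 2.0 12.0
```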

The Gradient Vector

Gradient Vector

The gradient generalizes the concept of derivative to multiple dimensions. It’s a vector containing all partial derivatives:

$$\nabla_\mathbf{x} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The $i$-th element of $\nabla_\mathbf{x} f(\mathbf{x})$ is the partial derivative of $f$ with respect to $x_i$.

💡 Critical Point Property

At a minimum (or maximum) of a function, all partial derivatives equal zero:

$$\nabla_\mathbf{x} f(\mathbf{x}) = \mathbf{0}$$

This is how we found the normal equation—we set the gradient to zero and solved!

Why Use Gradient for Optimization?

Now for the key question: Why does following the gradient help us optimize our function?

Convex Functions: One Global Minimum

Convex Function

Our linear regression metric $f(W) = \|y - XW\|_2^2$ is a convex function. A convex function has a special property:

It has only one minimum—the global minimum.

This means:

  • There are no local minima to get stuck in
  • Any minimum we find is the best solution
  • Local minimum = Global minimum

Convex Function ✓

Linear Regression Error

  • Bowl-shaped surface
  • Single minimum point
  • Gradient descent always finds it
  • Guaranteed convergence

Examples: Sum of squared errors, mean squared error

Non-Convex Function ✗

Complex Error Landscape

  • Multiple local minima
  • Gradient descent may get stuck
  • Solution depends on starting point
  • No convergence guarantee

Examples: Neural networks, some polynomial functions

💡 Why Convexity Matters

Because our linear regression error function is convex:

  1. We can start at any random point
  2. Follow the opposite direction of the gradient (downhill)
  3. We’re guaranteed to reach the global minimum eventually!

This is why linear regression is so reliable compared to more complex models.
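A quick way to see this reliability is to run gradient descent on a simple convex function from several starting points. A sketch, assuming the toy function $f(x) = (x-3)^2$ (not from the post):

```python
# On a convex function, gradient descent reaches the same global
# minimum no matter where it starts. Toy example: f(x) = (x - 3)^2.
def grad(x):
    # Derivative of (x - 3)^2
    return 2 * (x - 3)

def descend(x, lr=0.1, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

starts = [-100.0, 0.0, 57.0]
print([round(descend(x), 6) for x in starts])  # → [3.0, 3.0, 3.0]
```

All three runs land on $x = 3$, the unique global minimum, which is exactly the guarantee convexity buys us.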

Directional Derivatives: Moving in the Right Direction

Let’s prove mathematically why moving opposite to the gradient minimizes our function.

Directional Derivative

The directional derivative in direction $\mathbf{u}$ (a unit vector) is the derivative of $f(\mathbf{x} + \alpha\mathbf{u})$ with respect to $\alpha$, evaluated at $\alpha = 0$.

It tells us: "How fast does $f$ change if we move in direction $\mathbf{u}$?"

Using the chain rule, we can derive:

$$\begin{aligned} \frac{\partial}{\partial \alpha} f(\mathbf{x} + \alpha \mathbf{u}) &= \sum_i \frac{\partial f(\mathbf{x} + \alpha \mathbf{u})}{\partial (x_i + \alpha u_i)} \frac{\partial (x_i + \alpha u_i)}{\partial \alpha} \\ &= \left(\frac{\partial (\mathbf{x} + \alpha \mathbf{u})}{\partial \alpha}\right)^T \nabla_\mathbf{x} f(\mathbf{x}) \\ &= \mathbf{u}^T \nabla_\mathbf{x} f(\mathbf{x}) \end{aligned}$$

This beautiful result says: the rate of change in direction $\mathbf{u}$ is simply the dot product of the direction and the gradient!

Finding the Best Direction

Now, which direction $\mathbf{u}$ minimizes our function the fastest?

$$\min_{\mathbf{u},\, \|\mathbf{u}\|_2=1} \mathbf{u}^T \nabla_\mathbf{x} f(\mathbf{x}) = \min_{\mathbf{u},\, \|\mathbf{u}\|_2=1} \|\mathbf{u}\|_2 \, \|\nabla_\mathbf{x} f(\mathbf{x})\|_2 \cos \theta$$

where $\theta$ is the angle between $\mathbf{u}$ and the gradient.

Since $\mathbf{u}$ is a unit vector, $\|\mathbf{u}\|_2 = 1$, and since $\|\nabla_\mathbf{x} f(\mathbf{x})\|_2$ does not depend on $\mathbf{u}$, the minimization reduces to choosing the direction alone:

$$\min_{\mathbf{u}} \|\nabla_\mathbf{x} f(\mathbf{x})\|_2 \cos \theta = \min_{\mathbf{u}} \cos \theta$$

💡 The Optimal Direction

The minimum of $\cos \theta$ occurs when $\theta = 180°$ (or $\pi$ radians).

This means: $\mathbf{u}$ should point in the opposite direction of the gradient!

$$\mathbf{u} = -\frac{\nabla_\mathbf{x} f(\mathbf{x})}{\|\nabla_\mathbf{x} f(\mathbf{x})\|_2}$$

Key Insight: To minimize a function, move in the opposite direction of the gradient. This is why the method is called gradient descent—we’re descending down the gradient!
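This claim is easy to verify numerically. For the toy function $f(x, y) = x^2 + y^2$ at the point $(3, 4)$ (an assumed example, not from the post), the directional derivative $\mathbf{u}^T \nabla f$ is most negative when $\mathbf{u}$ points opposite the gradient:

```python
import math

# For f(x, y) = x^2 + y^2 at point (3, 4), the gradient is (6, 8).
# Compare the directional derivative u . grad for several unit
# directions; the negative-gradient direction is steepest descent.
grad = (6.0, 8.0)
norm = math.hypot(*grad)          # |grad| = 10
steepest = (-grad[0] / norm, -grad[1] / norm)

def directional_derivative(u, g):
    # Rate of change of f in direction u: the dot product u . g
    return u[0] * g[0] + u[1] * g[1]

directions = {
    "east": (1.0, 0.0),
    "north": (0.0, 1.0),
    "neg-gradient": steepest,
}
for name, u in directions.items():
    print(name, round(directional_derivative(u, grad), 2))
# The neg-gradient direction gives -10 = -|grad|, the most
# negative rate of change among all unit directions.
```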

The Gradient Descent Algorithm

Now we can formulate the complete algorithm!

Gradient Descent Update Rule

Starting from an arbitrary point $\mathbf{x}$, we update our position using:

$$\mathbf{x}_{\text{new}} = \mathbf{x}_{\text{old}} - \varepsilon \nabla_\mathbf{x} f(\mathbf{x}_{\text{old}})$$

where $\varepsilon$ is called the learning rate.

For linear regression weights $W$:

$$W_{\text{new}} = W_{\text{old}} - \varepsilon \nabla_W f(W_{\text{old}})$$

Gradient Descent in Action

Let’s walk through one iteration:

  1. Start: We're at point $W = [1.0, 2.0]$ with error $f(W) = 10.5$
  2. Compute gradient: $\nabla_W f(W) = [2.5, -1.2]$
  3. Choose learning rate: $\varepsilon = 0.1$
  4. Update: $W_{\text{new}} = [1.0, 2.0] - 0.1 \times [2.5, -1.2] = [0.75, 2.12]$
  5. New error: $f(W_{\text{new}}) = 9.3$ (decreased!)
  6. Repeat until convergence

After many iterations, we reach the optimal weights!
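The full loop can be sketched in a few lines of Python. This is a minimal illustration with made-up data following $y = 1 + 2x$, a fixed learning rate, and a fixed iteration count, not a production implementation:

```python
# Minimal gradient descent for one-variable linear regression,
# minimizing the SSE f(w0, w1) = sum_i (y_i - (w0 + w1*x_i))^2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up data: y = 1 + 2x

def gradient(w0, w1):
    # Partial derivatives of the SSE with respect to w0 and w1
    g0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
    g1 = sum(-2 * x * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
    return g0, g1

w0, w1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    g0, g1 = gradient(w0, w1)
    w0, w1 = w0 - lr * g0, w1 - lr * g1   # step opposite the gradient

print(round(w0, 3), round(w1, 3))  # converges toward w0 = 1, w1 = 2
```

With enough iterations and a safe learning rate, the weights settle on the true intercept and slope.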

The Learning Rate: How Big Should Our Steps Be?

The learning rate $\varepsilon$ controls how far we move in each iteration. Choosing it is crucial!

Learning Rate (ε)

The learning rate determines the step size in gradient descent:

  • Too small → very slow convergence
  • Too large → may overshoot the minimum or diverge
  • Just right → efficient convergence to the minimum

Small Learning Rate

Pros:

  • Stable, guaranteed progress
  • Won’t overshoot minimum
  • More precise convergence

Cons:

  • Very slow convergence
  • Many iterations needed
  • Computationally expensive

Large Learning Rate

Pros:

  • Fast initial progress
  • Fewer iterations needed
  • Computationally efficient

Cons:

  • May overshoot minimum
  • Can diverge or oscillate
  • Less stable
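These trade-offs are easy to see on the toy function $f(x) = x^2$, where the update is $x \leftarrow x - \varepsilon \cdot 2x$. A small sketch comparing three rates (the specific values are illustrative):

```python
# Compare learning rates on f(x) = x^2, whose gradient is 2x.
# Starting from x = 1.0, each step is x <- x - lr * 2x.
def run(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * (2 * x)
    return x

for lr in (0.01, 0.4, 1.1):
    print(lr, run(lr))
# 0.01: still far from the minimum after 20 steps (too slow)
# 0.4 : essentially 0 (converged)
# 1.1 : huge magnitude (diverged; each step overshoots and grows)
```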

Three Common Strategies for Choosing Learning Rate

Strategy 1: Fixed Small Value

Approach: Choose a small constant value like $\varepsilon = 0.001$ or $\varepsilon = 0.01$

Pros:

  • Simple to implement
  • Generally stable
  • Works well for many problems

Cons:

  • May be inefficient (too slow)
  • Requires manual tuning

When to use: When you want simplicity and stability, and computational cost is not critical.

Strategy 2: Exact Line Search

Approach: Find the step size that makes the most progress along the descent direction, i.e., the $\varepsilon$ where $\frac{d}{d\varepsilon} f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x})) = 0$

In other words, solve:

$$\min_\varepsilon f(\mathbf{x} - \varepsilon \nabla_\mathbf{x} f(\mathbf{x}))$$

Pros:

  • Theoretically optimal step size
  • Maximum progress per iteration

Cons:

  • Computationally expensive (solving an optimization problem at each step!)
  • Often not practical

When to use: Rarely in practice, but useful for theoretical understanding.

Strategy 3: Backtracking Line Search (Most Popular)

Approach: Start from a candidate $\varepsilon$ and repeatedly shrink it until $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$ shows a sufficient decrease

Algorithm:

  1. Start with a candidate $\varepsilon$ (e.g., 1.0)
  2. Evaluate $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
  3. If it's not decreasing enough, reduce $\varepsilon$ (e.g., $\varepsilon \leftarrow 0.5\varepsilon$)
  4. Repeat until we find acceptable decrease

Pros:

  • Good balance of speed and accuracy
  • Adaptive to function landscape
  • Used in most modern implementations

Cons:

  • More complex than fixed rate
  • Requires multiple function evaluations per iteration

When to use: This is the most popular method in practice!
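A minimal sketch of backtracking line search, assuming the common Armijo sufficient-decrease test and the toy objective $f(x) = x^2$ (the shrink factor 0.5 matches the halving rule above; the constant `c` is a conventional choice, not from the post):

```python
# Backtracking line search (Armijo condition) on f(x) = x^2.
def f(x):
    return x ** 2

def grad(x):
    return 2 * x

def backtracking_step(x, eps=1.0, beta=0.5, c=1e-4):
    g = grad(x)
    # Shrink eps until the decrease is "sufficient":
    # f(x - eps*g) <= f(x) - c * eps * g^2  (Armijo condition)
    while f(x - eps * g) > f(x) - c * eps * g * g:
        eps *= beta
    return eps

x = 5.0
for _ in range(10):
    x = x - backtracking_step(x) * grad(x)
print(abs(x) < 1e-3)  # → True
```

Each iteration pays a few extra function evaluations to pick its own step size, which is exactly the speed/robustness trade-off described above.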

⚠️ Learning Rate Pitfalls

Too Small: Your algorithm will take forever to converge. You might run out of computational budget before reaching the minimum.

Too Large: Your algorithm might:

  • Oscillate around the minimum without reaching it
  • Diverge completely (error increases instead of decreases!)
  • Jump over the optimal solution repeatedly

Finding the right balance is crucial for practical machine learning!

💡 Modern Adaptive Methods

Modern machine learning uses adaptive learning rate methods that automatically adjust $\varepsilon$ during training:

  • Adam: Adapts learning rate per parameter based on gradient history
  • RMSprop: Uses moving average of squared gradients
  • AdaGrad: Adapts based on cumulative gradient information

These methods handle the learning rate problem automatically, making deep learning much more practical!

Putting It All Together


The Complete Gradient Descent Picture:

  1. Problem: Normal equation is computationally expensive ($O(n^3)$)
  2. Solution: Gradient descent—iterative optimization
  3. How it works:
    • Start with random weights
    • Compute gradient (direction of steepest increase)
    • Move opposite to gradient (go downhill)
    • Repeat until convergence
  4. Why it works:
    • Linear regression error is convex
    • Gradient points toward maximum increase
    • Opposite direction points toward minimum
    • Guaranteed to reach global minimum!
  5. Key parameter: Learning rate controls step size

Comparison: Normal Equation vs. Gradient Descent

Normal Equation

$$W = (X^TX)^{-1}X^Ty$$

Advantages:

  • Exact solution in one step
  • No parameters to tune
  • No iterations needed

Disadvantages:

  • $O(n^3)$ complexity
  • Requires matrix inversion
  • Impractical for large $n$
  • Memory intensive

Use when: Small to medium datasets (n < 10,000)

Gradient Descent

$$W_{\text{new}} = W_{\text{old}} - \varepsilon \nabla_W f(W)$$

Advantages:

  • Scales to large datasets
  • Cheap iterations ($O(mn)$ for $m$ samples and $n$ features)
  • Memory efficient
  • Foundation for deep learning

Disadvantages:

  • Requires many iterations
  • Must tune learning rate
  • Approximate solution
  • Slower to converge

Use when: Large datasets (n > 10,000) or real-time learning

Key Takeaways


What We’ve Learned:

  1. Gradient = slope of tangent line at a point; direction of steepest increase
  2. Partial derivatives measure change in one variable while others stay fixed
  3. Gradient vector contains all partial derivatives; points toward maximum increase
  4. Convex functions have one global minimum—perfect for gradient descent
  5. Directional derivative in opposite gradient direction gives steepest descent
  6. Learning rate controls step size; crucial for convergence speed and stability
  7. Gradient descent trades exact solution for computational efficiency

This iterative approach forms the foundation of modern machine learning!

Practice Problems

Level 1:

For the function $f(x) = x^2 - 4x + 5$, compute one iteration of gradient descent starting from $x = 0$ with learning rate $\varepsilon = 0.1$.

Hint: First, compute the derivative $f'(x)$, then evaluate it at $x = 0$, and finally apply the update rule $x_{\text{new}} = x_{\text{old}} - \varepsilon f'(x_{\text{old}})$.

Solution:

Step 1: Compute the derivative: $f'(x) = 2x - 4$

Step 2: Evaluate at $x = 0$: $f'(0) = 2(0) - 4 = -4$

Step 3: Apply the update rule: $x_{\text{new}} = 0 - 0.1 \times (-4) = 0.4$

Step 4: Verify improvement:

  • $f(0) = 0 - 0 + 5 = 5$
  • $f(0.4) = 0.16 - 1.6 + 5 = 3.56$ ✓ (decreased!)

The function value decreased from 5 to 3.56, confirming we moved toward the minimum!
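The worked example can be checked with a few lines of Python:

```python
# Verify the Level 1 worked example: one gradient-descent step on
# f(x) = x^2 - 4x + 5 from x = 0 with learning rate 0.1.
def f(x):
    return x ** 2 - 4 * x + 5

def fprime(x):
    return 2 * x - 4

x_new = 0 - 0.1 * fprime(0)              # 0 - 0.1 * (-4) = 0.4
print(x_new, f(0), round(f(x_new), 2))   # → 0.4 5 3.56
```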

Level 2:

Why is linear regression particularly well-suited for gradient descent compared to other machine learning models? What property makes it reliable?

Hint: Think about the shape of the error function and the number of minima it has.

Solution:

Linear regression is ideal for gradient descent because:

  1. Convex error function: The squared error $\|y - XW\|_2^2$ is convex, meaning it has a bowl shape with a single minimum.

  2. No local minima: Unlike neural networks or polynomial models, there’s no risk of getting stuck in local minima. Any minimum we find is the global minimum.

  3. Smooth gradients: The error function is differentiable everywhere, providing smooth gradients that reliably point toward the minimum.

  4. Guaranteed convergence: With proper learning rate, gradient descent is mathematically guaranteed to converge to the optimal solution.

  5. Well-behaved landscape: No saddle points, plateaus, or other pathological features that plague more complex models.

Contrast with neural networks: Neural networks have highly non-convex error surfaces with many local minima, making optimization much more challenging. This is why deep learning requires sophisticated techniques like batch normalization, careful initialization, and adaptive learning rates!

Level 3:

Implement a conceptual trace of gradient descent for linear regression with one variable. Given data points $(1, 3)$, $(2, 5)$, $(3, 7)$, starting weights $w_0 = 0, w_1 = 1$, and learning rate $\varepsilon = 0.01$, compute two iterations. What happens to the error?

Hint: The gradient of $f(W) = \sum_i (y_i - (w_0 + w_1 x_i))^2$ with respect to $w_0$ is $-2\sum_i (y_i - (w_0 + w_1 x_i))$ and with respect to $w_1$ is $-2\sum_i x_i (y_i - (w_0 + w_1 x_i))$.

Solution:

Initial state: $w_0 = 0, w_1 = 1$

Iteration 1:

Compute predictions:

  • $\hat{y}_1 = 0 + 1(1) = 1$, error: $e_1 = 3 - 1 = 2$
  • $\hat{y}_2 = 0 + 1(2) = 2$, error: $e_2 = 5 - 2 = 3$
  • $\hat{y}_3 = 0 + 1(3) = 3$, error: $e_3 = 7 - 3 = 4$

SSE = $2^2 + 3^2 + 4^2 = 4 + 9 + 16 = 29$

Compute gradients:

$$\frac{\partial f}{\partial w_0} = -2(2 + 3 + 4) = -18, \qquad \frac{\partial f}{\partial w_1} = -2(1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4) = -2(2 + 6 + 12) = -40$$

Update weights:

$$w_0^{\text{new}} = 0 - 0.01(-18) = 0.18, \qquad w_1^{\text{new}} = 1 - 0.01(-40) = 1.40$$

Iteration 2: $w_0 = 0.18, w_1 = 1.40$

Compute predictions:

  • $\hat{y}_1 = 0.18 + 1.40(1) = 1.58$, error: $e_1 = 3 - 1.58 = 1.42$
  • $\hat{y}_2 = 0.18 + 1.40(2) = 2.98$, error: $e_2 = 5 - 2.98 = 2.02$
  • $\hat{y}_3 = 0.18 + 1.40(3) = 4.38$, error: $e_3 = 7 - 4.38 = 2.62$

SSE = $1.42^2 + 2.02^2 + 2.62^2 = 2.02 + 4.08 + 6.86 = 12.96$

Result: Error decreased from 29 → 12.96! Gradient descent is working!

With more iterations, the error would continue decreasing until reaching the optimal solution $w_0 = 1, w_1 = 2$ (the true relationship is $y = 1 + 2x$).
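The trace above can be reproduced programmatically; a small sketch:

```python
# Reproduce the Level 3 trace: gradient descent on the data
# (1,3), (2,5), (3,7) starting from w0 = 0, w1 = 1 with lr = 0.01.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def sse(w0, w1):
    # Sum of squared errors f(W) = sum_i (y_i - (w0 + w1*x_i))^2
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in data)

def step(w0, w1, lr=0.01):
    # Partial derivatives of the SSE, then one gradient-descent update
    g0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in data)
    g1 = sum(-2 * x * (y - (w0 + w1 * x)) for x, y in data)
    return w0 - lr * g0, w1 - lr * g1

w0, w1 = 0.0, 1.0
print(sse(w0, w1))                 # → 29.0
w0, w1 = step(w0, w1)
print(round(w0, 2), round(w1, 2))  # → 0.18 1.4
print(round(sse(w0, w1), 2))       # → 12.96
```

Running more steps of the same loop drives the weights toward $w_0 = 1, w_1 = 2$.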

References

  1. Goodfellow, I., Bengio, Y., & Courville, A. - Deep Learning (Chapter 4: Numerical Computation)
  2. Boyd, S., & Vandenberghe, L. - Convex Optimization
  3. Nocedal, J., & Wright, S. - Numerical Optimization
  4. Veytsman, B. - Convex and Concave Functions Visualization

What’s Next?

Now that you understand gradient descent for linear regression, you’re ready to explore:

  • Stochastic Gradient Descent (SGD): Computing gradients on small batches instead of the entire dataset
  • Advanced Optimizers: Adam, RMSprop, and other adaptive methods
  • Regularization: L1 and L2 penalties to prevent overfitting
  • Non-linear Models: Extending these concepts to neural networks!

Gradient descent is the workhorse of modern machine learning. Master it, and you’ve unlocked the foundation of deep learning!