Gradient Descent for Linear Regression: The Iterative Solution


In our previous post on linear regression, we discovered the normal equation—a beautiful closed-form solution for finding optimal weights. But there was a catch: computing the matrix inverse is computationally expensive for large datasets. This is where gradient descent comes to the rescue! Let’s explore how this elegant iterative method optimizes our linear regression model efficiently.

The Problem with the Normal Equation


Remember the normal equation we derived? It gives us the exact solution, but at a computational cost that grows rapidly with dataset size.

Normal Equation (Exact Solution)

The optimal weight vector for linear regression is:

$$W = (X^TX)^{-1}X^Ty$$

This is called the exact solution or closed-form solution because it directly computes the answer.

⚠️ The Computational Bottleneck

The matrix inversion $(X^TX)^{-1}$ has computational complexity of $O(n^3)$, where $n$ is the number of features. For large datasets:

  • 10,000 features: $10{,}000^3 = 10^{12}$ (1 trillion) operations
  • 100,000 features: $100{,}000^3 = 10^{15}$ (1 quadrillion) operations

This becomes impractical very quickly!

💡 The Alternative: Iterative Methods

Instead of computing the exact solution in one expensive step, we can use iterative methods that:

  • Take many small, cheap steps toward the solution
  • Don’t require matrix inversion
  • Scale better to large datasets
  • Form the foundation of modern machine learning

The most popular iterative method is gradient descent.

Three Fundamental Questions

Before we dive into gradient descent, let’s address three essential questions:

  1. What is a gradient?
  2. How do we compute gradients?
  3. Why use gradients for iterative optimization?

Let’s answer each of these systematically.

What is a Gradient?

Gradient

A gradient is the slope of the tangent line to a function at a specific point.

For any point $x$ on a function $f(x)$:

  • We can draw a tangent line at that point
  • The slope of this tangent line is the derivative $f'(x)$
  • This slope tells us how steeply the function is increasing or decreasing at that point

Simple Example: One Variable

Consider the function $f(x) = x^2$ at the point $x = 2$:

  1. The function value is $f(2) = 4$
  2. The derivative is $f'(x) = 2x$
  3. At $x = 2$, the gradient is $f'(2) = 4$

This tells us that at $x = 2$, the function is increasing with a slope of 4. If we move slightly to the right (increasing $x$), the function value will increase approximately 4 times as fast as our step size.
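To make this concrete, here is a small Python sketch (illustrative, not from the original post) that approximates the slope with a central finite difference and recovers $f'(2) = 4$:

```python
# Numerically check that the slope of f(x) = x^2 at x = 2 is 4,
# using a central finite difference to approximate f'(x).
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

slope = numerical_derivative(f, 2.0)
print(round(slope, 4))  # → 4.0
```

The finite difference is only an approximation, but for smooth functions like $x^2$ it matches the analytic derivative to many decimal places.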

💡 Tangent Line and Slope

Finding the slope of a line given two points is easy. Finding the slope of a tangent line (which only touches the curve at one point) requires calculus—specifically, derivatives!

The derivative $f'(x)$ gives us exactly the slope we need at any point $x$.

Computing Gradients: From Derivatives to Partial Derivatives

Single Variable Case

For a function of one variable, computing the gradient is straightforward:

Derivative (Single Variable)

The derivative $f'(x)$ represents the slope of $f(x)$ at point $x$.

For linear regression with one variable, our error function is:

$$f(w) = \sum_{i=1}^m \left(y_i - (w_0 + w_1 x_i)\right)^2$$

We compute derivatives with respect to each weight: $\frac{\partial f}{\partial w_0}$ and $\frac{\partial f}{\partial w_1}$.

Multiple Variables: Partial Derivatives

In linear regression, we typically have multiple weights (one for each feature plus the intercept). How do we compute gradients when there are multiple variables?

Partial Derivative

A partial derivative $\frac{\partial}{\partial x_i} f(\mathbf{x})$ measures how much $f$ changes when we vary only $x_i$ while keeping all other variables constant.

For our linear regression metric:

$$f(W) = \sum_{i=1}^m \left(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_{n-1} x_{i,n-1})\right)^2 = \|y - XW\|_2^2$$

We need to compute the partial derivative with respect to each weight in $W$.

Partial Derivative Intuition

Imagine you’re standing on a hillside:

  • The partial derivative with respect to x tells you how steep the hill is if you walk east-west
  • The partial derivative with respect to y tells you how steep the hill is if you walk north-south

Together, these partial derivatives tell you the full “slope landscape” at your position!
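The same idea can be checked numerically. The sketch below uses a hypothetical hill function $h(x, y) = x^2 + 3y^2$ (chosen purely for illustration) and approximates each partial derivative by nudging one variable while holding the other fixed:

```python
# Approximate partial derivatives of a two-variable "hill" function
# h(x, y) = x^2 + 3*y^2 by varying one variable at a time.
def h(x, y):
    return x ** 2 + 3 * y ** 2

def partial_x(f, x, y, eps=1e-6):
    # Vary x only; y is held constant
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

def partial_y(f, x, y, eps=1e-6):
    # Vary y only; x is held constant
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

# At (1, 2): analytically dh/dx = 2x = 2 and dh/dy = 6y = 12.
print(round(partial_x(h, 1.0, 2.0), 4), round(partial_y(h, 1.0, 2.0), 4))  # → 2.0 12.0
```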

The Gradient Vector

Gradient Vector

The gradient generalizes the concept of derivative to multiple dimensions. It’s a vector containing all partial derivatives:

$$\nabla_\mathbf{x} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The $i$-th element of $\nabla_\mathbf{x} f(\mathbf{x})$ is the partial derivative of $f$ with respect to $x_i$.

💡 Critical Point Property

At a minimum (or maximum) of a function, all partial derivatives equal zero:

$$\nabla_\mathbf{x} f(\mathbf{x}) = \mathbf{0}$$

This is how we found the normal equation—we set the gradient to zero and solved!

Why Use Gradient for Optimization?

Now for the key question: Why does following the gradient help us optimize our function?

Convex Functions: One Global Minimum

Convex Function

Our linear regression metric $f(W) = \|y - XW\|_2^2$ is a convex function. A convex function has a special property:

It has only one minimum—the global minimum.

This means:

  • There are no local minima to get stuck in
  • Any minimum we find is the best solution
  • Local minimum = Global minimum

Convex Function ✓

Linear Regression Error

  • Bowl-shaped surface
  • Single minimum point
  • Gradient descent always finds it
  • Guaranteed convergence

Examples: Sum of squared errors, mean squared error

Non-Convex Function ✗

Complex Error Landscape

  • Multiple local minima
  • Gradient descent may get stuck
  • Solution depends on starting point
  • No convergence guarantee

Examples: Neural networks, some polynomial functions

💡 Why Convexity Matters

Because our linear regression error function is convex:

  1. We can start at any random point
  2. Follow the opposite direction of the gradient (downhill)
  3. We’re guaranteed to reach the global minimum eventually!

This is why linear regression is so reliable compared to more complex models.
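A quick way to see this reliability is to run gradient descent on a simple convex function from several starting points. A sketch, assuming the toy function $f(x) = (x-3)^2$ (not from the post):

```python
# On a convex function, gradient descent reaches the same global
# minimum no matter where it starts. Toy example: f(x) = (x - 3)^2.
def grad(x):
    # Derivative of (x - 3)^2
    return 2 * (x - 3)

def descend(x, lr=0.1, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

starts = [-100.0, 0.0, 57.0]
print([round(descend(x), 6) for x in starts])  # → [3.0, 3.0, 3.0]
```

All three runs land on $x = 3$, the unique global minimum, which is exactly the guarantee convexity buys us.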

Directional Derivatives: Moving in the Right Direction

Let’s prove mathematically why moving opposite to the gradient minimizes our function.

Directional Derivative

The directional derivative in direction $\mathbf{u}$ (a unit vector) is the derivative of $f(\mathbf{x} + \alpha\mathbf{u})$ with respect to $\alpha$, evaluated at $\alpha = 0$.

It tells us: "How fast does $f$ change if we move in direction $\mathbf{u}$?"

Using the chain rule, we can derive:

$$\begin{aligned} \frac{\partial}{\partial \alpha} f(\mathbf{x} + \alpha \mathbf{u}) &= \sum_i \frac{\partial f(\mathbf{x} + \alpha \mathbf{u})}{\partial (x_i + \alpha u_i)} \frac{\partial (x_i + \alpha u_i)}{\partial \alpha} \\ &= \left(\frac{\partial (\mathbf{x} + \alpha \mathbf{u})}{\partial \alpha}\right)^T \nabla_\mathbf{x} f(\mathbf{x}) \\ &= \mathbf{u}^T \nabla_\mathbf{x} f(\mathbf{x}) \end{aligned}$$

This beautiful result says: the rate of change in direction $\mathbf{u}$ is simply the dot product of the direction and the gradient!

Finding the Best Direction

Now, which direction $\mathbf{u}$ minimizes our function the fastest?

$$\min_{\mathbf{u},\, \|\mathbf{u}\|_2=1} \mathbf{u}^T \nabla_\mathbf{x} f(\mathbf{x}) = \min_{\mathbf{u},\, \|\mathbf{u}\|_2=1} \|\mathbf{u}\|_2 \, \|\nabla_\mathbf{x} f(\mathbf{x})\|_2 \cos \theta$$

where $\theta$ is the angle between $\mathbf{u}$ and the gradient.

Since $\mathbf{u}$ is a unit vector, $\|\mathbf{u}\|_2 = 1$, and since $\|\nabla_\mathbf{x} f(\mathbf{x})\|_2$ does not depend on $\mathbf{u}$, the minimization reduces to choosing the direction alone:

$$\min_{\mathbf{u}} \|\nabla_\mathbf{x} f(\mathbf{x})\|_2 \cos \theta = \min_{\mathbf{u}} \cos \theta$$

💡 The Optimal Direction

The minimum of $\cos \theta$ occurs when $\theta = 180°$ (or $\pi$ radians).

This means: $\mathbf{u}$ should point in the opposite direction of the gradient!

$$\mathbf{u} = -\frac{\nabla_\mathbf{x} f(\mathbf{x})}{\|\nabla_\mathbf{x} f(\mathbf{x})\|_2}$$

Key Insight: To minimize a function, move in the opposite direction of the gradient. This is why the method is called gradient descent—we’re descending down the gradient!
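This claim is easy to verify numerically. For the toy function $f(x, y) = x^2 + y^2$ at the point $(3, 4)$ (an assumed example, not from the post), the directional derivative $\mathbf{u}^T \nabla f$ is most negative when $\mathbf{u}$ points opposite the gradient:

```python
import math

# For f(x, y) = x^2 + y^2 at point (3, 4), the gradient is (6, 8).
# Compare the directional derivative u . grad for several unit
# directions; the negative-gradient direction is steepest descent.
grad = (6.0, 8.0)
norm = math.hypot(*grad)          # |grad| = 10
steepest = (-grad[0] / norm, -grad[1] / norm)

def directional_derivative(u, g):
    # Rate of change of f in direction u: the dot product u . g
    return u[0] * g[0] + u[1] * g[1]

directions = {
    "east": (1.0, 0.0),
    "north": (0.0, 1.0),
    "neg-gradient": steepest,
}
for name, u in directions.items():
    print(name, round(directional_derivative(u, grad), 2))
# The neg-gradient direction gives -10 = -|grad|, the most
# negative rate of change among all unit directions.
```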

The Gradient Descent Algorithm

Now we can formulate the complete algorithm!

Gradient Descent Update Rule

Starting from an arbitrary point $\mathbf{x}$, we update our position using:

$$\mathbf{x}_{\text{new}} = \mathbf{x}_{\text{old}} - \varepsilon \nabla_\mathbf{x} f(\mathbf{x}_{\text{old}})$$

where $\varepsilon$ is called the learning rate.

For linear regression weights $W$:

$$W_{\text{new}} = W_{\text{old}} - \varepsilon \nabla_W f(W_{\text{old}})$$

Gradient Descent in Action

Let’s walk through one iteration:

  1. Start: We're at point $W = [1.0, 2.0]$ with error $f(W) = 10.5$
  2. Compute gradient: $\nabla_W f(W) = [2.5, -1.2]$
  3. Choose learning rate: $\varepsilon = 0.1$
  4. Update: $W_{\text{new}} = [1.0, 2.0] - 0.1 \times [2.5, -1.2] = [0.75, 2.12]$
  5. New error: $f(W_{\text{new}}) = 9.3$ (decreased!)
  6. Repeat until convergence

After many iterations, we reach the optimal weights!
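The full loop can be sketched in a few lines of Python. This is a minimal illustration with made-up data following $y = 1 + 2x$, a fixed learning rate, and a fixed iteration count, not a production implementation:

```python
# Minimal gradient descent for one-variable linear regression,
# minimizing the SSE f(w0, w1) = sum_i (y_i - (w0 + w1*x_i))^2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up data: y = 1 + 2x

def gradient(w0, w1):
    # Partial derivatives of the SSE with respect to w0 and w1
    g0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
    g1 = sum(-2 * x * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
    return g0, g1

w0, w1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    g0, g1 = gradient(w0, w1)
    w0, w1 = w0 - lr * g0, w1 - lr * g1   # step opposite the gradient

print(round(w0, 3), round(w1, 3))  # converges toward w0 = 1, w1 = 2
```

With enough iterations and a safe learning rate, the weights settle on the true intercept and slope.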

The Learning Rate: How Big Should Our Steps Be?

The learning rate $\varepsilon$ controls how far we move in each iteration. Choosing it is crucial!

Learning Rate (ε)

The learning rate determines the step size in gradient descent:

  • Too small → very slow convergence
  • Too large → may overshoot the minimum or diverge
  • Just right → efficient convergence to the minimum

Small Learning Rate

Pros:

  • Stable, guaranteed progress
  • Won’t overshoot minimum
  • More precise convergence

Cons:

  • Very slow convergence
  • Many iterations needed
  • Computationally expensive

Large Learning Rate

Pros:

  • Fast initial progress
  • Fewer iterations needed
  • Computationally efficient

Cons:

  • May overshoot minimum
  • Can diverge or oscillate
  • Less stable
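These trade-offs are easy to see on the toy function $f(x) = x^2$, where the update is $x \leftarrow x - \varepsilon \cdot 2x$. A small sketch comparing three rates (the specific values are illustrative):

```python
# Compare learning rates on f(x) = x^2, whose gradient is 2x.
# Starting from x = 1.0, each step is x <- x - lr * 2x.
def run(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * (2 * x)
    return x

for lr in (0.01, 0.4, 1.1):
    print(lr, run(lr))
# 0.01: still far from the minimum after 20 steps (too slow)
# 0.4 : essentially 0 (converged)
# 1.1 : huge magnitude (diverged; each step overshoots and grows)
```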

Three Common Strategies for Choosing Learning Rate

Strategy 1: Fixed Small Value

Approach: Choose a small constant value like $\varepsilon = 0.001$ or $\varepsilon = 0.01$

Pros:

  • Simple to implement
  • Generally stable
  • Works well for many problems

Cons:

  • May be inefficient (too slow)
  • Requires manual tuning

When to use: When you want simplicity and stability, and computational cost is not critical.

Strategy 2: Exact Line Search

Approach: Find the step size that makes the most progress along the descent direction, i.e., the $\varepsilon$ where $\frac{d}{d\varepsilon} f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x})) = 0$

In other words, solve:

$$\min_\varepsilon f(\mathbf{x} - \varepsilon \nabla_\mathbf{x} f(\mathbf{x}))$$

Pros:

  • Theoretically optimal step size
  • Maximum progress per iteration

Cons:

  • Computationally expensive (solving an optimization problem at each step!)
  • Often not practical

When to use: Rarely in practice, but useful for theoretical understanding.

Strategy 3: Backtracking Line Search (Most Popular)

Approach: Start from a candidate $\varepsilon$ and repeatedly shrink it until $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$ shows a sufficient decrease

Algorithm:

  1. Start with a candidate $\varepsilon$ (e.g., 1.0)
  2. Evaluate $f(\mathbf{x} - \varepsilon \nabla_\mathbf{x}f(\mathbf{x}))$
  3. If it's not decreasing enough, reduce $\varepsilon$ (e.g., $\varepsilon \leftarrow 0.5\varepsilon$)
  4. Repeat until we find acceptable decrease

Pros:

  • Good balance of speed and accuracy
  • Adaptive to function landscape
  • Used in most modern implementations

Cons:

  • More complex than fixed rate
  • Requires multiple function evaluations per iteration

When to use: This is the most popular method in practice!
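A minimal sketch of backtracking line search, assuming the common Armijo sufficient-decrease test and the toy objective $f(x) = x^2$ (the shrink factor 0.5 matches the halving rule above; the constant `c` is a conventional choice, not from the post):

```python
# Backtracking line search (Armijo condition) on f(x) = x^2.
def f(x):
    return x ** 2

def grad(x):
    return 2 * x

def backtracking_step(x, eps=1.0, beta=0.5, c=1e-4):
    g = grad(x)
    # Shrink eps until the decrease is "sufficient":
    # f(x - eps*g) <= f(x) - c * eps * g^2  (Armijo condition)
    while f(x - eps * g) > f(x) - c * eps * g * g:
        eps *= beta
    return eps

x = 5.0
for _ in range(10):
    x = x - backtracking_step(x) * grad(x)
print(abs(x) < 1e-3)  # → True
```

Each iteration pays a few extra function evaluations to pick its own step size, which is exactly the speed/robustness trade-off described above.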

⚠️ Learning Rate Pitfalls

Too Small: Your algorithm will take forever to converge. You might run out of computational budget before reaching the minimum.

Too Large: Your algorithm might:

  • Oscillate around the minimum without reaching it
  • Diverge completely (error increases instead of decreases!)
  • Jump over the optimal solution repeatedly

Finding the right balance is crucial for practical machine learning!

💡 Modern Adaptive Methods

Modern machine learning uses adaptive learning rate methods that automatically adjust $\varepsilon$ during training:

  • Adam: Adapts learning rate per parameter based on gradient history
  • RMSprop: Uses moving average of squared gradients
  • AdaGrad: Adapts based on cumulative gradient information

These methods handle the learning rate problem automatically, making deep learning much more practical!

Putting It All Together


The Complete Gradient Descent Picture:

  1. Problem: Normal equation is computationally expensive ($O(n^3)$)
  2. Solution: Gradient descent—iterative optimization
  3. How it works:
    • Start with random weights
    • Compute gradient (direction of steepest increase)
    • Move opposite to gradient (go downhill)
    • Repeat until convergence
  4. Why it works:
    • Linear regression error is convex
    • Gradient points toward maximum increase
    • Opposite direction points toward minimum
    • Guaranteed to reach global minimum!
  5. Key parameter: Learning rate controls step size

Comparison: Normal Equation vs. Gradient Descent

Normal Equation

$$W = (X^TX)^{-1}X^Ty$$

Advantages:

  • Exact solution in one step
  • No parameters to tune
  • No iterations needed

Disadvantages:

  • $O(n^3)$ complexity
  • Requires matrix inversion
  • Impractical for large $n$
  • Memory intensive

Use when: Small to medium datasets (n < 10,000)

Gradient Descent

$$W_{\text{new}} = W_{\text{old}} - \varepsilon \nabla_W f(W)$$

Advantages:

  • Scales to large datasets
  • Cheap iterations ($O(mn)$ for $m$ samples and $n$ features)
  • Memory efficient
  • Foundation for deep learning

Disadvantages:

  • Requires many iterations
  • Must tune learning rate
  • Approximate solution
  • Slower to converge

Use when: Large datasets (n > 10,000) or real-time learning

Key Takeaways


What We’ve Learned:

  1. Gradient = slope of tangent line at a point; direction of steepest increase
  2. Partial derivatives measure change in one variable while others stay fixed
  3. Gradient vector contains all partial derivatives; points toward maximum increase
  4. Convex functions have one global minimum—perfect for gradient descent
  5. Directional derivative in opposite gradient direction gives steepest descent
  6. Learning rate controls step size; crucial for convergence speed and stability
  7. Gradient descent trades exact solution for computational efficiency

This iterative approach forms the foundation of modern machine learning!

Practice Problems

Level 1:

For the function $f(x) = x^2 - 4x + 5$, compute one iteration of gradient descent starting from $x = 0$ with learning rate $\varepsilon = 0.1$.

Hint: First, compute the derivative $f'(x)$, then evaluate it at $x = 0$, and finally apply the update rule $x_{\text{new}} = x_{\text{old}} - \varepsilon f'(x_{\text{old}})$.

Solution:

Step 1: Compute the derivative: $f'(x) = 2x - 4$

Step 2: Evaluate at $x = 0$: $f'(0) = 2(0) - 4 = -4$

Step 3: Apply the update rule: $x_{\text{new}} = 0 - 0.1 \times (-4) = 0.4$

Step 4: Verify improvement:

  • $f(0) = 0 - 0 + 5 = 5$
  • $f(0.4) = 0.16 - 1.6 + 5 = 3.56$ ✓ (decreased!)

The function value decreased from 5 to 3.56, confirming we moved toward the minimum!
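The worked example can be checked with a few lines of Python:

```python
# Verify the Level 1 worked example: one gradient-descent step on
# f(x) = x^2 - 4x + 5 from x = 0 with learning rate 0.1.
def f(x):
    return x ** 2 - 4 * x + 5

def fprime(x):
    return 2 * x - 4

x_new = 0 - 0.1 * fprime(0)              # 0 - 0.1 * (-4) = 0.4
print(x_new, f(0), round(f(x_new), 2))   # → 0.4 5 3.56
```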

Level 2:

Why is linear regression particularly well-suited for gradient descent compared to other machine learning models? What property makes it reliable?

Hint: Think about the shape of the error function and the number of minima it has.

Solution:

Linear regression is ideal for gradient descent because:

  1. Convex error function: The squared error $\|y - XW\|_2^2$ is convex, meaning it has a bowl shape with a single minimum.

  2. No local minima: Unlike neural networks or polynomial models, there’s no risk of getting stuck in local minima. Any minimum we find is the global minimum.

  3. Smooth gradients: The error function is differentiable everywhere, providing smooth gradients that reliably point toward the minimum.

  4. Guaranteed convergence: With proper learning rate, gradient descent is mathematically guaranteed to converge to the optimal solution.

  5. Well-behaved landscape: No saddle points, plateaus, or other pathological features that plague more complex models.

Contrast with neural networks: Neural networks have highly non-convex error surfaces with many local minima, making optimization much more challenging. This is why deep learning requires sophisticated techniques like batch normalization, careful initialization, and adaptive learning rates!

Level 3:

Implement a conceptual trace of gradient descent for linear regression with one variable. Given data points $(1, 3)$, $(2, 5)$, $(3, 7)$, starting weights $w_0 = 0, w_1 = 1$, and learning rate $\varepsilon = 0.01$, compute two iterations. What happens to the error?

Hint: The gradient of $f(W) = \sum_i (y_i - (w_0 + w_1 x_i))^2$ with respect to $w_0$ is $-2\sum_i (y_i - (w_0 + w_1 x_i))$ and with respect to $w_1$ is $-2\sum_i x_i (y_i - (w_0 + w_1 x_i))$.

Solution:

Initial state: $w_0 = 0, w_1 = 1$

Iteration 1:

Compute predictions:

  • $\hat{y}_1 = 0 + 1(1) = 1$, error: $e_1 = 3 - 1 = 2$
  • $\hat{y}_2 = 0 + 1(2) = 2$, error: $e_2 = 5 - 2 = 3$
  • $\hat{y}_3 = 0 + 1(3) = 3$, error: $e_3 = 7 - 3 = 4$

SSE = $2^2 + 3^2 + 4^2 = 4 + 9 + 16 = 29$

Compute gradients:

$$\frac{\partial f}{\partial w_0} = -2(2 + 3 + 4) = -18, \qquad \frac{\partial f}{\partial w_1} = -2(1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4) = -2(2 + 6 + 12) = -40$$

Update weights:

$$w_0^{\text{new}} = 0 - 0.01(-18) = 0.18, \qquad w_1^{\text{new}} = 1 - 0.01(-40) = 1.40$$

Iteration 2: $w_0 = 0.18, w_1 = 1.40$

Compute predictions:

  • $\hat{y}_1 = 0.18 + 1.40(1) = 1.58$, error: $e_1 = 3 - 1.58 = 1.42$
  • $\hat{y}_2 = 0.18 + 1.40(2) = 2.98$, error: $e_2 = 5 - 2.98 = 2.02$
  • $\hat{y}_3 = 0.18 + 1.40(3) = 4.38$, error: $e_3 = 7 - 4.38 = 2.62$

SSE = $1.42^2 + 2.02^2 + 2.62^2 = 2.02 + 4.08 + 6.86 = 12.96$

Result: Error decreased from 29 → 12.96! Gradient descent is working!

With more iterations, the error would continue decreasing until reaching the optimal solution $w_0 = 1, w_1 = 2$ (the true relationship is $y = 1 + 2x$).
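The trace above can be reproduced programmatically; a small sketch:

```python
# Reproduce the Level 3 trace: gradient descent on the data
# (1,3), (2,5), (3,7) starting from w0 = 0, w1 = 1 with lr = 0.01.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def sse(w0, w1):
    # Sum of squared errors f(W) = sum_i (y_i - (w0 + w1*x_i))^2
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in data)

def step(w0, w1, lr=0.01):
    # Partial derivatives of the SSE, then one gradient-descent update
    g0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in data)
    g1 = sum(-2 * x * (y - (w0 + w1 * x)) for x, y in data)
    return w0 - lr * g0, w1 - lr * g1

w0, w1 = 0.0, 1.0
print(sse(w0, w1))                 # → 29.0
w0, w1 = step(w0, w1)
print(round(w0, 2), round(w1, 2))  # → 0.18 1.4
print(round(sse(w0, w1), 2))       # → 12.96
```

Running more steps of the same loop drives the weights toward $w_0 = 1, w_1 = 2$.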

References

  1. Goodfellow, I., Bengio, Y., & Courville, A. - Deep Learning (Chapter 4: Numerical Computation)
  2. Boyd, S., & Vandenberghe, L. - Convex Optimization
  3. Nocedal, J., & Wright, S. - Numerical Optimization
  4. Veytsman, B. - Convex and Concave Functions Visualization

What’s Next?

Now that you understand gradient descent for linear regression, you’re ready to explore:

  • Stochastic Gradient Descent (SGD): Computing gradients on small batches instead of the entire dataset
  • Advanced Optimizers: Adam, RMSprop, and other adaptive methods
  • Regularization: L1 and L2 penalties to prevent overfitting
  • Non-linear Models: Extending these concepts to neural networks!

Gradient descent is the workhorse of modern machine learning. Master it, and you’ve unlocked the foundation of deep learning!