Gradient Descent for Linear Regression: The Iterative Solution
In our previous post on linear regression, we discovered the normal equation—a beautiful closed-form solution for finding optimal weights. But there was a catch: computing the matrix inverse is computationally expensive for large datasets. This is where gradient descent comes to the rescue! Let’s explore how this elegant iterative method optimizes our linear regression model efficiently.
The Problem with the Normal Equation
Remember the normal equation we derived? It gives us the exact solution, but at a computational cost that grows rapidly with dataset size.
The optimal weight vector for linear regression is:

$$w^* = (X^T X)^{-1} X^T y$$
This is called the exact solution or closed-form solution because it directly computes the answer.
The matrix inversion has computational complexity of $O(n^3)$, where $n$ is the number of features. For large datasets:
- 10,000 features: $\approx 10^{12}$ (a trillion) operations
- 100,000 features: $\approx 10^{15}$ (a quadrillion) operations
This becomes impractical very quickly!
Instead of computing the exact solution in one expensive step, we can use iterative methods that:
- Take many small, cheap steps toward the solution
- Don’t require matrix inversion
- Scale better to large datasets
- Form the foundation of modern machine learning
The most popular iterative method is gradient descent.
Three Fundamental Questions
Before we dive into gradient descent, let’s address three essential questions:
- What is gradient?
- How do we compute gradient?
- Why use gradient for iterative optimization?
Let’s answer each of these systematically.
What is a Gradient?
A gradient is the slope of the tangent line to a function at a specific point.
For any point $x_0$ on a function $f(x)$:
- We can draw a tangent line at that point
- The slope of this tangent line is the derivative $f'(x_0)$
- This slope tells us how steeply the function is increasing or decreasing at that point
Simple Example: One Variable
Consider the function $f(x) = x^2$ at the point $x = 2$:
- The function value is: $f(2) = 2^2 = 4$
- The derivative is: $f'(x) = 2x$
- At $x = 2$, the gradient is: $f'(2) = 2 \cdot 2 = 4$
This tells us that at $x = 2$, the function is increasing with a slope of 4. If we move slightly to the right (increasing $x$), the function value will increase approximately 4 times as fast as our step size.
Finding the slope of a line given two points is easy. Finding the slope of a tangent line (which only touches the curve at one point) requires calculus—specifically, derivatives!
The derivative $f'(x)$ gives us exactly the slope we need at any point $x$.
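We can sanity-check this numerically: a finite-difference quotient approximates the tangent slope without any calculus. (This is a small illustrative sketch, not part of the original derivation.)

```python
def derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
slope = derivative(f, 2.0)
print(round(slope, 4))  # 4.0 — the slope of x^2 at x = 2
```

The central difference converges to the true derivative as the step `h` shrinks, which is exactly the limit that defines $f'(x)$.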
Computing Gradients: From Derivatives to Partial Derivatives
Single Variable Case
For a function of one variable, computing the gradient is straightforward:
The derivative $f'(x_0)$ represents the slope of $f$ at the point $x_0$.
For linear regression with one variable, our error function is:

$$E(w_0, w_1) = \sum_{i=1}^{m} \big(y_i - (w_0 + w_1 x_i)\big)^2$$

We compute derivatives with respect to each weight: $\frac{\partial E}{\partial w_0}$ and $\frac{\partial E}{\partial w_1}$
Multiple Variables: Partial Derivatives
In linear regression, we typically have multiple weights (one for each feature plus the intercept). How do we compute gradients when there are multiple variables?
A partial derivative $\frac{\partial f}{\partial x_i}$ measures how much $f$ changes when we vary only $x_i$ while keeping all other variables constant.
For our linear regression metric:

$$E(w) = \|y - Xw\|^2 = \sum_{i=1}^{m} (y_i - x_i^T w)^2$$

We need to compute the partial derivative with respect to each weight $w_j$ in $w$.
Partial Derivative Intuition
Imagine you’re standing on a hillside:
- The partial derivative with respect to x tells you how steep the hill is if you walk east-west
- The partial derivative with respect to y tells you how steep the hill is if you walk north-south
Together, these partial derivatives tell you the full “slope landscape” at your position!
The Gradient Vector
The gradient generalizes the concept of derivative to multiple dimensions. It’s a vector containing all partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

The $i$-th element of $\nabla f$ is the partial derivative of $f$ with respect to $x_i$.
At a minimum (or maximum) of a function, all partial derivatives equal zero:

$$\nabla f = 0$$
This is how we found the normal equation—we set the gradient to zero and solved!
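The gradient of the squared error has a clean closed form, $\nabla E(w) = -2 X^T (y - Xw)$, and we can confirm it against finite differences. (The toy data below is an illustrative choice, not from the post.)

```python
import numpy as np

def sse(w, X, y):
    """Sum of squared errors E(w) = ||y - Xw||^2."""
    r = y - X @ w
    return r @ r

def sse_gradient(w, X, y):
    """Analytic gradient: dE/dw = -2 X^T (y - Xw)."""
    return -2 * X.T @ (y - X @ w)

# toy data: 3 samples, 2 weights (intercept column + one feature)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
w = np.zeros(2)

g = sse_gradient(w, X, y)

# check each partial derivative with a central difference
h = 1e-6
for i in range(2):
    e = np.zeros(2); e[i] = h
    numeric = (sse(w + e, X, y) - sse(w - e, X, y)) / (2 * h)
    assert abs(numeric - g[i]) < 1e-3

print(g)  # [-18. -40.]
```

Each entry of the gradient is one partial derivative, exactly as the definition above states.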
Why Use Gradient for Optimization?
Now for the key question: Why does following the gradient help us optimize our function?
Convex Functions: One Global Minimum
Our linear regression metric is a convex function. A convex function has a special property:
It has only one minimum—the global minimum.
This means:
- There are no local minima to get stuck in
- Any minimum we find is the best solution
- Local minimum = Global minimum
Convex Function ✓
Linear Regression Error
- Bowl-shaped surface
- Single minimum point
- Gradient descent always finds it
- Guaranteed convergence
Examples: Sum of squared errors, mean squared error
Non-Convex Function ✗
Complex Error Landscape
- Multiple local minima
- Gradient descent may get stuck
- Solution depends on starting point
- No convergence guarantee
Examples: Neural networks, some polynomial functions
Because our linear regression error function is convex:
- We can start at any random point
- Follow the opposite direction of the gradient (downhill)
- We’re guaranteed to reach the global minimum eventually!
This is why linear regression is so reliable compared to more complex models.
Directional Derivatives: Moving in the Right Direction
Let’s prove mathematically why moving opposite to the gradient minimizes our function.
The directional derivative of $f$ at $x$ in direction $u$ (a unit vector) is the derivative of $f(x + t u)$ with respect to $t$, evaluated at $t = 0$:

$$D_u f(x) = \left. \frac{d}{dt} f(x + t u) \right|_{t=0}$$

It tells us: “How fast does $f$ change if we move from $x$ in direction $u$?”
Using the chain rule, we can derive:

$$D_u f(x) = \nabla f(x) \cdot u$$

This beautiful result says: the rate of change in direction $u$ is simply the dot product of the direction and the gradient!
Finding the Best Direction
Now, which direction minimizes our function the fastest?
$$D_u f = \nabla f \cdot u = \|\nabla f\| \, \|u\| \cos\theta$$

where $\theta$ is the angle between $u$ and the gradient.

Since $u$ is a unit vector, $\|u\| = 1$, and we can simplify:

$$D_u f = \|\nabla f\| \cos\theta$$

The minimum of $\cos\theta$ occurs when $\theta = 180°$ (or $\pi$ radians), where $\cos\theta = -1$.
This means: $u$ should point in the opposite direction of the gradient!
Key Insight: To minimize a function, move in the opposite direction of the gradient. This is why the method is called gradient descent—we’re descending down the gradient!
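A small numeric sketch can confirm both facts: the rate of change in any unit direction $u$ matches $\nabla f \cdot u$, and the most negative rate is achieved by $u = -\nabla f / \|\nabla f\|$. (The function and point here are arbitrary illustrations.)

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2  # an arbitrary convex bowl

def grad(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
g = grad(x)

# the rate of change in any unit direction u equals grad . u
h = 1e-6
for theta in np.linspace(0.0, 2 * np.pi, 8, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])
    numeric = (f(x + h * u) - f(x)) / h
    assert abs(numeric - g @ u) < 1e-3

# the steepest descent direction: u = -grad / |grad|
u_best = -g / np.linalg.norm(g)
print(g @ u_best)  # equals -|grad|, the most negative rate available
```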
The Gradient Descent Algorithm
Now we can formulate the complete algorithm!
Starting from an arbitrary point $x_0$, we update our position using:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$

where $\eta$ is called the learning rate.

For linear regression weights $w$:

$$w_{k+1} = w_k - \eta \, \nabla E(w_k)$$
Gradient Descent in Action
Let’s walk through one iteration:
- Start: We’re at point $w_k$ with error $E(w_k)$
- Compute gradient: $\nabla E(w_k)$
- Choose learning rate: $\eta$
- Update: $w_{k+1} = w_k - \eta \, \nabla E(w_k)$
- New error: $E(w_{k+1}) < E(w_k)$ (decreased!)
- Repeat until convergence
- Repeat until convergence
After many iterations, we reach the optimal weights!
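The whole loop fits in a few lines. Here is a minimal sketch for linear regression (the toy data and hyperparameters are illustrative choices, not from the post):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Minimize E(w) = ||y - Xw||^2 with fixed-step gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w)  # gradient of the squared error
        w = w - lr * grad              # step opposite to the gradient
    return w

# toy data generated from y = 1 + x; first column of X is the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
w = gradient_descent(X, y)
print(np.round(w, 3))  # converges to [1. 1.]
```

Note that the loop never inverts a matrix: each iteration is just a couple of matrix-vector products.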
The Learning Rate: How Big Should Our Steps Be?
The learning rate controls how far we move in each iteration. Choosing it is crucial!
The learning rate determines the step size in gradient descent:
- Too small → very slow convergence
- Too large → may overshoot the minimum or diverge
- Just right → efficient convergence to the minimum
Small Learning Rate
Pros:
- Stable, guaranteed progress
- Won’t overshoot minimum
- More precise convergence
Cons:
- Very slow convergence
- Many iterations needed
- Computationally expensive
Large Learning Rate
Pros:
- Fast initial progress
- Fewer iterations needed
- Computationally efficient
Cons:
- May overshoot minimum
- Can diverge or oscillate
- Less stable
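These failure modes are easy to see on the simple bowl $f(x) = x^2$, whose gradient is $2x$ (the specific rates below are illustrative):

```python
def run_gd(lr, n_iters=50, x0=5.0):
    """Fixed-step gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(n_iters):
        x = x - lr * 2 * x
    return x

print(run_gd(0.01))  # too small: after 50 steps, still nowhere near 0
print(run_gd(0.4))   # reasonable: essentially at the minimum
print(run_gd(1.1))   # too large: each step overshoots and |x| blows up
```

Each update multiplies $x$ by $(1 - 2\eta)$, so convergence requires that factor to have magnitude below 1; at $\eta = 1.1$ it is $-1.2$ and the iterates diverge.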
Three Common Strategies for Choosing Learning Rate
Strategy 1: Fixed Small Value
Approach: Choose a small constant value like $\eta = 0.01$ or $\eta = 0.001$
Pros:
- Simple to implement
- Generally stable
- Works well for many problems
Cons:
- May be inefficient (too slow)
- Requires manual tuning
When to use: When you want simplicity and stability, and computational cost is not critical.
Strategy 2: Exact Line Search
Approach: Find the step size $\eta$ that gives the greatest decrease along the descent direction.

In other words, solve:

$$\eta_k = \arg\min_{\eta \ge 0} f\big(x_k - \eta \, \nabla f(x_k)\big)$$
Pros:
- Theoretically optimal step size
- Maximum progress per iteration
Cons:
- Computationally expensive (solving an optimization problem at each step!)
- Often not practical
When to use: Rarely in practice, but useful for theoretical understanding.
Strategy 3: Backtracking Line Search (Most Popular)
Approach: Try different values of $\eta$ and accept the first one that decreases $f$ sufficiently
Algorithm:
- Start with a candidate $\eta$ (e.g., $\eta = 1.0$)
- Evaluate $f\big(x_k - \eta \, \nabla f(x_k)\big)$
- If it’s not decreasing enough, reduce $\eta$ (e.g., $\eta \leftarrow \eta / 2$)
- Repeat until we find acceptable decrease
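The steps above are usually formalized with the Armijo sufficient-decrease condition. Here is a minimal sketch; the constants `beta = 0.5` and `c = 1e-4` are common textbook defaults, not values from this post:

```python
import numpy as np

def backtracking_lr(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    """Shrink eta until the Armijo sufficient-decrease condition holds:
    f(x - eta*g) <= f(x) - c * eta * ||g||^2."""
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * (g @ g):
        eta *= beta  # step was too ambitious; halve it and retry
    return eta

f = lambda x: x @ x          # f(x) = ||x||^2
grad_f = lambda x: 2 * x
x = np.array([3.0, 4.0])
eta = backtracking_lr(f, grad_f, x)
print(eta)  # 0.5 here: one halving from the initial candidate 1.0
```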
Pros:
- Good balance of speed and accuracy
- Adaptive to function landscape
- Used in most modern implementations
Cons:
- More complex than fixed rate
- Requires multiple function evaluations per iteration
When to use: This is the most popular method in practice!
Too Small: Your algorithm will take forever to converge. You might run out of computational budget before reaching the minimum.
Too Large: Your algorithm might:
- Oscillate around the minimum without reaching it
- Diverge completely (error increases instead of decreases!)
- Jump over the optimal solution repeatedly
Finding the right balance is crucial for practical machine learning!
Modern machine learning uses adaptive learning rate methods that automatically adjust during training:
- Adam: Adapts learning rate per parameter based on gradient history
- RMSprop: Uses moving average of squared gradients
- AdaGrad: Adapts based on cumulative gradient information
These methods handle the learning rate problem automatically, making deep learning much more practical!
Putting It All Together
The Complete Gradient Descent Picture:
- Problem: Normal equation is computationally expensive ($O(n^3)$)
- Solution: Gradient descent—iterative optimization
- How it works:
- Start with random weights
- Compute gradient (direction of steepest increase)
- Move opposite to gradient (go downhill)
- Repeat until convergence
- Why it works:
- Linear regression error is convex
- Gradient points toward maximum increase
- Opposite direction points toward minimum
- Guaranteed to reach global minimum!
- Key parameter: Learning rate controls step size
Comparison: Normal Equation vs. Gradient Descent
Normal Equation
Advantages:
- Exact solution in one step
- No parameters to tune
- No iterations needed
Disadvantages:
- $O(n^3)$ complexity
- Requires matrix inversion
- Impractical for large $n$
- Memory intensive
Use when: Small to medium datasets (n < 10,000)
Gradient Descent
Advantages:
- Scales to large datasets
- Only $O(mn)$ per iteration ($m$ samples, $n$ features)
- Memory efficient
- Foundation for deep learning
Disadvantages:
- Requires many iterations
- Must tune learning rate
- Approximate solution
- Slower to converge
Use when: Large datasets (n > 10,000) or real-time learning
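Both approaches should land on the same answer for the same data. A small sketch comparing them on synthetic data (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + feature
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)  # noisy line

# Normal equation: exact solution in one (expensive) step
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: many cheap steps toward the same answer
w = np.zeros(2)
for _ in range(2000):
    w -= 0.001 * (-2 * X.T @ (y - X @ w))

print(np.round(w_exact, 4))
print(np.round(w, 4))  # the two agree to several decimal places
```

Using `np.linalg.solve` rather than an explicit inverse is the standard trick for the normal equation, but the cubic cost in the number of features remains.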
Key Takeaways
What We’ve Learned:
- Gradient = slope of tangent line at a point; direction of steepest increase
- Partial derivatives measure change in one variable while others stay fixed
- Gradient vector contains all partial derivatives; points toward maximum increase
- Convex functions have one global minimum—perfect for gradient descent
- Directional derivative in opposite gradient direction gives steepest descent
- Learning rate controls step size; crucial for convergence speed and stability
- Gradient descent trades exact solution for computational efficiency
This iterative approach forms the foundation of modern machine learning!
Practice Problems
For the function $f(x) = x^2 + 1$, compute one iteration of gradient descent starting from $x_0 = 2$ with learning rate $\eta = 0.1$.
Click for hint
First, compute the derivative $f'(x)$, then evaluate it at $x_0 = 2$, and finally apply the update rule $x_1 = x_0 - \eta f'(x_0)$.
Click for solution
Step 1: Compute derivative

$$f'(x) = 2x$$

Step 2: Evaluate at $x_0 = 2$

$$f'(2) = 2 \cdot 2 = 4$$

Step 3: Apply update rule

$$x_1 = x_0 - \eta f'(x_0) = 2 - 0.1 \cdot 4 = 1.6$$

Step 4: Verify improvement
- $f(2) = 5$ and $f(1.6) = 1.6^2 + 1 = 3.56$ ✓ (decreased!)

The function value decreased from 5 to 3.56, confirming we moved toward the minimum!
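A couple of lines verify this arithmetic (assuming $f(x) = x^2 + 1$, $x_0 = 2$, and $\eta = 0.1$, which match the stated decrease from 5 to 3.56):

```python
f = lambda x: x ** 2 + 1   # the practice function
df = lambda x: 2 * x       # its derivative
eta, x0 = 0.1, 2.0

x1 = x0 - eta * df(x0)     # one gradient-descent step
print(x1, f(x0), round(f(x1), 2))  # 1.6 5.0 3.56
```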
Why is linear regression particularly well-suited for gradient descent compared to other machine learning models? What property makes it reliable?
Click for hint
Think about the shape of the error function and the number of minima it has.
Click for solution
Linear regression is ideal for gradient descent because:
- Convex error function: The squared error is convex, meaning it has a bowl shape with a single minimum.
- No local minima: Unlike neural networks or polynomial models, there’s no risk of getting stuck in local minima. Any minimum we find is the global minimum.
- Smooth gradients: The error function is differentiable everywhere, providing smooth gradients that reliably point toward the minimum.
- Guaranteed convergence: With a proper learning rate, gradient descent is mathematically guaranteed to converge to the optimal solution.
- Well-behaved landscape: No saddle points, plateaus, or other pathological features that plague more complex models.
Contrast with neural networks: Neural networks have highly non-convex error surfaces with many local minima, making optimization much more challenging. This is why deep learning requires sophisticated techniques like batch normalization, careful initialization, and adaptive learning rates!
Implement a conceptual trace of gradient descent for linear regression with one variable. Given data points $(1, 2)$, $(2, 3)$, $(3, 4)$, starting weights $w_0 = 0$, $w_1 = 0$, and learning rate $\eta = 0.01$, compute two iterations. What happens to the error?
Click for hint
The gradient of SSE with respect to $w_0$ is $-2\sum_i (y_i - \hat{y}_i)$ and with respect to $w_1$ is $-2\sum_i x_i (y_i - \hat{y}_i)$.
Click for solution
Initial state: $w_0 = 0$, $w_1 = 0$

Iteration 1:

Compute predictions ($\hat{y}_i = w_0 + w_1 x_i$):
- $\hat{y}_1 = 0$, error: $2 - 0 = 2$
- $\hat{y}_2 = 0$, error: $3 - 0 = 3$
- $\hat{y}_3 = 0$, error: $4 - 0 = 4$

SSE = $2^2 + 3^2 + 4^2 = 29$

Compute gradients:

$$\frac{\partial \text{SSE}}{\partial w_0} = -2(2 + 3 + 4) = -18, \qquad \frac{\partial \text{SSE}}{\partial w_1} = -2(1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4) = -40$$

Update weights:

$$w_0 = 0 - 0.01 \cdot (-18) = 0.18, \qquad w_1 = 0 - 0.01 \cdot (-40) = 0.40$$

Iteration 2:

Compute predictions:
- $\hat{y}_1 = 0.18 + 0.40 = 0.58$, error: $2 - 0.58 = 1.42$
- $\hat{y}_2 = 0.18 + 0.80 = 0.98$, error: $3 - 0.98 = 2.02$
- $\hat{y}_3 = 0.18 + 1.20 = 1.38$, error: $4 - 1.38 = 2.62$

SSE = $1.42^2 + 2.02^2 + 2.62^2 \approx 12.96$

Result: Error decreased from 29 → 12.96! Gradient descent is working!

With more iterations, the error would continue decreasing until reaching the optimal solution: $w_0 = 1$, $w_1 = 1$ (the true relationship is $y = x + 1$).
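This trace is easy to reproduce in code (assuming data points $(1,2)$, $(2,3)$, $(3,4)$, zero starting weights, and learning rate 0.01, which are consistent with the stated SSE values 29 and 12.96):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept column + x
y = np.array([2.0, 3.0, 4.0])
w = np.zeros(2)   # w0 = w1 = 0
lr = 0.01

for step in range(2):
    residual = y - X @ w
    print("SSE =", round(float(residual @ residual), 4))  # 29.0, then 12.9612
    w = w - lr * (-2 * X.T @ residual)  # gradient-descent update

print("weights:", np.round(w, 4))
```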
References
- Goodfellow, I., Bengio, Y., & Courville, A. - Deep Learning (Chapter 4: Numerical Computation)
- Boyd, S., & Vandenberghe, L. - Convex Optimization
- Nocedal, J., & Wright, S. - Numerical Optimization
- Veytsman, B. - Convex and Concave Functions Visualization
What’s Next?
Now that you understand gradient descent for linear regression, you’re ready to explore:
- Stochastic Gradient Descent (SGD): Computing gradients on small batches instead of the entire dataset
- Advanced Optimizers: Adam, RMSprop, and other adaptive methods
- Regularization: L1 and L2 penalties to prevent overfitting
- Non-linear Models: Extending these concepts to neural networks!
Gradient descent is the workhorse of modern machine learning. Master it, and you’ve unlocked the foundation of deep learning!