This guide explores multiple linear regression from first principles, focusing on the mathematical foundations rather than just applying algorithms. While simple linear regression uses one independent variable to predict one target, multiple linear regression extends this to two or more independent variables. With two features, that means fitting a plane instead of a line.
To illustrate the concepts, we use the Fish Market dataset, which includes physical attributes of fish:
- Species: The type of fish (e.g., Bream, Roach, Pike)
- Weight: The weight of the fish in grams (target variable)
- Length1, Length2, Length3: Various length measurements in cm
- Height: The height of the fish in cm
- Width: The diagonal width of the fish body in cm
For simplicity and visualization, this guide uses two independent variables (Height and Width) and a 20-point sample from the full dataset.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 20-point sample data from Fish Market dataset
data = [
    [11.52, 4.02, 242.0],
    [12.48, 4.31, 290.0],
    [12.38, 4.70, 340.0],
    # ... (additional 17 data points)
]

# Create DataFrame
df = pd.DataFrame(data, columns=["Height", "Width", "Weight"])

# Independent variables (Height and Width)
X = df[["Height", "Width"]]

# Target variable (Weight)
y = df["Weight"]

# Fit the model
model = LinearRegression().fit(X, y)

# Extract coefficients
b0 = model.intercept_  # β₀
b1, b2 = model.coef_   # β₁ (Height), β₂ (Width)

print(f"Intercept (β₀): {b0:.4f}")
print(f"Height slope (β₁): {b1:.4f}")
print(f"Width slope (β₂): {b2:.4f}")
```
Results:
- Intercept (β₀): -1005.2810
- Height slope (β₁): 78.1404
- Width slope (β₂): 82.0572
In simple linear regression, we fit a line through 2D data. The equation is:
$$\hat{y} = \beta_0 + \beta_1 x$$
In multiple linear regression with two features, we fit a plane through 3D data:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Where:
- $\hat{y}$: The predicted value of the dependent (target) variable
- $\beta_0$: The intercept (the value of y when all x's are 0)
- $\beta_1$: The coefficient (or slope) for feature $x_1$
- $\beta_2$: The coefficient for feature $x_2$
- $x_1, x_2$: The independent variables (features)
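As a quick illustration, the sketch below plugs the coefficients reported in the results into the plane equation $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. The function name `predict_weight` and the example fish (Height 12.0 cm, Width 4.5 cm) are hypothetical choices made for illustration, not values taken from the dataset.

```python
# Coefficients reported in the results (fitted on the 20-point sample)
B0 = -1005.2810  # intercept β₀
B1 = 78.1404     # Height slope β₁
B2 = 82.0572     # Width slope β₂


def predict_weight(height_cm: float, width_cm: float) -> float:
    """Evaluate the fitted plane ŷ = β₀ + β₁·Height + β₂·Width."""
    return B0 + B1 * height_cm + B2 * width_cm


# Hypothetical fish, chosen only to illustrate the arithmetic
example_prediction = predict_weight(12.0, 4.5)
print(f"Predicted weight: {example_prediction:.2f} g")  # about 301.66 g
```

Because the intercept is large and negative, the plane is only a sensible description within the range of heights and widths present in the sample.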
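For any point $i$ with features $x_{i1}$, $x_{i2}$ and observed target $y_i$:
- Predicted value: $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$
- Residual: $e_i = y_i - \hat{y}_i$

The Sum of Squared Residuals (SSR) measures total prediction error:
$$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$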
Squaring ensures all errors contribute positively and gives more weight to larger deviations.
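The residual and SSR definitions translate directly into a few lines of NumPy. This sketch uses only the three sample rows visible in the data listing together with the reported coefficients, so the SSR it prints covers those three points, not the full 20-point sample. The variable names are illustrative.

```python
import numpy as np

# The three sample rows shown in the data listing: Height, Width, Weight
sample = np.array([
    [11.52, 4.02, 242.0],
    [12.48, 4.31, 290.0],
    [12.38, 4.70, 340.0],
])
height, width, weight = sample[:, 0], sample[:, 1], sample[:, 2]

# Coefficients reported for the 20-point fit
b0, b1, b2 = -1005.2810, 78.1404, 82.0572

predicted = b0 + b1 * height + b2 * width  # ŷ_i for each point
residuals = weight - predicted             # e_i = y_i - ŷ_i
ssr = np.sum(residuals ** 2)               # sum of squared residuals

for e in residuals:
    print(f"residual: {e:+.2f}")
print(f"SSR over these three points: {ssr:.2f}")
```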
The goal is to find the values of $\beta_0$, $\beta_1$, and $\beta_2$ that minimize the SSR.
Because the SSR is a function of several variables, we use partial differentiation: we differentiate with respect to each coefficient in turn while treating the others as constants.
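We define the loss function:
$$L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2$$
At the minimum, all partial derivatives equal zero:
$$\frac{\partial L}{\partial \beta_0} = 0, \quad \frac{\partial L}{\partial \beta_1} = 0, \quad \frac{\partial L}{\partial \beta_2} = 0$$
With respect to β₀:
$$\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}) = 0$$
This simplifies to:
$$\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2$$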
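To see these conditions numerically, the sketch below fits a plane to a small synthetic dataset (made up for illustration, not Fish Market data) with `np.linalg.lstsq`, then evaluates the three partial derivatives of the squared-error loss at the fitted coefficients. Assuming the loss is the sum of squared residuals, all three should be zero up to floating-point error, and the intercept should equal $\bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2$.

```python
import numpy as np

# Synthetic data for illustration only (not from the Fish Market dataset)
rng = np.random.default_rng(0)
x1 = rng.uniform(5.0, 15.0, size=20)
x2 = rng.uniform(2.0, 6.0, size=20)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0.0, 0.5, size=20)

# Least-squares fit: the column of ones carries the intercept
A = np.column_stack([np.ones_like(x1), x1, x2])
(b0, b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)

# Partial derivatives of L = Σ (y - b0 - b1*x1 - b2*x2)²
residuals = y - (b0 + b1 * x1 + b2 * x2)
grad = np.array([
    -2.0 * residuals.sum(),         # ∂L/∂β₀
    -2.0 * (x1 * residuals).sum(),  # ∂L/∂β₁
    -2.0 * (x2 * residuals).sum(),  # ∂L/∂β₂
])
print("gradient at the fit:", grad)

# The β₀ condition rearranges to the intercept identity
b0_from_means = y.mean() - b1 * x1.mean() - b2 * x2.mean()
print("intercept:", b0, "from means:", b0_from_means)
```

The first gradient component being zero is the same statement as "the residuals sum to zero", which is why the fitted plane always passes through the point of means.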
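With respect to β₁ and β₂:
Similar partial differentiation yields two more equations. Substituting the expression for $\beta_0$ and writing $S_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$ and $S_{jy} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})$, they become:
$$S_{11}\beta_1 + S_{12}\beta_2 = S_{1y}, \qquad S_{12}\beta_1 + S_{22}\beta_2 = S_{2y}$$
Solving them together using Cramer's Rule gives:
$$\beta_1 = \frac{S_{22} S_{1y} - S_{12} S_{2y}}{S_{11} S_{22} - S_{12}^2}, \qquad \beta_2 = \frac{S_{11} S_{2y} - S_{12} S_{1y}}{S_{11} S_{22} - S_{12}^2}$$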
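Centering means subtracting the mean from each variable:
$$x'_{i1} = x_{i1} - \bar{x}_1, \quad x'_{i2} = x_{i2} - \bar{x}_2, \quad y'_i = y_i - \bar{y}$$
Centering the data:
- Simplifies formulas by eliminating extra terms
- Ensures the mean of every variable is zero
- Improves numerical stability
- Makes the intercept easier to calculate: $\beta_0 = \bar{y}$ when only the features are centered, and the intercept is 0 when $y$ is centered as well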
For a small dataset with 3 observations:
| i | Original x₁ | Original x₂ | Original y | Centered x'₁ | Centered x'₂ | Centered y' |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 10 | -2 | -2 | -4 |
| 2 | 4 | 5 | 14 | 0 | 0 | 0 |
| 3 | 6 | 7 | 18 | +2 | +2 | +4 |
After centering: $\sum x'_{i1} = 0$, $\sum x'_{i2} = 0$, and $\sum y'_i = 0$.
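The centering step for the three-row table can be reproduced with NumPy. This is a minimal sketch, and the array names are illustrative.

```python
import numpy as np

# Original values from the table: columns are x₁, x₂, y
table = np.array([
    [2.0, 3.0, 10.0],
    [4.0, 5.0, 14.0],
    [6.0, 7.0, 18.0],
])

means = table.mean(axis=0)  # column means: x̄₁, x̄₂, ȳ
centered = table - means    # broadcasting subtracts each column's mean

print("means:", means)
print("centered:\n", centered)
print("column sums after centering:", centered.sum(axis=0))
```

Note that in this toy table $x_2 = x_1 + 1$ for every row, so the two features are perfectly correlated. The table illustrates centering only; a regression on it would not have a unique solution for $\beta_1$ and $\beta_2$.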
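To summarize the procedure: subtract the means from all observations, solve the two centered equations for $\beta_1$ and $\beta_2$, and recover $\beta_0$ from the means. The result matches the coefficients that scikit-learn's `LinearRegression` computes numerically behind the scenes!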
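As a check on the derivation, the sketch below computes the coefficients from centered sums using Cramer's Rule for the 2×2 system and compares them with NumPy's least-squares solver. The data is synthetic and the function name `fit_plane_closed_form` is hypothetical. scikit-learn is left out so the block needs only NumPy, though `LinearRegression` should agree on the same data.

```python
import numpy as np


def fit_plane_closed_form(x1, x2, y):
    """Return (β₀, β₁, β₂) from centered sums and Cramer's Rule."""
    # Center every variable by subtracting its mean
    c1, c2, cy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()

    # Sums of squares and cross-products of the centered data
    s11, s22, s12 = (c1 * c1).sum(), (c2 * c2).sum(), (c1 * c2).sum()
    s1y, s2y = (c1 * cy).sum(), (c2 * cy).sum()

    # Cramer's Rule for the 2x2 system in β₁ and β₂
    det = s11 * s22 - s12 ** 2
    if np.isclose(det, 0.0):
        raise ValueError("features are collinear; no unique solution")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det

    # Intercept recovered from the means
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return b0, b1, b2


# Synthetic data for illustration only (not from the Fish Market dataset)
rng = np.random.default_rng(42)
x1 = rng.uniform(5.0, 15.0, size=20)
x2 = rng.uniform(2.0, 6.0, size=20)
y = -10.0 + 4.0 * x1 + 7.0 * x2 + rng.normal(0.0, 1.0, size=20)

closed_form = np.array(fit_plane_closed_form(x1, x2, y))

# Reference solution from NumPy's least-squares solver
A = np.column_stack([np.ones_like(x1), x1, x2])
reference, *_ = np.linalg.lstsq(A, y, rcond=None)

print("closed form:", closed_form)
print("lstsq      :", reference)
```

The determinant check guards against perfectly collinear features, where the denominator $S_{11} S_{22} - S_{12}^2$ is zero and the formulas break down.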
- Multiple linear regression extends simple linear regression by fitting a plane (or hyperplane) through multi-dimensional data.
- The goal is to find coefficients that minimize the Sum of Squared Residuals.
- Calculus is essential: We use partial differentiation to find the point where the gradient is zero, which is the minimum of the cost function.
- Three unknowns ($\beta_0$, $\beta_1$, $\beta_2$) require solving a system of three equations.
- Data centering simplifies calculations and improves numerical stability.
- The final equation is a direct result of mathematical optimization, not trial-and-error.
Part 2 of this series will cover model evaluation, interpreting coefficients, and handling more than two features. Stay tuned!
- Fish Market Dataset
- Linear Algebra and Calculus fundamentals
- Multiple Linear Regression theory
Author: Simanga Mchunu