Multiple Linear Regression in Python: A Comprehensive Guide
Below is a detailed, step-by-step Python implementation of multiple linear regression, with an explanation of each step along the way.
Step 1: Import Necessary Libraries
We will use NumPy for numerical operations, pandas for data manipulation, and Scikit-learn's LinearRegression to validate our manual results in the final step.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
Step 2: Create the Dataset
Let's create a simple dataset with multiple features manually. In practice, you would typically load data from a CSV or other data source.
# Sample data with multiple features (X1, X2, X3) and a target variable (y)
data = {
'X1': [1, 4, 7, 2],
'X2': [2, 0, 1, 4],
'X3': [3, 5, 6, 4],
'y': [14, 28, 36, 20]
}
# Convert to a DataFrame
df = pd.DataFrame(data)
# Display the dataset
print(df)
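In practice, as noted above, the data would usually come from a file. A minimal sketch of loading the same columns from a CSV (the filename data.csv is hypothetical, and the file is assumed to contain columns X1, X2, X3, and y):
# Hypothetical: load the dataset from a CSV file instead
# (assumes data.csv exists with columns X1, X2, X3, y)
df = pd.read_csv('data.csv')
print(df.head())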
Step 3: Prepare the Design Matrix and Target Vector
We separate the independent variables (features) and the dependent variable (target).
# Extract features (independent variables) and target (dependent variable)
X = df[['X1', 'X2', 'X3']] # Design matrix with features
y = df['y'] # Target vector
# Convert to NumPy arrays for computation
X = np.array(X)
y = np.array(y)
print("Design Matrix X:\n", X)
print("Target Vector y:\n", y)
Step 4: Add the Intercept Term to the Design Matrix
Linear regression needs an intercept term, so we add a column of ones to the design matrix.
# Add a column of ones to X to account for the intercept (bias term)
X_intercept = np.column_stack((np.ones(X.shape[0]), X)) # Add intercept column
print("Design Matrix with Intercept:\n", X_intercept)
Step 5: Compute X^T X, X^T y, and the Coefficients Using the Normal Equation
We will calculate each part of the normal equation step by step.
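For reference, the normal equation expresses the coefficient vector in closed form as beta = (X^T X)^(-1) X^T y, where X is the design matrix with the intercept column and y is the target vector.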
# Step 1: Compute X^T X (Transpose of X multiplied by X)
X_transpose = X_intercept.T
XtX = X_transpose @ X_intercept
print("X^T X:\n", XtX)
# Step 2: Compute X^T y (Transpose of X multiplied by y)
Xty = X_transpose @ y
print("X^T y:\n", Xty)
# Step 3: Compute the Inverse of X^T X
XtX_inv = np.linalg.inv(XtX)
print("Inverse of X^T X:\n", XtX_inv)
# Step 4: Compute the Coefficients (Beta) using the Normal Equation
beta = XtX_inv @ Xty
print("Coefficients (Beta):\n", beta)
Step 6: Interpretation of Coefficients
The output beta contains the intercept and the coefficients for each feature:
- The first value in beta is the intercept (constant term).
- The subsequent values are the coefficients for X1, X2, and X3; each gives the expected change in y for a one-unit increase in that feature, holding the others fixed (see the sketch below).
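A small sketch of how you might unpack and label these values:
# Separate the intercept from the feature coefficients
intercept_manual = beta[0]   # constant term
coef_manual = beta[1:]       # coefficients for X1, X2, X3
print(f"Manual Intercept: {intercept_manual:.4f}")
for name, c in zip(['X1', 'X2', 'X3'], coef_manual):
    print(f"Coefficient for {name}: {c:.4f}")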
Step 7: Using Scikit-learn for Validation
To verify our manual calculations, we can use Scikit-learn's LinearRegression model to fit the data and compare the results.
# Create and fit the Linear Regression model
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
# Get coefficients and intercept from the model
intercept = model.intercept_
coefficients = model.coef_
print(f"Scikit-learn Intercept: {intercept:.2f}")
print(f"Scikit-learn Coefficients: {coefficients}")
Step 8: Comparing Results
- Manual Calculation: The coefficients obtained manually via the normal equation should match Scikit-learn's intercept and coefficients, up to floating-point precision.
- Validation: Agreement between the two confirms our understanding and implementation of multiple linear regression; a programmatic check is sketched below.
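A minimal sketch of this check, reusing the variables computed in the earlier steps:
# Verify that the manual solution matches Scikit-learn's fit
print("Intercept matches:", np.isclose(beta[0], model.intercept_))
print("Coefficients match:", np.allclose(beta[1:], model.coef_))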
Summary of Steps
- Data Preparation: Create or load the dataset, separating features and target.
- Design Matrix Construction: Add the intercept term to account for the bias in the model.
- Normal Equation Computation: Calculate beta = (X^T X)^(-1) X^T y to find the coefficients.
- Model Validation: Compare results with Scikit-learn’s built-in model for consistency.
This guide walks through implementing multiple linear regression manually and validating it with Scikit-learn, solidifying your grasp of the underlying mathematics and its practical application in data science.