Multiple Linear Regression in Python: A Comprehensive Guide
Below is a detailed, step-by-step Python implementation of multiple linear regression, with an explanation of each step along the way.
Step 1: Import Necessary Libraries
We will use NumPy for numerical operations, pandas for data manipulation, and Scikit-learn's LinearRegression to validate our manual results in the final step.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
Step 2: Create the Dataset
Let's create a simple dataset with multiple features manually. In practice, you would typically load data from a CSV or other data source.
# Sample data with multiple features (X1, X2, X3) and a target variable (y)
data = {
'X1': [1, 4, 7, 2],
'X2': [2, 0, 1, 4],
'X3': [3, 5, 6, 4],
'y': [14, 28, 36, 20]
}
# Convert to a DataFrame
df = pd.DataFrame(data)
# Display the dataset
print(df)
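In practice, as noted above, the data would usually come from a file. A minimal sketch of loading the same columns from a CSV (the filename data.csv is hypothetical, and the file is assumed to contain columns X1, X2, X3, and y):
# Hypothetical: load the dataset from a CSV file instead
# (assumes data.csv exists with columns X1, X2, X3, y)
df = pd.read_csv('data.csv')
print(df.head())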
Step 3: Prepare the Design Matrix and Target Vector
We separate the independent variables (features) and the dependent variable (target).
# Extract features (independent variables) and target (dependent variable)
X = df[['X1', 'X2', 'X3']] # Design matrix with features
y = df['y'] # Target vector
# Convert to NumPy arrays for computation
X = np.array(X)
y = np.array(y)
print("Design Matrix X:\n", X)
print("Target Vector y:\n", y)
Step 4: Add the Intercept Term to the Design Matrix
Linear regression needs an intercept term, so we add a column of ones to the design matrix.
# Add a column of ones to X to account for the intercept (bias term)
X_intercept = np.column_stack((np.ones(X.shape[0]), X)) # Add intercept column
print("Design Matrix with Intercept:\n", X_intercept)
Step 5: Compute X^T X, X^T y, and the Coefficients Using the Normal Equation
We will calculate each part of the normal equation step by step.
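For reference, the normal equation expresses the coefficient vector in closed form as beta = (X^T X)^(-1) X^T y, where X is the design matrix with the intercept column and y is the target vector.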
# Step 1: Compute X^T X (Transpose of X multiplied by X)
X_transpose = X_intercept.T
XtX = X_transpose @ X_intercept
print("X^T X:\n", XtX)
# Step 2: Compute X^T y (Transpose of X multiplied by y)
Xty = X_transpose @ y
print("X^T y:\n", Xty)
# Step 3: Compute the Inverse of X^T X
XtX_inv = np.linalg.inv(XtX)
print("Inverse of X^T X:\n", XtX_inv)
# Step 4: Compute the Coefficients (Beta) using the Normal Equation
beta = XtX_inv @ Xty
print("Coefficients (Beta):\n", beta)
Step 6: Interpretation of Coefficients
The output beta contains the intercept and the coefficients for each feature:
- The first value in beta is the intercept (constant term).
- The subsequent values are the coefficients for X1, X2, and X3; each gives the expected change in y for a one-unit increase in that feature, holding the others fixed (see the sketch below).
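A small sketch of how you might unpack and label these values:
# Separate the intercept from the feature coefficients
intercept_manual = beta[0]   # constant term
coef_manual = beta[1:]       # coefficients for X1, X2, X3
print(f"Manual Intercept: {intercept_manual:.4f}")
for name, c in zip(['X1', 'X2', 'X3'], coef_manual):
    print(f"Coefficient for {name}: {c:.4f}")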
Step 7: Using Scikit-learn for Validation
To verify our manual calculations, we can use Scikit-learn's LinearRegression model to fit the data and compare the results.
# Create and fit the Linear Regression model
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
# Get coefficients and intercept from the model
intercept = model.intercept_
coefficients = model.coef_
print(f"Scikit-learn Intercept: {intercept:.2f}")
print(f"Scikit-learn Coefficients: {coefficients}")
Step 8: Comparing Results
- Manual Calculation: The coefficients obtained manually via the normal equation should match Scikit-learn's intercept and coefficients, up to floating-point precision.
- Validation: Agreement between the two confirms our understanding and implementation of multiple linear regression; a programmatic check is sketched below.
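A minimal sketch of this check, reusing the variables computed in the earlier steps:
# Verify that the manual solution matches Scikit-learn's fit
print("Intercept matches:", np.isclose(beta[0], model.intercept_))
print("Coefficients match:", np.allclose(beta[1:], model.coef_))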
Summary of Steps
- Data Preparation: Create or load the dataset, separating features and target.
- Design Matrix Construction: Add the intercept term to account for the bias in the model.
- Normal Equation Computation: Calculate beta = (X^T X)^(-1) X^T y to find the coefficients.
- Model Validation: Compare results with Scikit-learn’s built-in model for consistency.
This guide walks through implementing multiple linear regression manually and validating it with Scikit-learn, solidifying your grasp of the underlying mathematics and its practical application in data science.