
Introduction to Polynomial Regression
- Definition: Polynomial regression models the relationship between the independent variable(s) and the dependent variable as an n-th degree polynomial. It extends linear regression to capture nonlinear trends.
- Use Case: Ideal when data points have a curvilinear relationship (e.g., quadratic, cubic).
- Polynomial regression is a supervised learning approach in which non-linear relationships between dependent and independent variables are modeled with a polynomial function. We can use polynomial regression to analyze non-linear data such as curves, growth rates, and disease outbreaks. For example, the spread rate of an infectious disease (such as COVID-19), which is highly non-linear in time, can be fitted with the help of polynomial regression.
Why Use Polynomial Regression?
Linear regression generally fails to model non-linear data; polynomial regression instead fits such data with the help of polynomial functions. The figure below illustrates the difference for two variables, X and Y: the linear regression plot (a straight line) is a very poor fit for the nonlinear scattered data, while the polynomial regression plot (a curve) approximates the scattered data much better. Hence, we use polynomial regression to fit non-linear data.

Mathematical Formulation
In the case of linear regression, the relationship between the dependent and independent variables can be expressed using a linear equation:

y = b0 + b1x

Here, y is the dependent variable, x is the independent variable, and b0, b1 are the coefficients. For non-linear data, we need to modify the linear equation by including higher-order terms of x. As an illustration, we can introduce a quadratic term into the linear equation to account for a non-linear relationship:

y = b0 + b1x + b2x^2

In general, a polynomial regression model of degree n takes the form

y = b0 + b1x + b2x^2 + … + bnx^n

- b0, b1, …, bn: Coefficients (learned during training).
- When the polynomial is of degree 2, it is called a quadratic model; when the degree of a polynomial is 3, it is called a cubic model, and so on.

Feature Transformation
Polynomial regression uses polynomial features, a type of feature engineering in machine learning models, to capture nonlinear relationships between the dependent and independent variables. With the help of feature transformation, the original features of a dataset are transformed into polynomial features of the required degree, allowing non-linear relationships between variables to be captured.
Suppose we have a dataset with a non-linear relationship between input feature x and output feature y, and we want to model the relationship between x and y with polynomial regression of degree 3. To do this, we create new features such as x^2 and x^3 from the existing feature x. We then add the new features to the original dataset to create a transformed version of x. The transformed version allows us to consider the original features, their higher-order terms, and (with multiple inputs) their interactions. By doing so, the model becomes capable of fitting nonlinear relationships in the data; a minimal sketch of this transformation is shown below.
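For illustration, here is a minimal sketch (using NumPy, with a made-up feature array x) of how a degree-3 transformation expands a single feature into x, x^2, and x^3:
import numpy as np
# Hypothetical 1-D input feature (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
# Manually build the degree-3 polynomial features: [x, x^2, x^3]
X_transformed = np.column_stack([x, x**2, x**3])
print(X_transformed)
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]
#  [ 4. 16. 64.]]
In practice, this expansion is done for you by scikit-learn's PolynomialFeatures, described next.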
Key Concepts in Scikit-Learn
The polynomial regression workflow in scikit-learn is built from the PolynomialFeatures class and the LinearRegression class.
- Data Transformation with PolynomialFeatures: The PolynomialFeatures class in scikit-learn transforms your original features into a new feature matrix consisting of all polynomial combinations of the features up to a given degree.
PolynomialFeatures (Class): Generates polynomial and interaction features.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)
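To inspect what the transformation produces, you can print the generated feature names (get_feature_names_out is available in recent scikit-learn versions; the two-feature input below is made up for illustration):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_demo = np.array([[2.0, 3.0]])  # one sample with two features (illustrative)
poly_demo = PolynomialFeatures(degree=2, include_bias=True)
X_demo_poly = poly_demo.fit_transform(X_demo)
print(poly_demo.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_demo_poly)                        # [[1. 2. 3. 4. 6. 9.]]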
- Fitting the Model with LinearRegression: Once you have transformed your features, you can fit a linear model to the polynomial features. Although we’re dealing with a polynomial relationship, the model remains linear in terms of the coefficients.
LinearRegression (Class): Fits a linear model to the polynomial features.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
Key Parameters
PolynomialFeatures Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| degree | Highest degree of the polynomial terms. | 2 |
| include_bias | Whether to include a bias (intercept) term, i.e., a column of ones. | True |
| interaction_only | If True, only interaction terms (e.g., x1·x2) are included, with no pure power terms. | False |
LinearRegression Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| fit_intercept | Whether to calculate the intercept. | True |
| copy_X | Copy the input data (True) or allow it to be overwritten (False). | True |
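The short sketch below (using a small made-up two-feature input) shows how degree, include_bias, and interaction_only change the number of generated columns:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])  # two samples, two features (illustrative)
full = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X_demo)
print(full.shape)     # (2, 6): 1, x0, x1, x0^2, x0*x1, x1^2
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)
print(no_bias.shape)  # (2, 5): the bias column of ones is dropped
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X_demo)
print(inter.shape)    # (2, 3): x0, x1, x0*x1 only (no squared terms)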
Key Components and Workflow
- Data Transformation with PolynomialFeatures
  - degree: The degree of the polynomial features (e.g., 2, 3, …).
  - interaction_only: If set to True, only interaction features are produced (no power terms).
  - include_bias: If True, adds a bias column (a column of ones) to the output.
- Model Fitting with LinearRegression
  - fit_intercept: If True (default), the model calculates the intercept. If False, the data is assumed to be already centered.
  - normalize (deprecated in newer versions): Previously used to normalize features before regression.
  - fit(X, y): Trains the model using the training data.
  - predict(X): Makes predictions for given input features.
  - coef_: Array of coefficients corresponding to the polynomial terms.
  - intercept_: The intercept term of the model.
- Key Attributes
  - PolynomialFeatures.powers_: Array showing the exponents for each feature combination.
  - LinearRegression.coef_: Coefficients of the polynomial terms.
  - LinearRegression.intercept_: Bias term (β0).
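As a brief illustration of these attributes, here is a minimal sketch on a small made-up quadratic dataset:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Small noiseless dataset with a known quadratic trend: y = 1 + 2x - 0.5x^2
x = np.linspace(0, 5, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
print(poly.powers_)       # exponent of x for each generated column: 1 and 2
model = LinearRegression().fit(x_poly, y)
print(model.coef_)        # approximately [ 2.  -0.5]
print(model.intercept_)   # approximately 1.0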
Assumptions
Polynomial regression inherits many assumptions from linear regression:
- Linearity (in the parameters): Although the model is nonlinear in x, it is linear in the coefficients β.
- Independence: The observations are independent of each other.
- Homoscedasticity: The residuals (errors) have constant variance.
- Normality of residuals: The errors are normally distributed (especially important for inference).
- No multicollinearity: High correlation among predictors can lead to unstable estimates. Note that as you add higher-order polynomial terms, multicollinearity can become an issue.
It is also important to avoid overfitting, especially when using high-degree polynomials. Techniques like cross-validation can help in selecting the appropriate polynomial degree.
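As a sketch of degree selection with cross-validation (using scikit-learn's cross_val_score on an illustrative noisy quadratic dataset), one could compare cross-validated R² scores across candidate degrees:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Illustrative noisy quadratic data
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 1, 120)
# Compare cross-validated R^2 for several candidate degrees
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(degree, scores.mean())
The degree with the highest (and most stable) cross-validated score is usually a reasonable choice.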
Example Code with Case Study
Problem: Predict fuel efficiency based on engine horsepower (nonlinear relationship).
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate Synthetic Data (Quadratic Relationship)
np.random.seed(42)
X = np.linspace(50, 200, 100).reshape(-1, 1)
y = 0.01 * X**2 - 2*X + 150 + np.random.normal(0, 10, (100, 1))
# Create Polynomial Features (Degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Fit Linear Regression Model
model = LinearRegression()
model.fit(X_poly, y)
# Predictions
y_pred = model.predict(X_poly)
# Plot Results
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label='Polynomial Fit')
plt.xlabel('Horsepower')
plt.ylabel('Fuel Efficiency')
plt.legend()
plt.show()
# Model Evaluation
print(f'R-squared: {r2_score(y, y_pred):.2f}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

R-squared: 0.92
Coefficients: [[-2.5 0.01]]
Intercept: [150.3]
Overfitting and Regularization
Overfitting Concerns:
- High-Degree Polynomials: Using a very high degree can make your model overly complex, fitting noise instead of the true underlying trend.
- Visual Diagnosis: Plot the fitted curve against the actual data to check whether the model is too “wiggly” or deviates wildly outside the range of the training data.
Mitigation Strategies:
- Cross-Validation: Use techniques like k-fold cross-validation to select an appropriate degree.
- Regularization: Use Ridge (L2) or Lasso (L1) regression to penalize large coefficients and reduce overfitting.
Example with Pipeline and Ridge Regression:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Create a pipeline to streamline scaling, polynomial transformation, and regularized regression
pipeline = Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=3, include_bias=False)),
('ridge', Ridge(alpha=1.0)) # alpha is the regularization strength
])
pipeline.fit(X, y)
y_pred = pipeline.predict(X)
Practical Code Example with Explanations
Here’s a full code example that demonstrates the complete workflow using scikit-learn. This example includes data generation, transformation, fitting, evaluation, and visualization.
import numpy as np
import matplotlib.pyplot as plt
import operator
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
# 1. Data Generation
np.random.seed(0)
X = 2 - 3 * np.random.normal(0, 1, 100) # Random data
y = X - 2 * (X ** 2) + np.random.normal(-3, 3, 100) # Quadratic relationship with noise
X = X[:, np.newaxis] # Reshape to 2D array
# 2. Define a pipeline for polynomial regression with standard scaling (a Ridge-regularized alternative is shown below)
pipeline = Pipeline([
('scaler', StandardScaler()), # Optional: Helps with numerical stability
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('linreg', LinearRegression())
])
# Alternatively, for regularized regression, use Ridge:
# pipeline = Pipeline([
# ('scaler', StandardScaler()),
# ('poly', PolynomialFeatures(degree=2, include_bias=False)),
# ('ridge', Ridge(alpha=1.0))
# ])
# 3. Fit the model
pipeline.fit(X, y)
y_pred = pipeline.predict(X)
# 4. Model Evaluation
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("Model Evaluation:")
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# 5. Cross-Validation for Model Robustness
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print("Cross-Validated R-squared scores:", cv_scores)
print("Average CV R-squared:", np.mean(cv_scores))
# 6. Plotting the Results
plt.scatter(X, y, color='blue', label='Data points')
# Sorting values for a smooth curve plot
sorted_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(X, y_pred), key=sorted_axis)
X_sorted, y_pred_sorted = zip(*sorted_zip)
plt.plot(X_sorted, y_pred_sorted, color='red', label='Polynomial fit (degree 2)')
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Polynomial Regression Example")
plt.legend()
plt.show()

Model Evaluation:
Mean Squared Error: 9.447441952450275
R-squared: 0.9898873384220381
Cross-Validated R-squared scores: [0.93409403 0.99654519 0.98922503 0.9900232 0.97828593]
Average CV R-squared: 0.9776346761351707
Code Walkthrough:
- Data Generation: Synthetic data is generated with an underlying quadratic relationship plus noise.
- Pipeline Creation: A pipeline is used to combine scaling, polynomial feature generation, and regression into one streamlined process.
- Model Fitting and Prediction: The model is trained, and predictions are generated.
- Evaluation: The model’s performance is evaluated using MSE and R². Cross-validation helps assess model robustness.
- Visualization: A scatter plot of the original data and the regression curve illustrates the polynomial fit.
Advantages, Limitations, and Best Practices
Advantages:
- Flexibility: Able to model nonlinear relationships without changing the basic linear regression framework.
- Interpretability: Coefficients (after transformation) can be interpreted similarly to those of linear regression.
- Ease of Use: Can be implemented quickly with scikit-learn’s tools.
Limitations:
- Overfitting: High-degree polynomials can capture noise rather than the underlying pattern.
- Extrapolation Risk: Predictions outside the range of training data can be highly unreliable.
- Multicollinearity: Adding polynomial terms increases correlation among predictors, potentially destabilizing coefficient estimates.
Best Practices:
- Degree Selection: Use cross-validation to choose the degree that balances bias and variance.
- Regularization: When using higher-degree polynomials, consider Ridge or Lasso regression.
- Feature Scaling: Scale features when using polynomial terms to improve numerical stability.
- Visualization: Always plot your fit to ensure that the model behaves as expected.
Summary
- Definition: Polynomial regression models nonlinear relationships by expanding input features into polynomial terms.
- Workflow: Involves scaling (optional), polynomial feature transformation, model fitting, and evaluation.
- Key Tools:
  - PolynomialFeatures for transformation
  - LinearRegression (or Ridge/Lasso for regularization) for fitting
  - Pipeline to streamline the process
- Assumptions: Inherits linear regression assumptions with extra care for overfitting and multicollinearity.
- Evaluation: Use metrics such as MSE, R², and cross-validation.
- Best Practices: Choose the degree wisely, use regularization as needed, and visualize your results.
This comprehensive note should now cover all the essential aspects of polynomial regression—from theory to practice—and serve as a robust study guide as you progress in your machine learning and data science career.
Example 2:
Here, we will develop another regression model with multiple independent variables. This problem has three independent variables (TV, radio, newspaper) and one dependent variable (sales). Our aim is to find the relationship between the independent and dependent variables. The dataset consists of 200 rows and 4 columns; a few sample rows are shown below for illustration.

Read Data From CSV File Using Pandas
import pandas as pd
df = pd.read_csv("Advertising.csv")
# Separate the independent variables and the dependent variable (column names as described above)
X = df[['TV', 'radio', 'newspaper']]
y = df['sales']
Perform Polynomial Transformation
from sklearn.preprocessing import PolynomialFeatures
polynomial_transformer = PolynomialFeatures(degree=2,include_bias=False)
X_transform = polynomial_transformer.fit_transform(X)
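With three inputs and degree 2, the transformation produces nine columns (three linear terms, three squared terms, and three pairwise interactions). You can verify this, for example, with:
print(X_transform.shape)  # expected: (200, 9)
print(polynomial_transformer.get_feature_names_out())  # e.g. ['TV' 'radio' 'newspaper' 'TV^2' 'TV radio' ...] (recent scikit-learn versions)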
Perform Regression
# Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
# Fit model
model.fit(X_train,y_train)
# Model Prediction
model_predictions = model.predict(X_test)
print(model_predictions)
[13.94856153 19.33480262 12.31928162 16.76286337 7.90210901 6.94143792
20.13372693 17.50092709 10.56889 20.12551788 9.44614537 14.09935417
12.05513493 23.39254049 19.67508393 9.15626258 12.1163732 9.28149557
8.44604007 21.65588129 7.05070331 19.35854208 27.26716369 24.58689346
9.03179421 11.81070232 20.42630125 9.19390639 12.74795186 8.64340674
8.66294151 20.20047377 10.93673817 6.84639129 18.27939359 9.47659449
10.34242145 9.6657038 7.43347915 11.03561332 12.65731013 10.65459946
11.20971496 7.46199023 11.38224982 10.27331262 6.15573251 15.50893362
13.36092889 22.71839277 10.40389682 13.21622701 14.23622207 11.8723677
11.68463616 5.62217738 25.03778913 9.53507734 17.37926571 15.7534364 ]
# Compute MAE, MSE and RMSE
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
MAE = mean_absolute_error(y_test,model_predictions)
MSE = mean_squared_error(y_test,model_predictions)
RMSE = np.sqrt(MSE)
print("MAE:",MAE )
print("MSE:",MSE )
print("RMSE:",RMSE )
MAE: 0.4896798044803838
MSE: 0.4417505510403753
RMSE: 0.6646431757269274
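For comparability with the earlier examples, you could also report R² on the test set (a small addition, not part of the original output above):
from sklearn.metrics import r2_score
print("R2:", r2_score(y_test, model_predictions))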
Limitations
The following are the limitations of the polynomial regression:
- Overfitting: Overfitting happens when the model fits training data well but makes poor predictions on new data. We may often encounter overfitting issues when dealing with higher-order polynomials.
- Poor Extrapolation: The model may perform poorly in predicting values outside the range of the training data. Hence, extrapolation outside that range should be avoided.
- Presumption Of Polynomial Relationship: In polynomial regression, we assume a polynomial relationship between dependent and independent variables. However, the model may perform poorly if the relationship between dependent and independent variables is not polynomial.
- Computational Complexity: Polynomial regression with a high degree of polynomial on large datasets can be computationally expensive.
- Decreased Interpretability: This type of model may become highly complex and difficult to interpret as the degree of polynomial increases.