



Step 1: Import Necessary Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
Step 2: Load Your Data
```python
df = pd.read_csv('your_dataset.csv')  # Use your own dataset here
```
Step 3: Data Cleaning (Same as Before)
```python
# Check for missing values
print(df.isnull().sum())

# Fill missing numerical values with the mean
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].mean())

# Drop rows with missing categorical values
df = df.dropna(subset=['categorical_column'])

# Remove duplicates
df = df.drop_duplicates()
```
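To see what each cleaning step does, here is a minimal sketch on a made-up five-row table. The data is hypothetical and only reuses the placeholder column names from this step; it is not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the placeholder column names from this step
toy = pd.DataFrame({
    "numerical_column": [1.0, np.nan, 3.0, 1.0, 5.0],
    "categorical_column": ["a", "b", None, "a", "c"],
})

# One missing value in each column
print(toy.isnull().sum())

cleaned = toy.copy()
# Mean of the observed values is (1 + 3 + 1 + 5) / 4 = 2.5
cleaned["numerical_column"] = cleaned["numerical_column"].fillna(
    cleaned["numerical_column"].mean()
)
# Row 2 has no category, so it is dropped
cleaned = cleaned.dropna(subset=["categorical_column"])
# Row 3 repeats row 0, so it is dropped
cleaned = cleaned.drop_duplicates()

print(cleaned["numerical_column"].tolist())  # [1.0, 2.5, 5.0]
```

Note that the order of the steps matters: the mean is computed before the row with the missing category is removed, so that row's value still contributes to it.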
Step 4: Classify Data into Numerical and Categorical (Same as Before)
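```python
numerical_columns = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = df.select_dtypes(exclude=[np.number]).columns.tolist()

# If categorical variables exist, one-hot encode them.
# drop_first=True avoids a set of dummy columns that sums to the intercept column added in Step 6.
# dtype=float keeps every column numeric for the NumPy math that follows.
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True, dtype=float)
```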
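One caveat with one-hot encoding is worth checking before fitting a model with an intercept: a full set of dummy columns always sums to a column of ones, which duplicates the intercept and makes the design matrix rank-deficient. The sketch below illustrates this on made-up data; the `size` and `color` columns and the `with_intercept` helper are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are made up for illustration
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "color": ["red", "blue", "red", "green", "blue", "green"],
})

# Full encoding keeps one dummy per category; drop_first removes one of them
full = pd.get_dummies(toy, columns=["color"], dtype=float)
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=float)

def with_intercept(frame):
    """Prepend a column of ones, as a linear model with a bias term would."""
    return np.c_[np.ones(len(frame)), frame.to_numpy(dtype=float)]

full_matrix = with_intercept(full)
reduced_matrix = with_intercept(reduced)

# Fewer independent columns than columns means X^T X cannot be inverted
print(full_matrix.shape[1], np.linalg.matrix_rank(full_matrix))
print(reduced_matrix.shape[1], np.linalg.matrix_rank(reduced_matrix))
```

With the full encoding the rank is lower than the number of columns, while dropping one dummy per category restores full column rank.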
Step 5: Split Data into Train, Validation, and Test Sets
```python
# Define the features (X) and the target (y)
X = df.drop('target_column', axis=1)  # Replace with your target variable
y = df['target_column']

# Convert to float NumPy arrays
X = X.to_numpy(dtype=float)
y = y.to_numpy(dtype=float)

# Split into train+validation and test sets (80% / 20%), taking rows in file order
train_size = int(0.8 * len(X))
X_train_val, X_test = X[:train_size], X[train_size:]
y_train_val, y_test = y[:train_size], y[train_size:]

# Split train+validation into train and validation sets (75% / 25%),
# which gives 60% train, 20% validation, and 20% test overall
val_size = int(0.25 * len(X_train_val))
split = len(X_train_val) - val_size  # explicit index, safe even if val_size is 0
X_train, X_val = X_train_val[:split], X_train_val[split:]
y_train, y_val = y_train_val[:split], y_train_val[split:]
```
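The split above takes rows in their original order, which can bias the sets if the file is sorted (for example by date or by target value). A minimal sketch of a shuffled variant follows; the function name `train_val_test_split`, its parameters, and the fixed seed are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices once, then cut them into train / validation / test."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(X)

    # A seeded generator makes the shuffle reproducible
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)

    n_test = int(test_frac * n)
    n_val = int(val_frac * n)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    # Index X and y with the same indices so rows stay aligned
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )

# Demo data: row i of X_demo is [2i, 2i + 1] and y_demo[i] is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = train_val_test_split(X_demo, y_demo)
print(len(train_X), len(val_X), len(test_X))  # 6 2 2
```

For time-ordered data, shuffling is usually the wrong choice, and the sequential split is the safer default.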
Step 6: Implement Linear Regression from Scratch
Linear Regression Model
The formula for linear regression is:
$$y = Xw + b$$
where $X$ is the feature matrix, $w$ is the weight vector, and $b$ is the bias (intercept).
The goal is to find $w$ and $b$ that minimize the sum of squared residuals.
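As a small worked illustration of that objective, the sketch below evaluates the sum of squared residuals for two candidate weight choices on three made-up points. The helper name `sum_squared_residuals` and the data are assumptions for this example only.

```python
import numpy as np

# Three made-up points generated by y = 2x + 1
X_small = np.array([[1.0], [2.0], [3.0]])
y_small = np.array([3.0, 5.0, 7.0])

def sum_squared_residuals(X, y, w, b):
    """Sum of squared differences between y and the model output Xw + b."""
    residuals = y - (X @ w + b)
    return float(np.sum(residuals ** 2))

# The generating parameters fit the points exactly
ssr_exact = sum_squared_residuals(X_small, y_small, np.array([2.0]), 1.0)
# A worse slope predicts 2, 3, 4, so the residuals are 1, 2, 3
ssr_worse = sum_squared_residuals(X_small, y_small, np.array([1.0]), 1.0)

print(ssr_exact)  # 0.0
print(ssr_worse)  # 14.0
```

Training amounts to searching for the $w$ and $b$ that make this quantity as small as possible.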
Training the Linear Regression Model (Using the Normal Equation)
With a column of ones prepended to $X$, the bias $b$ becomes the first entry of the weight vector, so $w$ and $b$ can be solved together using the normal equation:
$$w = (X^T X)^{-1} X^T y$$
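For readers who want to see where the normal equation comes from, here is a short derivation sketch. The symbols $X_b$ (the feature matrix with a leading column of ones) and $\theta$ (the bias stacked on top of the weights) are names introduced here for convenience, and the last step assumes $X_b^T X_b$ is invertible.

```latex
\begin{aligned}
J(\theta) &= \lVert y - X_b\,\theta \rVert^2 \\
\nabla_\theta J(\theta) &= -2\,X_b^{T}\,(y - X_b\,\theta) = 0 \\
X_b^{T} X_b\,\theta &= X_b^{T} y \\
\theta &= (X_b^{T} X_b)^{-1} X_b^{T} y
\end{aligned}
```

Setting the gradient of the squared-residual objective to zero gives a linear system, and solving that system yields the closed-form solution.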
```python
def linear_regression(X, y):
    # Add a bias term (intercept) as the first column in X
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Compute the optimal weights using the normal equation.
    # pinv is used instead of inv so that collinear columns do not raise an error.
    weights = np.linalg.pinv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    return weights

# Train the model using the training data
weights = linear_regression(X_train, y_train)

# Print the model weights and bias
print("Weights: ", weights[1:])
print("Bias: ", weights[0])
```
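A quick way to gain confidence in a from-scratch fit is to run it on synthetic data where the true parameters are known. The sketch below is self-contained and uses made-up, noise-free data; it solves the normal equation as a linear system with `np.linalg.solve` and compares the result with NumPy's least-squares routine `np.linalg.lstsq`. The variable names are assumptions for this example.

```python
import numpy as np

# Synthetic, noise-free data with known parameters: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(50, 2))
true_w = np.array([2.0, -1.0])
true_b = 3.0
y_syn = X_syn @ true_w + true_b

# Prepend the column of ones so the bias is the first fitted parameter
X_b = np.c_[np.ones(len(X_syn)), X_syn]

# Normal equation, solved as a linear system rather than by explicit inversion
theta_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_syn)

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y_syn, rcond=None)

print(np.allclose(theta_normal, [3.0, 2.0, -1.0]))  # True
print(np.allclose(theta_normal, theta_lstsq))       # True
```

If the recovered bias and weights do not match the known values on a check like this, the bug is in the fitting code rather than in the data.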
Step 7: Make Predictions
To predict, multiply the feature matrix by the learned weights and add the bias term. Prepending the same column of ones used in training handles both in a single matrix product.
```python
def predict(X, weights):
    # Add a bias term (intercept) as the first column in X
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Compute predictions
    return X_b.dot(weights)

# Make predictions on the validation set
y_val_pred = predict(X_val, weights)
```
Step 8: Calculate RMSE
Root Mean Squared Error (RMSE) is the square root of the average squared difference between predicted and actual values. It is expressed in the same units as the target, and lower values indicate a better fit.
```python
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Calculate RMSE on the validation set
validation_rmse = rmse(y_val, y_val_pred)
print(f'Validation RMSE: {validation_rmse}')
```
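An RMSE value is easier to interpret next to a simple reference. The sketch below works through the arithmetic on four made-up values and compares the result with a baseline that always predicts the mean of the targets. The helper name `rmse_demo` and the numbers are assumptions for this illustration.

```python
import numpy as np

def rmse_demo(y_true, y_pred):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.0, 4.0, 6.0, 12.0])

# Errors are 0, 0, 0, -4; squared errors sum to 16; mean is 4; root is 2
model_rmse = rmse_demo(y_true, y_pred)

# Baseline: always predict the mean (5.0); squared errors are 9, 1, 1, 9
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = rmse_demo(y_true, baseline_pred)

print(model_rmse)               # 2.0
print(round(baseline_rmse, 3))  # 2.236
```

A model whose RMSE is not clearly below this kind of baseline is adding little beyond predicting the average.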
Step 9: Graphical Analysis (Prediction vs Actual)
```python
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_val_pred, alpha=0.7)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values (Validation Set)')
plt.show()
```
Step 10: Evaluate on the Test Set
Finally, evaluate the model on the test set:
```python
y_test_pred = predict(X_test, weights)
test_rmse = rmse(y_test, y_test_pred)
print(f'Test RMSE: {test_rmse}')

# Graphical analysis on the test set
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_pred, alpha=0.7)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values (Test Set)')
plt.show()
```
Conclusion:
You now have a simple linear regression model implemented from scratch, along with train-validation-test splitting, prediction, RMSE evaluation, and graphical analysis without relying on Scikit-Learn.
Let me know if you need further help!