🥰

~Inspiration~

Step 1: Import Necessary Libraries

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Load Your Data

python
df = pd.read_csv('your_dataset.csv')  # Use your own dataset here

Step 3: Data Cleaning (Same as Before)

python
# Checking for missing values
print(df.isnull().sum())

# Fill missing numerical values with the mean
df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].mean())

# Drop rows with missing categorical values
df.dropna(subset=['categorical_column'], inplace=True)

# Remove duplicates
df = df.drop_duplicates()
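
To make the effect of these cleaning steps concrete, here is a minimal sketch on a toy DataFrame. The column names `age` and `city` and all values are made up for illustration; they are not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 40.0, 20.0],
    "city": ["Oslo", "Rome", None, "Lima", "Oslo"],
})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()

# Fill the missing age with the mean of the observed ages: (20 + 40 + 40 + 20) / 4
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Drop the row whose city is missing, then drop the repeated (20, Oslo) row
toy = toy.dropna(subset=["city"]).drop_duplicates()

print(missing_before)
print(toy)
```

Note that the mean is computed before any rows are dropped, so the order of the steps affects the imputed value.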

Step 4: Classify Data into Numerical and Categorical (Same as Before)

python
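numerical_columns = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = df.select_dtypes(exclude=[np.number]).columns.tolist()

# If categorical variables exist, one-hot encode them.
# drop_first=True drops one level per variable so the dummy columns are not
# collinear with the intercept added in Step 6; dtype=float keeps every
# column numeric so the matrix algebra later works.
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True, dtype=float)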
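
As an illustration of what the classification and encoding produce, the sketch below runs them on a tiny frame. The columns `size` and `color` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0],
    "color": ["red", "blue", "red"],
})

numerical = toy.select_dtypes(include=[np.number]).columns.tolist()
categorical = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# Each category level becomes its own 0/1 column named <column>_<level>
encoded = pd.get_dummies(toy, columns=categorical, dtype=float)
print(encoded.columns.tolist())

# With every level kept, the dummy columns of one variable sum to 1 in each row
row_sums = encoded[["color_blue", "color_red"]].sum(axis=1)
print(row_sums.tolist())
```

Because those dummy columns always sum to 1, they duplicate the information in an intercept column of ones. Passing `drop_first=True` to `pd.get_dummies` is one common way to avoid that redundancy.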

Step 5: Split Data into Train, Validation, and Test Sets

python
# Define the features (X) and the target (y)
X = df.drop('target_column', axis=1)  # Replace with your target variable
y = df['target_column']

# Convert to float NumPy arrays
X = X.to_numpy(dtype=float)
y = y.to_numpy(dtype=float)

# Shuffle once so the splits do not depend on row order
# (skip this for time-ordered data; the seed 42 is arbitrary)
rng = np.random.default_rng(42)
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Split into train+validation and test sets (80% train+validation, 20% test)
train_size = int(0.8 * len(X))
X_train_val, X_test = X[:train_size], X[train_size:]
y_train_val, y_test = y[:train_size], y[train_size:]

# Hold out 25% of train+validation for validation,
# which gives 60% train / 20% validation / 20% test overall
val_size = int(0.25 * len(X_train_val))
n_train = len(X_train_val) - val_size
X_train, X_val = X_train_val[:n_train], X_train_val[n_train:]
y_train, y_val = y_train_val[:n_train], y_train_val[n_train:]
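
The same split can be wrapped in a reusable function, which makes it easy to check the resulting sizes. This is a sketch under stated assumptions: the function name `train_val_test_split`, its parameters, and the shuffling step are choices made for this example, not part of the steps above.

```python
import numpy as np

def train_val_test_split(X, y, test_frac=0.2, val_frac=0.25, seed=0):
    """Shuffle rows, then split into train / validation / test arrays.

    val_frac is the share of the non-test rows used for validation.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]  # the same permutation keeps rows and targets aligned

    n_test = int(test_frac * len(X))
    n_val = int(val_frac * (len(X) - n_test))
    n_train = len(X) - n_test - n_val

    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
    X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]
    return X_train, X_val, X_test, y_train, y_val, y_test

# Made-up data: row i of X_demo is [2i, 2i + 1] and its target is i
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

parts = train_val_test_split(X_demo, y_demo)
print([len(p) for p in parts])
```

Slicing by explicit start and end indices, rather than with a negative index, also behaves sensibly when the validation set happens to be empty.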

Step 6: Implement Linear Regression from Scratch

Linear Regression Model

The formula for linear regression is:

$$y = Xw + b$$

Where:

  • X is the feature matrix
  • w is the weight vector
  • b is the bias (intercept)

The goal is to find w and b that minimize the sum of squared residuals.

Training the Linear Regression Model (Using the Normal Equation)

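The weights w and bias b can be solved for in closed form using the normal equation. Here X is augmented with a leading column of ones, so the first entry of the solution is the bias b and the remaining entries are the weights:

$$w = (X^T X)^{-1} X^T y$$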
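
For reference, the normal equation follows from setting the gradient of the squared-error loss to zero. A short sketch of that derivation, with X taken to include the column of ones so that the bias is part of w:

```latex
\begin{aligned}
L(w) &= \lVert y - Xw \rVert^2 = (y - Xw)^T (y - Xw) \\
\nabla_w L(w) &= -2\, X^T (y - Xw) = 0 \\
X^T X\, w &= X^T y \\
w &= (X^T X)^{-1} X^T y \quad \text{(when } X^T X \text{ is invertible)}
\end{aligned}
```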

python
def linear_regression(X, y):
    # Add a bias term (intercept) as the first column in X
    X_b = np.c_[np.ones((X.shape[0], 1)), X]

    # Compute the optimal weights using the normal equation.
    # pinv (the pseudo-inverse) is used instead of inv so that collinear
    # columns, such as one-hot dummies, do not raise a singular-matrix error.
    weights = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y

    return weights

# Train the model using the training data
weights = linear_regression(X_train, y_train)

# Print the model weights and bias
print("Weights: ", weights[1:])
print("Bias: ", weights[0])
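
One way to gain confidence in the fit is to run it on synthetic data where the true weights are known. The sketch below is an illustration only: `fit_normal_equation` is a stand-in for the function above (using `np.linalg.pinv`, an assumption made here for robustness), and the true weights and bias are made up.

```python
import numpy as np

def fit_normal_equation(X, y):
    # Prepend a column of ones, then solve the normal equation
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y

# Synthetic, noiseless data generated from known parameters
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
true_w = np.array([3.0, -2.0])
true_b = 5.0
y_syn = X_syn @ true_w + true_b

fitted = fit_normal_equation(X_syn, y_syn)

# Cross-check against NumPy's least-squares solver
X_syn_b = np.c_[np.ones((X_syn.shape[0], 1)), X_syn]
reference, *_ = np.linalg.lstsq(X_syn_b, y_syn, rcond=None)

print("Fitted bias and weights:", fitted)
print("lstsq bias and weights: ", reference)
```

With no noise in the targets, the fitted vector should match the true bias followed by the true weights up to floating-point error.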

Step 7: Make Predictions

A prediction is the feature matrix multiplied by the learned weights, plus the bias term. Prepending a column of ones to X, as in training, lets a single matrix product do both.

python
def predict(X, weights):
    # Add a bias term (intercept) as the first column in X
    X_b = np.c_[np.ones((X.shape[0], 1)), X]

    # Compute predictions
    return X_b.dot(weights)

# Make predictions on the validation set
y_val_pred = predict(X_val, weights)
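
A small hand-checkable case may help here. In this sketch the weight vector and inputs are invented so the arithmetic can be followed by eye; `predict_demo` mirrors the prediction function above.

```python
import numpy as np

def predict_demo(X, weights):
    # Prepend a column of ones so weights[0] acts as the bias
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return X_b @ weights

weights_demo = np.array([1.0, 2.0, 3.0])  # bias 1, then weights 2 and 3
X_demo = np.array([[1.0, 1.0],
                   [0.0, 2.0]])

preds = predict_demo(X_demo, weights_demo)
# Row 1: 1 + 2*1 + 3*1 = 6
# Row 2: 1 + 2*0 + 3*2 = 7
print(preds)
```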

Step 8: Calculate RMSE

Root Mean Squared Error (RMSE) is the square root of the average squared difference between predicted and actual values. It is expressed in the same units as the target, and lower is better.

python
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Calculate RMSE on the validation set
validation_rmse = rmse(y_val, y_val_pred)
print(f'Validation RMSE: {validation_rmse}')
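
To see the arithmetic behind RMSE, here is a worked example on three made-up values. The function repeats the definition above so the snippet runs on its own.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Errors: 1, 0, -2  ->  squares: 1, 0, 4  ->  mean: 5/3  ->  square root
value = rmse(y_true_demo, y_pred_demo)
print(value)

# Perfect predictions give an RMSE of exactly zero
perfect = rmse(y_true_demo, y_true_demo)
print(perfect)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.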

Step 9: Graphical Analysis (Prediction vs Actual)

python
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_val_pred, alpha=0.7)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values (Validation Set)')
plt.show()

Step 10: Evaluate on the Test Set

Finally, evaluate the model on the test set:

python
y_test_pred = predict(X_test, weights)
test_rmse = rmse(y_test, y_test_pred)
print(f'Test RMSE: {test_rmse}')

# Graphical analysis on the test set
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_pred, alpha=0.7)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values (Test Set)')
plt.show()

Conclusion:

You now have a simple linear regression model implemented from scratch, along with train-validation-test splitting, prediction, RMSE evaluation, and graphical analysis, all without relying on scikit-learn.

Let me know if you need further help!