📑

Machine_Learning_Guide

Created

Sep 1, 2025 10:05 PM

Date

December 7, 2024

Materials

Oct 6, 2025 10:44 PM

Machine Learning Study Guide

Learning machine learning isn’t just about running models; it’s about following a structured workflow that supports accuracy, reliability, and actionable insights.

To make my approach clear, I’ve organised this study guide into five key stages that I apply across projects:

Part 1: DATA PROCESSING

Data processing is essential to ensure that the dataset is clean, well-structured, and optimized for model building. This includes loading data, cleaning, handling missing values, engineering features, and setting up train-test splits.

‣

1.1 Loading the Data

‣

1.2 Data Overview

‣

1.3 Initial Feature Engineering

‣

1.4 Categorical Encoding
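
As a rough illustration of the steps named above (loading, handling missing values, encoding a categorical column), here is a minimal, dependency-free sketch. The inline CSV, the column names `age`, `city`, and `price`, and the choice of median imputation are all assumptions made for the example; a real project would more likely use a library such as pandas. Note that a statistic like the median should ideally be computed on the training rows only, for the same data-leakage reason that applies to scaling.

```python
import csv
import io
import statistics

# Hypothetical inline dataset; a real project would read from a file instead.
RAW_CSV = """age,city,price
34,Paris,250
,London,180
29,Paris,
41,Berlin,310
"""

def load_rows(text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))

def impute_median(rows, column):
    """Replace empty values in a numeric column with the column median."""
    present = [float(r[column]) for r in rows if r[column] != ""]
    median = statistics.median(present)
    for r in rows:
        r[column] = float(r[column]) if r[column] != "" else median
    return rows

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for c in categories:
            r[f"{column}_{c}"] = 1 if value == c else 0
    return rows

rows = load_rows(RAW_CSV)
rows = impute_median(rows, "age")
rows = impute_median(rows, "price")
rows = one_hot(rows, "city")
```

After these calls, each row holds numeric `age` and `price` values plus one indicator column per city (`city_Berlin`, `city_London`, `city_Paris`).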

Part 2: EXPLORATORY DATA ANALYSIS

The purpose of EDA is to gain insights into the data, identify patterns, detect anomalies, and guide feature engineering choices that may enhance model performance. The exploratory work is divided into two parts, numerical data and categorical data, followed by analysis of the target variable and feature selection:

‣

2.1 EDA Focused on Numerical Data

‣

2.2 EDA Focused on Categorical Data

‣

2.3 Target Variable Analysis

‣

2.4 Feature Selection Techniques in Machine Learning

‣

2.5 Feature Selection (delete later)
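
The split between numerical and categorical EDA can be sketched with the standard library alone. The sample columns `incomes` and `segments` are hypothetical, and the 1.5 × IQR rule is just one common rule of thumb for flagging anomalies, not the only option.

```python
import statistics
from collections import Counter

# Hypothetical sample columns, used only for illustration.
incomes = [32, 35, 31, 38, 36, 34, 120]          # numerical feature
segments = ["a", "b", "a", "c", "a", "b", "a"]   # categorical feature

def describe_numeric(values):
    """Basic summary statistics for one numerical column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def category_shares(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}

summary = describe_numeric(incomes)
outliers = iqr_outliers(incomes)
shares = category_shares(segments)
```

A large gap between mean and median in `summary` hints at skew, and very uneven values in `shares` hint at rare categories that may need grouping before encoding.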

Part 3: SETTING UP THE VALIDATION FRAMEWORK

Validation is a crucial step in machine learning that checks whether the model performs well on unseen data. A structured validation framework helps detect overfitting and supports model tuning by splitting the dataset into training, validation, and test sets.

Another key preprocessing step before training is feature scaling, which standardizes numerical features for better model performance. However, scaling must be applied correctly to avoid data leakage.

💡

Note: split the data before applying feature scaling, so that the scaling statistics are learned from the training set only.
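
The ordering described in this note can be sketched as follows. The helper names `split_train_test`, `fit_standardizer`, and `standardize` are hypothetical and written with the standard library only; in practice, scikit-learn's `train_test_split` and `StandardScaler` are the usual tools for the same pattern.

```python
import random
import statistics

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardizer(values):
    """Learn mean and standard deviation from the training values only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

data = [float(x) for x in range(1, 21)]      # hypothetical single feature
train, test = split_train_test(data)         # 1) split first
mean, std = fit_standardizer(train)          # 2) fit on train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # 3) test reuses train statistics
```

Fitting the scaler on the full dataset would let information about the test rows influence the training features, which is the leakage this note warns about.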

‣

3.1 Method 1: Splitting the Data

‣

3.1 Method 2: Splitting the Data

‣

3.2 Feature Scaling in Machine Learning

‣

3.3 Cross-Validation
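
For the cross-validation step, a minimal sketch of how k-fold index splitting might work is shown below. The generator `k_fold_indices` is a hypothetical helper; it uses contiguous folds without shuffling, so rows should usually be shuffled beforehand unless their order matters (as with time series).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(f"train={train_idx} val={val_idx}")
```

Each sample lands in the validation set exactly once, and any scaling should be refit on the training indices of each fold rather than once on the whole dataset.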

Part 4: BUILDING AND TRAINING THE MODEL

This section focuses on selecting, initializing, training, and validating the model with an appropriate validation technique.

‣

4.1 Choosing a Model

‣

4.2 Initializing the Model

‣

4.3 Training the Model

‣

4.4 Hyperparameter Tuning

‣

4.5 Making Predictions (Inference)

‣

4.6 Saving and Loading Trained Models

‣

4.7 Deployment and Inference

‣

4.8 Comparing (y_pred vs y_train) and (y_pred vs y_test)
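
The train, predict, and save/load steps listed above can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions: `fit_line` and `predict` are hypothetical helpers implementing one-feature least squares, the data is made up, and the "model" is a plain dict of parameters. With a library estimator the same `pickle` calls generally apply to the fitted object.

```python
import pickle

def fit_line(xs, ys):
    """One-feature least squares: returns parameters for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(params, xs):
    """Apply the fitted parameters to new inputs."""
    return [params["slope"] * x + params["intercept"] for x in xs]

# Hypothetical training data that follows y = 2x + 1 exactly.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

model = fit_line(x_train, y_train)   # training
y_pred = predict(model, [5.0])       # inference on a new input

# Serialize to bytes and restore; pickle.dump / pickle.load do the same with files.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.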

Part 5: EVALUATING THE MODEL

Evaluation metrics are essential for assessing the performance of machine learning models. Each metric provides a unique perspective on the model’s accuracy, reliability, and generalizability. Here’s a breakdown:

‣

5.1 Regression Evaluation Metrics

‣

5.2 Classification Evaluation Metrics

‣

5.3 Clustering Evaluation Metrics

‣

5.4 Time-Series Evaluation Metrics
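
To make the first two metric families concrete, here is a small sketch of two regression metrics (MAE, RMSE) and three classification metrics (precision, recall, F1) written from their standard definitions. The function names and sample values are illustrative assumptions; libraries such as scikit-learn provide tested equivalents.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical predictions, for illustration only.
reg_mae = mae([3, 5, 7], [2, 5, 9])
reg_rmse = rmse([3, 5, 7], [2, 5, 9])
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

Comparing the same metric on the training and test predictions is a simple way to spot overfitting: a much better training score suggests the model has memorized rather than generalized.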

💡

Normalization: Scaling data values to a specific range, typically [0, 1]. This puts features on a comparable scale, so that features with larger raw values do not dominate scale-sensitive models such as distance-based or gradient-based methods.
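
Min-max normalization maps each value x to (x - min) / (max - min). A minimal sketch follows; the helper names `fit_min_max` and `min_max_scale` and the sample values are assumptions for illustration. As with standardization, the minimum and maximum should come from the training data.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def min_max_scale(values, lo, hi):
    """Map values to [0, 1] relative to a previously learned (lo, hi)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / span for v in values]

train_values = [10.0, 20.0, 30.0, 50.0]  # hypothetical feature
lo, hi = fit_min_max(train_values)
train_scaled = min_max_scale(train_values, lo, hi)
new_scaled = min_max_scale([70.0], lo, hi)  # unseen value beyond the training max
```

Values outside the training range, such as the `70.0` above, fall outside [0, 1] after scaling, which is expected behaviour rather than an error.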

Transforming Skewed Distributions

💡

Skewed distributions can be transformed with a logarithmic or other transformation (for example, square root or Box-Cox) to reduce the influence of extreme values, which can improve model interpretability and performance.
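
As a small illustration, the sketch below applies `math.log1p` (the logarithm of 1 + x) to a made-up right-skewed feature and compares skewness before and after. The `skewness` helper and the `incomes` values are assumptions for the example, and `log1p` is only suitable here because the values are non-negative.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

# Hypothetical right-skewed feature with a long upper tail.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = [math.log1p(v) for v in incomes]

skew_before = skewness(incomes)
skew_after = skewness(logged)
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# math.expm1 inverts log1p, e.g. to report predictions on the original scale.
recovered = [math.expm1(v) for v in logged]
```

If the target variable is transformed this way, predictions need to be mapped back with the inverse transform before computing metrics on the original scale.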

Books

Data Science from Scratch

Comprehensive Guide: Saving a Model Using pickle in Python