Diamond Price Prediction: The 4C Model

Regression & Ensemble Learning | Python · scikit-learn · XGBoost · Streamlit

An end-to-end machine learning project predicting diamond prices using the industry’s 4C model (Cut, Color, Clarity, Carat) and advanced feature engineering.

📘 Project Overview

This project builds a machine learning model to predict diamond prices based on the 4C model (Cut, Color, Clarity, Carat) along with other physical attributes. The dataset contains 53,940 diamonds with features such as carat weight, cut quality, color, clarity, dimensions, and the target variable — price in USD.

The goal was not only to predict price accurately but also to understand which features drive valuation in the diamond industry.

🎯 Objectives

  • Develop a model that accurately predicts diamond prices.
  • Identify key attributes (carat, cut, color, clarity, dimensions) that impact pricing.
  • Compare linear and nonlinear models to capture complex relationships.
  • Deploy the model as an interactive Streamlit app for real-time predictions.

📊 Dataset Summary

Source: Kaggle – Diamond Dataset

  • Rows: 53,940
  • Columns: 10 (9 input features plus the price target; a mix of numerical and categorical)
  • Target: Price (USD)

Key Features:

  • Carat — weight of the diamond
  • Cut — Ideal, Premium, Good, etc.
  • Color — graded D (best) to J (lowest)
  • Clarity — purity of the stone (e.g., IF, VS1, SI2)
  • Depth/Table — proportions of the stone
  • x, y, z — dimensions in mm
  • Price — market value

🔍 Approach

1. Exploratory Data Analysis (EDA)

  • Explored distributions of categorical features (Cut, Color, Clarity).
  • Used chi-square tests to check independence among categorical variables.
  • Visualised numerical features (Carat, Depth, Table, Dimensions).
  • Detected and handled outliers using KDE plots and boxplots.
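The chi-square independence check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the data is synthetic, the lowercase column names `cut` and `clarity` are assumptions, and only a subset of clarity grades is used.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the diamonds table (assumption: the real project
# would load the Kaggle CSV here instead).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], size=n),
    "clarity": rng.choice(["SI2", "SI1", "VS2", "VS1", "IF"], size=n),
})

# Contingency table of observed counts for every cut/clarity pair.
table = pd.crosstab(df["cut"], df["clarity"])

# Chi-square test of independence: a small p-value suggests the two
# grades are not independent of each other.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3f}")
```

Because the synthetic grades are drawn independently, the p-value here carries no meaning for real diamonds; on the actual dataset the same call would be applied to each pair of categorical columns.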

2. Feature Engineering

  • Correlation Analysis: Found Carat has the strongest correlation with Price (0.92).
  • PCA: Reduced dimensions (x, y, z) into principal components for efficiency.
  • Polynomial Features: Captured nonlinear relationships.
  • Encoding: Ordinal encoding for categorical features (Cut, Color, Clarity).
  • Target Transformation: Applied power transform to normalise skewed price distribution.
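One way these feature engineering steps could fit together in scikit-learn is sketched below. It is a sketch under stated assumptions rather than the project's real pipeline: the data is synthetic, the lowercase column names are assumed, the number of principal components and the polynomial degree are illustrative, and the Yeo-Johnson power transform is a guess since the write-up does not name the method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OrdinalEncoder, PolynomialFeatures, PowerTransformer, StandardScaler,
)

# Grade orders from worst to best, so the ordinal codes increase with quality.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Synthetic data (assumption): size grows with the cube root of carat and
# price follows a noisy power law in carat.
rng = np.random.default_rng(42)
n = 500
carat = rng.uniform(0.2, 2.5, size=n)
side = 6.5 * np.cbrt(carat)
df = pd.DataFrame({
    "carat": carat,
    "cut": rng.choice(CUT_ORDER, size=n),
    "color": rng.choice(COLOR_ORDER, size=n),
    "clarity": rng.choice(CLARITY_ORDER, size=n),
    "x": side + rng.normal(0, 0.05, n),
    "y": side + rng.normal(0, 0.05, n),
    "z": 0.62 * side + rng.normal(0, 0.05, n),
})
price = 4000 * carat ** 1.7 * np.exp(rng.normal(0, 0.1, n))

preprocess = ColumnTransformer([
    # Ordinal encoding keeps the natural ranking of each grade.
    ("grades",
     OrdinalEncoder(categories=[CUT_ORDER, COLOR_ORDER, CLARITY_ORDER]),
     ["cut", "color", "clarity"]),
    # x, y, z are strongly correlated, so PCA compresses them to one component.
    ("dims",
     Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=1))]),
     ["x", "y", "z"]),
    # Degree-2 polynomial terms on carat capture some of the curvature.
    ("carat_poly", PolynomialFeatures(degree=2, include_bias=False), ["carat"]),
])

# The target is power-transformed for fitting and mapped back to USD on predict.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("reg", LinearRegression())]),
    transformer=PowerTransformer(),  # Yeo-Johnson by default
)
model.fit(df, price)
pred = model.predict(df)
print(f"In-sample R^2 on synthetic data: {model.score(df, price):.3f}")
```

Wrapping the regressor in `TransformedTargetRegressor` means predictions come back in the original price scale, which keeps metrics such as RMSE and MAE interpretable in dollars.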

3. Model Development

Trained multiple models for comparison:

  • Linear Regression (baseline)
  • Decision Tree Regressor
  • Random Forest Regressor
  • Gradient Boosting & XGBoost
  • KNN Regressor
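A comparison like this could be run with cross-validation, roughly as sketched here. The data is synthetic, the hyperparameters are illustrative, and scikit-learn's `GradientBoostingRegressor` stands in for the boosted models; an `xgboost.XGBRegressor` could be added to the dictionary in the same way if XGBoost is installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem (assumption: stands in for the
# engineered diamond features and price target).
rng = np.random.default_rng(1)
n = 400
X = rng.uniform(0, 1, size=(n, 4))
y = (5000 * X[:, 0] ** 2 + 800 * X[:, 1]
     + 300 * np.sin(6 * X[:, 2]) + rng.normal(0, 50, n))

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
scores = {}
for name, estimator in models.items():
    fold_scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    scores[name] = fold_scores.mean()
    print(f"{name:18s} R^2 = {scores[name]:.3f}")
```

Scoring every model on the same folds keeps the comparison fair; the ranking on this toy data says nothing about the ranking on the real dataset.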

4. Model Evaluation

  • Metrics: R², RMSE, MAE
  • Best Model: XGBoost with R² = 0.982, RMSE = $486, MAE = $298
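For reference, the three metrics can be computed as in the short sketch below. The prices are made-up toy values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual and predicted prices in USD (illustrative values only).
y_true = np.array([500.0, 1200.0, 3400.0, 7800.0, 15000.0])
y_pred = np.array([650.0, 1100.0, 3000.0, 8200.0, 14500.0])

r2 = r2_score(y_true, y_pred)                       # share of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large misses more
mae = mean_absolute_error(y_true, y_pred)           # typical absolute error: 310.0

print(f"R^2 = {r2:.4f}, RMSE = ${rmse:.2f}, MAE = ${mae:.2f}")
```

RMSE is always at least as large as MAE; a wide gap between the two usually points to a small number of large errors, often on the most expensive stones.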

✨ Key Insights

  • Carat Weight: The most influential feature (correlation of 0.92 with price).
  • Cut Quality: Premium and Ideal cuts drive significant price premiums.
  • Color & Clarity: Moderate impact individually, but useful when combined.
  • Dimensions (x, y, z): Improved predictions when reduced via PCA.
  • Nonlinear Effects: Polynomial features helped capture complex pricing patterns.

🛠️ Tools & Tech

  • Languages & Libraries: Python, pandas, NumPy, matplotlib, seaborn, scikit-learn, XGBoost, statsmodels
  • Techniques: PCA, polynomial features, encoding, scaling, grid search
  • Deployment: Streamlit app + Pickle for model storage
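The Pickle storage step might look roughly like this. The tiny linear model, the training values, and the file name `diamond_model.pkl` are all hypothetical; in the project the saved object would be the tuned XGBoost pipeline, and the Streamlit app would load it once and call `predict` on the user's inputs.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on made-up carat/price pairs (assumption).
X = np.array([[0.3], [0.5], [1.0], [1.5], [2.0]])
y = np.array([400.0, 1200.0, 5000.0, 10000.0, 16000.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "diamond_model.pkl"  # hypothetical file name

    # Save the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as the app would do at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.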

✋ What I Did

  • Built a complete ML pipeline: data cleaning, preprocessing, feature scaling, encoding, and model deployment
  • Applied feature engineering (PCA for dimensions, polynomial features for non-linear effects)
  • Compared multiple models: Linear Regression, Random Forest, Gradient Boosting, and XGBoost
  • Deployed an interactive Streamlit web app for real-time price prediction

📂 Model Performance (Best: XGBoost)

R² = 0.982 | RMSE = $486 | MAE = $298

✨ Key Insights

  • Carat weight is the strongest predictor (correlation of 0.92 with price)
  • Cut quality (Premium & Ideal) drives significant value differences
  • Color & clarity moderately affect price but improve accuracy when combined
  • PCA-transformed dimensions (x, y, z) gave a compact measure of stone size and improved price predictions

🚀 Outcome

  • Delivered a production-ready predictive model.
  • Built a Streamlit web app for real-time predictions based on user inputs.
  • Provided actionable insights on what drives diamond pricing — useful for sellers, buyers, and jewelers.

🎓 Learning Outcomes

  • Gained hands-on experience in end-to-end ML pipelines (EDA → Feature Engineering → Model → Deployment).
  • Strengthened understanding of feature importance in regression problems.
  • Practiced model comparison & hyperparameter tuning for performance.
  • Deployed a real-world ML application with user interaction.

🔮 Future Work

  • Add deep learning models for complex non-linear patterns.
  • Incorporate real-time data feeds (market API integration).
  • Improve explainability with SHAP values.
  • Extend the app to A/B test alternative prediction strategies.

🖐️ Key Business Insights

  • Premium Strategy: Focus on stocking diamonds with Ideal cut + VS2 clarity for high-end customers.
  • Budget Strategy: Target price-conscious buyers with Very Good cut + Color E–F combinations.
  • Marketing Angle: Use cut × color and cut × clarity insights in campaigns to educate customers on how attributes influence value.
  • Value Segments: Highlight clarity × color pairings for buyers who balance aesthetics with durability.

🖐️ Key Observations

The heatmap highlights the relationship between cut and clarity, with intensity showing how frequently each combination appears. The numbers within the grid represent the actual count of diamonds.

  • Ideal Cut × VS2 Clarity → 5,071 diamonds — the most popular pairing. This reflects strong buyer preference for a balance between maximum brilliance and acceptable clarity, making it highly desirable in the market.
  • Very Good Cut × VS2 Clarity → 2,591 diamonds — another significant cluster, showing strong demand for high-quality cuts with slight inclusions, at a relatively more accessible price.
  • Fair Cut × VS2 Clarity → 210 diamonds — least popular. The lower cut quality reduces appeal, even when paired with acceptable clarity, showing buyers prioritise brilliance over purity.
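The counts behind a heatmap like this can be produced with a cross-tabulation, as in the sketch below. The six rows are toy data and the lowercase column names are assumptions; on the real dataset the same calls would yield the full cut × clarity grid.

```python
import pandas as pd

# Toy rows standing in for the diamonds table (illustrative only).
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Very Good", "Fair", "Ideal", "Very Good"],
    "clarity": ["VS2", "VS2", "VS2", "VS2", "SI1", "SI1"],
})

# Grid of counts: one row per cut, one column per clarity grade.
counts = pd.crosstab(df["cut"], df["clarity"])

# Most frequent (cut, clarity) combination.
pair_counts = df.value_counts(["cut", "clarity"])
top_pair = pair_counts.idxmax()

print(counts)
print("Most common pairing:", top_pair)  # ('Ideal', 'VS2') for this toy data

# The grid could then be drawn with, for example,
# seaborn.heatmap(counts, annot=True, fmt="d").
```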

🖐️ Insight

  • Price Outliers: The dataset contains several price outliers, with one diamond recorded at $18,823. This requires further investigation to confirm whether it is a genuine data point or a potential error.
  • Carat Outliers: A few extreme values were found in the Carat attribute, including one diamond weighing over 4 carats. These could distort model performance if not treated appropriately.
  • Depth & Table Stability: Unlike price and carat, the Depth and Table attributes show minimal outliers, suggesting that most diamonds remain within normal physical proportions.
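Boxplots flag points beyond 1.5 times the interquartile range, and the same rule can be applied directly to a column. The helper `iqr_outliers` and the carat values below are hypothetical, shown only to illustrate the rule; the write-up does not say how the outliers were ultimately treated.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking points outside the boxplot whiskers."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up carat weights with one extreme stone at the end.
carat = pd.Series([0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2, 4.5])
mask = iqr_outliers(carat)

print(carat[mask])  # only the 4.5 carat stone is flagged
```

Flagged rows still need a judgment call: a very large or very expensive diamond may be a genuine data point rather than an error, so dropping it outright is not always appropriate.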

🖐️ Insight

  • Carat vs Price: The scatter plot reveals a strong positive relationship between Carat and Price, with larger diamonds consistently more expensive.
  • Color Impact: Higher color grades correlate with higher prices, highlighting how subtle differences in color drive significant valuation changes.
  • Cut & Clarity Influence: Both Cut and Clarity show clear price impacts, with premium grades commanding higher market value.

🏁 Conclusion

This project demonstrates how machine learning and financial analytics can bring clarity to complex valuation problems. By analysing over 53,000 diamonds through the lens of the 4Cs (Carat, Cut, Color, Clarity), I was able to:

  • Build a highly accurate predictive model (R² = 0.982) using XGBoost.
  • Uncover key insights such as the dominant role of carat weight, the premium impact of cut quality, and the combined influence of color and clarity.
  • Translate technical findings into business-relevant strategies — guiding inventory decisions, pricing strategies, and customer education.
  • Deliver a production-ready Streamlit web app that allows users to interact with the model in real time.

Ultimately, this project highlights my ability to bridge data science with business outcomes — turning raw datasets into actionable insights that support smarter decision-making.

© Teslim Adeyanju 2025. All Rights Reserved.