Delete Duplicate Rows

(learn how to delete duplicate rows in MySQL by using the DELETE JOIN statement or an immediate table)

How to Delete Duplicate Rows in MySQL

Deleting duplicate rows from a table is a common task to clean up data and ensure data integrity. Here, we will explore three different methods to delete duplicate rows in MySQL, with clear explanations for each step.

Sample Data Setup

First, let's create a sample table and populate it with data for demonstration.

sql
Copy code
DROP TABLE IF EXISTS contacts;

CREATE TABLE contacts (
    id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(255) NOT NULL
);

INSERT INTO contacts (first_name, last_name, email)
VALUES
    ('Carine', 'Schmitt', 'carine.schmitt@verizon.net'),
    ('Jean', 'King', 'jean.king@me.com'),
    ('Peter', 'Ferguson', 'peter.ferguson@google.com'),
    ('Janine', 'Labrune', 'janine.labrune@aol.com'),
    ('Jonas', 'Bergulfsen', 'jonas.bergulfsen@mac.com'),
    ('Janine', 'Labrune', 'janine.labrune@aol.com'),
    ('Susan', 'Nelson', 'susan.nelson@comcast.net'),
    ('Zbyszek', 'Piestrzeniewicz', 'zbyszek.piestrzeniewicz@att.net'),
    ('Roland', 'Keitel', 'roland.keitel@yahoo.com'),
    ('Jean', 'King', 'jean.king@me.com'),
    ('Susan', 'Nelson', 'susan.nelson@comcast.net'),
    ('Roland', 'Keitel', 'roland.keitel@yahoo.com');

View Sample Data

Check the data using this query:

sql
Copy code
SELECT * FROM contacts ORDER BY email;

Identify Duplicate Rows

To find duplicate rows, group by the column that has duplicates (e.g., email) and count how many times each value appears:

sql
Copy code
SELECT email, COUNT(email)
FROM contacts
GROUP BY email
HAVING COUNT(email) > 1;

This query shows which emails are duplicated and how many times they occur.

Methods to Delete Duplicate Rows

1. Using DELETE with JOIN

This approach compares the table with itself. It deletes duplicate rows but keeps the row with the highest id (latest entry):

sql
Copy code
DELETE t1
FROM contacts t1
INNER JOIN contacts t2
WHERE
    t1.id < t2.id AND
    t1.email = t2.email;

Explanation:

t1 and t2 are aliases for the same contacts table.
The INNER JOIN pairs rows with the same email.
The WHERE clause keeps the row with the higher id by deleting rows where t1.id < t2.id.

To keep the row with the lowest id, reverse the comparison:

sql
Copy code
DELETE t1
FROM contacts t1
INNER JOIN contacts t2
WHERE
    t1.id > t2.id AND
    t1.email = t2.email;

2. Using an Intermediate Table

This method involves creating a temporary table to store distinct rows, then replacing the original table.

Steps:

Create a temporary table with the same structure as the original:

sql
Copy code
CREATE TABLE contacts_temp LIKE contacts;

Insert distinct rows into the temporary table:

sql
Copy code
INSERT INTO contacts_temp
SELECT *
FROM contacts
GROUP BY email;

Explanation:

GROUP BY email ensures that only one row per email is retained.

Drop the original table and rename the temporary table:

sql
Copy code
DROP TABLE contacts;
ALTER TABLE contacts_temp RENAME TO contacts;

3. Using the ROW_NUMBER() Function

This approach assigns a unique rank to each row with duplicate values using the ROW_NUMBER() function (available in MySQL 8.0+).

Step 1: View Duplicate Rows Use ROW_NUMBER() to assign a rank to each duplicate group, ordered by the column with duplicates (e.g., email):

sql
Copy code
SELECT
    id,
    email,
    ROW_NUMBER() OVER (
        PARTITION BY email
        ORDER BY id
    ) AS row_num
FROM contacts;

Explanation:

PARTITION BY email: Groups rows by the email column.
ORDER BY id: Assigns a rank based on the id.

Step 2: Find IDs of Duplicate Rows Filter rows where row_num > 1 (duplicates):

sql
Copy code
SELECT id
FROM (
    SELECT
        id,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY id
        ) AS row_num
    FROM contacts
) t
WHERE row_num > 1;

Step 3: Delete Duplicate Rows Use the IDs identified above in a DELETE statement:

sql
Copy code
DELETE FROM contacts
WHERE id IN (
    SELECT id
    FROM (
        SELECT
            id,
            ROW_NUMBER() OVER (
                PARTITION BY email
                ORDER BY id
            ) AS row_num
        FROM contacts
    ) t
    WHERE row_num > 1
);

Summary

Here’s a quick comparison of the methods:

Method	Advantages	Best For
DELETE with JOIN	Fast and simple for small tables	Keeping the highest/lowest `id`
Intermediate Table	Works in all MySQL versions	Large datasets or simpler logic
ROW_NUMBER() Function	Advanced and flexible	MySQL 8.0+ with custom ranking

Use these methods based on your needs, ensuring that your data is clean and duplicates are removed effectively.