Delete Duplicate Emails SQL Challenge - PostgreSQL Solution

1. Introduction

In this blog post, we will tackle a common database challenge in PostgreSQL: deleting duplicate email addresses while preserving the entry with the smallest id. This operation is essential for maintaining data quality and integrity.

Problem

The goal is to write a PostgreSQL query to delete all duplicate email entries in the Person table, ensuring that only the entry with the smallest id for each email is kept.

Database Table:

Person Table:
id email
1 john@example.com
2 bob@example.com
3 john@example.com

3. PostgreSQL Solution

DELETE FROM Person
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) as rn
        FROM Person
    ) t
    WHERE t.rn > 1
);

Output:

id email
1 john@example.com
2 bob@example.com

Explanation:

1. The query employs a subquery that uses the ROW_NUMBER() window function, partitioning by email and ordering by id.

2. This window function assigns a row number to each email based on its id, within each partition of duplicate emails.

3. In the outer query, the DELETE statement removes rows where the row number is greater than 1, which means all duplicate emails except the one with the smallest id.

4. As a result, only one unique email with the smallest id remains for each set of duplicate emails.


Comments