How to Remove Duplicate Records in SQL

Do you need to use SQL to remove duplicates in your tables? Learn how to write an SQL query to remove duplicate data in this article.

The Problem - Removing Duplicates in SQL

Let’s say you have a table with some data in it. You’ve found out that there is some duplicate data in this table. And you want to get rid of the duplicates.

SQL Remove Duplicates

The way you define duplicate data could be dependant on your data.

  • Is it a duplicate if all of the columns are the same?
  • Is it a duplicate if all columns except for the primary key are the same?
  • Is it a duplicate if only a few columns are the same?

In any case, identifying and removing duplicates is possible in SQL.

There are several ways to do it. I’ll explain each of these methods.

We'll see how it can work on Oracle, SQL Server, MySQL, and PostgreSQL.

The sample of data has 1,220 records in a single table, which looks like this:

customer_id first_name last_name address
1 'Bamby' 'Applin' 29 Lighthouse Bay Lane
2 'Blinny' 'Kirman' 96024 Jenifer Center
3 'Rodie' 'Dils' 62729 Debs Point
4 'Abdel' 'Grisedale' 711 Lawn Park
5 'Fernanda' 'Lyokhin' 54 Dorton Place

 

Let’s say that a record is a duplicate if it contains the same first_name and last_name values.

Let’s take a look at the different ways to remove duplicates in SQL.

 

Summary of Methods

Here's a summary of the different methods and which databases they work on.

Method Oracle SQL Server MySQL PostgreSQL
ROW_NUMBER Y Y Y Y
Delete with Join Y Y Y Y
MIN or MAX Y Y Y Y
DENSE_RANK Y Y N Y
Correlated Subquery with MIN or MAX Y N N N
Subquery with ANY Y N N N

 

A Note on Query Times

In each of these examples, I explain the code I am using, what it does, and delete data using the DELETE statement.

However, any query times should only be taken as a guide, and may be different from the performance you get:

  • Your table structure, unique data business rules, and number of records will be different
  • Your database version and specs will be different
  • Several other factors may impact performance such as query cache or background processes on my computer.

So, I guess my point is that you might not get the same performance as I do. But, if you try one method and it has poor performance for your data, try one of the other methods.

 

Method 1 - ROW_NUMBER Analytic Function

Database: Oracle, MySQL, SQL Server, PostgreSQL

The first method I’ll show you is using an analytic function called ROW_NUMBER. It’s been recommended in several places such as StackOverflow questions and an AskTOM thread.

It involves several steps:

  1. Using a subquery to find each ROWID (which is a unique number given to each row in an Oracle table) and the ROW_NUMBER function to find a sequential number for that row, grouped by the fields you specify as unique.
  2. Find the ROWID values that are identified as duplicates.
  3. Delete rows that match these ROWIDs.

The query looks like this:

1DELETE FROM table a
2WHERE a.ROWID IN
3 (SELECT ROWID FROM
4   (SELECT
5    ROWID,
6    ROW_NUMBER() OVER
7    (PARTITION BY unique_columns ORDER BY ROWID) dup
8    FROM table)
9  WHERE dup > 1);

The ROW_NUMBER function here is used as an analytic function. It uses the PARTITION BY to create partitions or groups based on the fields I’ve mentioned in the PARTITION BY (which are first_name, last_name, and created_date). I’ve then ordered them by the ROWID.

This means that the first occurrence of those fields will get the number of 1. The second occurrence of all of those fields will get a number of 2, and so on.

The ROWIDs are then returned to the DELETE statement at the top, which only deletes records where the ROW_NUMBER function (which has an alias of “dup” in this example) are greater than one. (The AskTOM thread uses “WHERE dup <> 1” but it achieves the same thing).

It’s a good idea to check the records you’re deleting first, by running a SELECT statement using this criteria. For the purposes of these examples, I’ll check the COUNT of the records about to be deleted, by replacing the DELETE with a SELECT COUNT(*).

 1SELECT COUNT(*)
 2FROM customer a
 3WHERE a.ROWID IN
 4  (SELECT ROWID FROM
 5     (SELECT
 6      ROWID,
 7      ROW_NUMBER() OVER
 8        (PARTITION BY first_name, last_name, address
 9         ORDER BY ROWID) dup
10    FROM customer)
11    WHERE dup > 1);

Result:

1220

It shows there are 220 duplicate records.

In MySQL, or other databases, your query may look like this:

 1SELECT COUNT(*)
 2FROM customer a
 3WHERE a.customer_id IN
 4  (SELECT customer_id FROM
 5     (SELECT
 6      customer_id,
 7      ROW_NUMBER() OVER
 8        (PARTITION BY first_name, last_name, address
 9         ORDER BY customer_id) dup
10    FROM customer)
11    WHERE dup > 1);

If I run this command as a DELETE statement:

1DELETE FROM customer a
2WHERE a.ROWID IN
3  (SELECT ROWID FROM
4    (SELECT
5    ROWID,
6    ROW_NUMBER() OVER
7      (PARTITION BY first_name, last_name, address) dup
8    FROM customer)
9  WHERE dup > 1);

Result:

1220 records deleted.

This is the method I would use if I needed to delete duplicate records from a table. It uses a ROWID filter which is usually a fast way to access a table.

Method 2: Delete with JOIN

Database: Oracle, SQL Server, MySQL, PostgreSQL

This is a commonly recommended method for MySQL and works for all other databases. It involves joining the same table to itself, specifying the matching columns, and deleting all but one duplicate row.

Here's the sample query:

1DELETE t1
2FROM customer t1
3INNER JOIN customer t2
4ON t1.customer_id < t2.customer_id
5AND t1.first_name = t2.first_name
6AND t1.last_name = t2.last_name
7AND t1.address = t2.address;

We specify the customer table twice. We specify the first_name, last_name, and address fields, which identify the duplicate row.

We also specify customer_id < customer_id, which will delete all customers except for the one with the highest ID.

If we run this query, we get this result.

Result:

1220 rows deleted.

You could change the query to use > instead of < if you want to keep the rows with the lowest ID.

 

Method 3 - MIN or MAX Function

Database: Oracle, SQL Server, MySQL, PostgreSQL

This method uses either the MIN or MAX function to find duplicates inside a subquery.

It’s similar to earlier examples, but it uses fewer subqueries.

This method only works if you have a unique value for each row. If there are some duplicated values (e.g. if the ID for the row is duplicated), then it won't work.

This example here uses the MIN function, but you can use the MAX function to get the same result.

The syntax looks like this:

1DELETE FROM your_table
2WHERE rowid not in (
3  SELECT MIN(rowid)
4  FROM your_table
5  GROUP BY column1, column2, column3
6);

The columns in the GROUP BY statement are those that are checked for duplicates. In the example I’ve been using, this is the first_name, last_name, and created_date.

If I run it as a SELECT COUNT(*) first, I can see the number of rows impacted.

1SELECT COUNT(*)
2FROM customer
3WHERE rowid NOT IN (
4  SELECT MIN(rowid)
5  FROM customer
6  GROUP BY first_name, last_name
7);

Result:

1220 rows.

Now, I can run the DELETE statement.

1DELETE FROM customer
2WHERE rowid NOT IN (
3  SELECT MIN(rowid)
4  FROM customer
5  GROUP BY first_name, last_name
6);

Result:

1220 rows deleted.

Note: This type of query can run in MySQL, but it shows incorrect results. This is because MySQL does not have an equivalent of ROWID, which is a unique value for each row.

 

Method 4 - DENSE_RANK

Database: Oracle, SQL Server, PostgreSQL

Not: MySQL

This method is similar to method 1 but uses the DENSE_RANK method.

In fact, it’s almost exactly the same.

The syntax looks like this:

 1DELETE FROM table a
 2WHERE a.ROWID IN (
 3    SELECT rid FROM (
 4        SELECT ROWID rid,
 5        DENSE_RANK() OVER(
 6          PARTITION BY unique_columns
 7          ORDER BY ROWID) dup
 8        FROM table
 9    )
10    WHERE dup > 1
11);

It uses the same concept as ROW_NUMBER, but uses the DENSE_RANK function. There are some differences between using these functions, but in this case, they give the same output.

Let’s find the COUNT(*) of records first.

 1SELECT COUNT(*) FROM customer 
 2WHERE ROWID IN (
 3   SELECT rid FROM (
 4      SELECT ROWID rid,
 5       DENSE_RANK() OVER(
 6         PARTITION BY first_name, last_name
 7         ORDER BY ROWID) dup
 8       FROM customer
 9    )
10    WHERE dup > 1
11);

Result:

1220 rows.

This shows the same number of records as methods 1 and 2.

Now, let’s try the DELETE statement.

 1DELETE FROM customer 
 2WHERE ROWID IN (
 3    SELECT rid FROM (
 4      SELECT ROWID rid,
 5      DENSE_RANK() OVER(
 6        PARTITION BY first_name, last_name
 7        ORDER BY ROWID) dup
 8      FROM customer
 9    )
10    WHERE dup > 1
11);

Result:

1220 rows deleted.

This deletes all of the duplicate records we found.

In this method, you can also use a RANK function instead of DENSE_RANK. It should show the same results.

 

Method 5 - Correlated Subquery with MIN or MAX

Database: Oracle

Not: MySQL, SQL Server, PostgreSQL

The next method I’ll share is similar to method 4 but uses a correlated subquery to match on columns.

1DELETE
2FROM table a
3WHERE ROWID NOT IN (
4  SELECT MAX(ROWID)
5  FROM table b
6  WHERE a.col1 = b.col1
7  AND a.col2 = b.col2
8  AND a.col3 = b.col3
9);

It’s similar to the earlier query, but instead of using a GROUP BY clause, we use a WHERE clause. This WHERE clause joins the table inside the subquery to the table outside the subquery. The tables are joined on the columns that match.

Let’s see how many records are found.

1SELECT COUNT(*)
2FROM customer a
3WHERE ROWID NOT IN (
4  SELECT MAX(ROWID)
5  FROM customer b
6  WHERE a.first_name = b.first_name
7  AND a.last_name = b.last_name
8);

Result:

1220 rows.

Now, let’s run this as a DELETE query:

1DELETE
2FROM customer a
3WHERE ROWID NOT IN (
4  SELECT MAX(ROWID)
5  FROM customer b
6  WHERE a.first_name = b.first_name
7  AND a.last_name = b.last_name
8);

Result:

1220 rows deleted.

Note: This type of query can run in MySQL, but it shows incorrect results. This is because MySQL does not have an equivalent of ROWID, which is a unique value for each row.

 

Method 6: Use a Subquery with ANY

Database: Oracle

Not: MySQL, SQL Server, PostgreSQL

The next method we’ll look at is using a subquery to identify and delete duplicate data.

I’ll show you the query first, then explain how it works.

1DELETE FROM tablename a
2WHERE a.rowid > ANY (
3  SELECT b.rowid
4  FROM tablename b
5  WHERE a.column1 = b.column1
6);

Now, there are a few things to note about this query.

Both the subquery and the outer query should be selecting from the same table, but use different table aliases. In this example, I have used a and b.

The WHERE clause in the subquery is used to define how you identify a duplicate. You write your WHERE clause that joins on every column that you consider when you identify duplicates. This could be one field, two fields, or every field in the table.

The rowid pseudocolumn is used to identify rows in the subquery because this is guaranteed to be unique.

The WHERE clause of the outer query uses a > ANY condition to check for duplicates. It will delete any row that has a rowid greater than at least one other row. This ensures that all but one of the rows that match your conditions is met, therefore removing all duplicates.

So, how does this method perform?

Using this query on our sample data looks like this:

1DELETE
2FROM customer a
3WHERE a.rowid > ANY (
4  SELECT b.rowid
5  FROM customer b
6  WHERE a.first_name = b.first_name
7  AND a.last_name = b.last_name
8  AND a.address = b.address
9);

If we run this as a SELECT COUNT(*), we can see the number of records impacted:

1SELECT COUNT(*)
2FROM customer a
3WHERE a.rowid > ANY (
4  SELECT b.rowid
5  FROM customer b
6  WHERE a.first_name = b.first_name
7  AND a.last_name = b.last_name
8  AND a.address = b.address
9);

Result:

COUNT(*)
220

This query has found 220 duplicate values. It’s the same as Example 1.

Now, let’s run it as a DELETE query.

1DELETE
2FROM customer a
3WHERE a.rowid > ANY (
4  SELECT b.rowid
5  FROM customer b
6  WHERE a.first_name = b.first_name
7  AND a.last_name = b.last_name
8  AND a.address = b.address
9);

The result is:

1220 rows deleted.

 

Other Methods You Might Come Across

I’ve only found a few different ways for deleting duplicate data in SQL. I’ve listed a few more below, which don’t actually work as well as the others, if at all.

I’ve added these so you can see if specific methods work, for example if someone at work mentions a method they have tried. Or, if you’ve found a method that works on SQL Server and want to try it on an Oracle database.

So, here are the other methods I found to delete duplicate data.

Just to be clear, these don’t work, and they are just here for reference.

 

Method 7: Use an Intermediate Table

This method is a little different from the others, as we use several steps. We create a table to store the de-duplicated data, then update the main table with it.

Here are the steps:

  1. Create a new table that has the same structure as the original table.
  2. Insert the distinct (de-duplicated) rows from the original table to the new table
  3. Drop the original table
  4. Rename the new table to the original table

This is how we do it.

First, we create a new table with the same structure:

1CREATE TABLE customer_int AS
2SELECT customer_id, first_name, last_name, address
3FROM customer
4WHERE 1=0;

Next, we populate the new customer_int table with unique records using SELECT and a GROUP BY:

1SELECT customer_id, first_name, last_name, address
2FROM customer
3GROUP BY first_name, last_name, address;

However, this generates an error message:

Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test.customer.customer_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

It causes an error because of an incorrect use of GROUP BY.

This may have worked in earlier versions of MySQL, but not in recent versions. Other databases may show similar errors.

There may be a way to populate an intermediate table using a different method, but it may just be easier to delete the records using another method in this guide.

 

Method 8 - DELETE WHERE IN

This method uses a WHERE clause with specified columns that are checked against a subquery.

It seems very similar to method 4, with the addition of the three columns instead of using ROWID.

The method looks like this:

1DELETE FROM table WHERE
2(col1, col2, col3) IN (
3  SELECT MAX(col1), col2, col3
4  FROM table
5  GROUP BY col2, col3
6);

Here's the query on our database:

1DELETE FROM customer
2WHERE (customer_id, first_name, last_name, address) IN (
3SELECT MAX(customer_id), first_name, last_name, address
4FROM customer
5GROUP BY first_name, last_name, address
6);

In Oracle, I ran this query for a few minutes but then stopped it without getting a result. I imagine it runs slowly because of the WHERE clause on the three fields, rather than the ROWID mentioned earlier.

In MySQL, I ran this and got an error:

Error Code: 1093. You can't specify target table 'customer' for update in FROM clause

This is because you can't specify the table in the FROM clause and in the WHERE clause.

This query may work for you if one of the fields you’re looking at is an ID or some kind of number (to get the MAX value of), and if you have indexes on these fields.

But, for me, this query did not work.

 

Method 9 - DELETE with LEFT OUTER JOIN

This method uses a LEFT OUTER JOIN from the table that has the records to delete, to a subquery that contains the duplicate records.

 1DELETE (
 2    SELECT d.*, d.rowid
 3    FROM customer d
 4    LEFT OUTER JOIN (
 5        SELECT MIN(RowId) AS MinRowId,
 6        first_name, last_name, address
 7        FROM customer
 8        GROUP BY first_name, last_name, address
 9        ) KeepRows ON d.RowId = KeepRows.MinRowId
10    WHERE KeepRows.MinRowId IS NULL;

Let’s check the COUNT first:

1SELECT COUNT(*)
2FROM customer
3LEFT OUTER JOIN (
4    SELECT MIN(RowId) AS MinRowId,
5    first_name, last_name, address
6    FROM customer
7    GROUP BY first_name, last_name, address
8    ) KeepRows ON all_customer_dupes.RowId = KeepRows.MinRowId
9WHERE KeepRows.MinRowId IS NULL;
COUNT(*)
176

It shows an incorrect number of duplicate records. We could run this as a DELETE command on SQL Server and the rows will be deleted.

If we are on Oracle, we can try to run this as a DELETE command.

 1DELETE (
 2    SELECT d.*, d.rowid
 3    FROM customer d
 4    LEFT OUTER JOIN (
 5        SELECT MIN(RowId) AS MinRowId,
 6        first_name, last_name, address
 7        FROM customer
 8        GROUP BY first_name, last_name, address
 9        ) KeepRows ON d.RowId = KeepRows.MinRowId
10    WHERE KeepRows.MinRowId IS NULL

Result:

1SQL Error: ORA-01752: cannot delete from view without exactly one key-preserved table 01752. 00000 -  "cannot delete from view without exactly one key-preserved table" \*Cause:    The deleted table had - no key-preserved tables, - more than one key-preserved table, or - the key-preserved table was an unmerged view. \*Action:   Redefine the view or delete it from the underlying base tables.

This query did not work for me and showed this error on Oracle.

 

Method 10 - DELETE with WITH Clause

This method is also based on a concept that works in SQL Server called CTE or Common Table Expressions.

The query looks like this:

1WITH cte AS
2(SELECT ROW_NUMBER() 
3    OVER (PARTITION BY first_name, last_name, address
4    ORDER BY (customer_id)) AS RN
5FROM customer)
6DELETE FROM cte
7WHERE rn > 1;

In SQL Server, you can run this query and rows are deleted.

1220 rows deleted.

In Oracle, if we run this query, we get an error message:

ORA-00928: missing SELECT keyword 00928. 00000 -  "missing SELECT keyword" *Cause: *Action:

This is because, in Oracle, the WITH clause needs to have a SELECT query that follows it, and not a DELETE query. It doesn’t work with DELETE statements.

 

Method 11 - Two Tables Joined

This method uses two tables joined to each other on the matching columns, and a ROWID check to only find those that are unique. I’ve specified the table to delete after the DELETE keyword so the database will know which table I want to delete from.

1DELETE (
2  SELECT t1.rowid
3  FROM all_customer_dupes t1
4  INNER JOIN all_customer_dupes t2
5  ON t1.first_name = t2.first_name
6  AND t1.last_name = t2.last_name
7  AND t1.created_date = t2.created_date
8  AND t1.rowid > t2.rowid
9);

Let’s try getting a count first:

1SELECT COUNT(t1.customer_id)
2FROM all_customer_dupes t1
3INNER JOIN all_customer_dupes t2
4ON t1.first_name = t2.first_name
5AND t1.last_name = t2.last_name
6AND t1.created_date = t2.created_date
7AND t1.rowid > t2.rowid;

Records:

140,267 rows.

It shows the right amount of rows. Now, let’s try to delete them.

1DELETE (
2  SELECT t1.rowid
3  FROM all_customer_dupes t1
4  INNER JOIN all_customer_dupes t2
5  ON t1.first_name = t2.first_name
6  AND t1.last_name = t2.last_name
7  AND t1.created_date = t2.created_date
8  AND t1.rowid > t2.rowid
9);

Result (in Oracle):

1SQL Error: ORA-01752: cannot delete from view without exactly one key-preserved table 01752. 00000 -  "cannot delete from view without exactly one key-preserved table" \*Cause:    The deleted table had - no key-preserved tables, - more than one key-preserved table, or - the key-preserved table was an unmerged view. \*Action:   Redefine the view or delete it from the underlying base tables.

It shows a similar error as the earlier method.

This happens because Oracle does not support two tables in a DELETE statement. Even though we have specified which one we want to delete from, it’s not allowed in Oracle.

 

Method 12 - Inline View and Analytic Function

This method is similar to earlier methods that work but uses an inline view instead of an actual table.

1DELETE
2FROM (SELECT c.*,
3    ROW_NUMBER()    
4        OVER (PARTITION BY first_name, last_name, created_date
5        ORDER BY ROWID DESC) AS RN
6    FROM all_customer_dupes c
7) sub
8WHERE rn > 1;

Instead of saying DELETE FROM all_customer_dupes, I am deleting from the subquery itself.

Let’s try to COUNT the number of records.

1SELECT COUNT(*)
2FROM (SELECT c.*,
3    ROW_NUMBER()    
4        OVER (PARTITION BY first_name, last_name, created_date
5        ORDER BY ROWID DESC) AS RN
6    FROM all_customer_dupes c
7) sub
8WHERE rn > 1;

Result:

140,267 rows.

The count is correct. Now, let’s try a delete.

1DELETE
2FROM (SELECT c.*,
3    ROW_NUMBER()    
4        OVER (PARTITION BY first_name, last_name, created_date
5        ORDER BY ROWID DESC) AS RN
6    FROM all_customer_dupes c
7) sub
8WHERE rn > 1;

Result in Oracle:

1SQL Error: ORA-01732: data manipulation operation not legal on this view 01732. 00000 -  "data manipulation operation not legal on this view" \*Cause: \*Action:

This error occurs because Oracle does not allow deletions on inline views like this. You must delete from an actual table.

We get a similar error with MySQL.

 

Method 13 - Use RANK and Primary Key

This method is another one I found online that does not work:

 1DELETE FROM table
 2WHERE ROWID IN (
 3  SELECT "rowid"
 4  FROM (
 5    SELECT "rowid", rank_n
 6    FROM (
 7      SELECT RANK() OVER (
 8        PARTITION BY unique_columns ORDER BY rowid) rank_n,
 9      rowid as "rowid"
10      FROM table
11      WHERE primary_key IN (
12        SELECT primary_key
13        FROM table
14        GROUP BY all_columns
15        HAVING COUNT(*) > 1
16      )
17   )
18  )
19  WHERE rank_n > 1
20);

This query looks like it uses a RANK function as an analytic function with a primary key like earlier examples, but it adds a subquery in the WHERE clause. This subquery causes an error because the GROUP BY cannot be used with columns that are not in the SELECT clause.

So, this function does not even run as it has syntax errors.

 

Conclusion

In conclusion, there are a few ways to delete duplicate records from your table with SQL. Some work for all databases, and some only work for a couple of databases.

If you want to improve your SQL skills, enter your email below!

If you had any comments or questions on removing duplicates in SQL, let me know in the comments below.

comments powered by Disqus