FB

How to Remove Duplicate Records in SQL

Do you need to use SQL to remove duplicates in your tables? Learn how to write an SQL query to remove duplicate data in this article.

The Problem – Removing Duplicates in SQL

Let’s say you have a table with some data in it. You’ve found out that there is some duplicate data in this table. And you want to get rid of the duplicates.SQL Remove Duplicates

The way you define duplicate data could be dependant on your data.

  • Is it a duplicate if all of the columns are the same?
  • Is it a duplicate if all columns except for the primary key are the same?
  • Is it a duplicate if only a few columns are the same?

In any case, identifying and removing duplicates is possible in SQL.

There are several ways to do it. I’ll explain each of these methods.

We’ll see how it can work on Oracle, SQL Server, MySQL, and PostgreSQL.

The sample of data has 1,220 records in a single table, which looks like this:

customer_id first_name last_name address
1 ‘Bamby’ ‘Applin’ 29 Lighthouse Bay Lane
2 ‘Blinny’ ‘Kirman’ 96024 Jenifer Center
3 ‘Rodie’ ‘Dils’ 62729 Debs Point
4 ‘Abdel’ ‘Grisedale’ 711 Lawn Park
5 ‘Fernanda’ ‘Lyokhin’ 54 Dorton Place

 

Let’s say that a record is a duplicate if it contains the same first_name and last_name values.

Let’s take a look at the different ways to remove duplicates in SQL.

Summary of Methods

Here’s a summary of the different methods and which databases they work on.

Method Oracle SQL Server MySQL PostgreSQL
ROW_NUMBER Y Y Y Y
Delete with Join Y Y Y Y
MIN or MAX Y Y Y Y
DENSE_RANK Y Y N Y
Correlated Subquery with MIN or MAX Y N N N
Subquery with ANY Y N N N

 

A Note on Query Times

In each of these examples, I explain the code I am using, what it does, and delete data using the DELETE statement.

However, any query times should only be taken as a guide, and may be different from the performance you get:

  • Your table structure, unique data business rules, and number of records will be different
  • Your database version and specs will be different
  • Several other factors may impact performance such as query cache or background processes on my computer.

So, I guess my point is that you might not get the same performance as I do. But, if you try one method and it has poor performance for your data, try one of the other methods.

 

Method 1 – ROW_NUMBER Analytic Function

Database: Oracle, MySQL, SQL Server, PostgreSQL

The first method I’ll show you is using an analytic function called ROW_NUMBER. It’s been recommended in several places such as StackOverflow questions and an AskTOM thread.

It involves several steps:

  1. Using a subquery to find each ROWID (which is a unique number given to each row in an Oracle table) and the ROW_NUMBER function to find a sequential number for that row, grouped by the fields you specify as unique.
  2. Find the ROWID values that are identified as duplicates.
  3. Delete rows that match these ROWIDs.

The query looks like this:

DELETE FROM table a
WHERE a.ROWID IN
 (SELECT ROWID FROM
   (SELECT
    ROWID,
    ROW_NUMBER() OVER
    (PARTITION BY unique_columns ORDER BY ROWID) dup
    FROM table)
  WHERE dup > 1);

The ROW_NUMBER function here is used as an analytic function. It uses the PARTITION BY to create partitions or groups based on the fields I’ve mentioned in the PARTITION BY (which are first_name, last_name, and created_date). I’ve then ordered them by the ROWID.

This means that the first occurrence of those fields will get the number of 1. The second occurrence of all of those fields will get a number of 2, and so on.

The ROWIDs are then returned to the DELETE statement at the top, which only deletes records where the ROW_NUMBER function (which has an alias of “dup” in this example) are greater than one. (The AskTOM thread uses “WHERE dup <> 1” but it achieves the same thing).

It’s a good idea to check the records you’re deleting first, by running a SELECT statement using this criteria. For the purposes of these examples, I’ll check the COUNT of the records about to be deleted, by replacing the DELETE with a SELECT COUNT(*).

SELECT COUNT(*)
FROM customer a
WHERE a.ROWID IN
  (SELECT ROWID FROM
     (SELECT
      ROWID,
      ROW_NUMBER() OVER
        (PARTITION BY first_name, last_name, address
         ORDER BY ROWID) dup
    FROM customer)
    WHERE dup > 1);

Result:

220

It shows there are 220 duplicate records.

In MySQL, or other databases, your query may look like this:

SELECT COUNT(*)
FROM customer a
WHERE a.customer_id IN
  (SELECT customer_id FROM
     (SELECT
      customer_id,
      ROW_NUMBER() OVER
        (PARTITION BY first_name, last_name, address
         ORDER BY customer_id) dup
    FROM customer)
    WHERE dup > 1);

If I run this command as a DELETE statement:

DELETE FROM customer a
WHERE a.ROWID IN
  (SELECT ROWID FROM
    (SELECT
    ROWID,
    ROW_NUMBER() OVER
      (PARTITION BY first_name, last_name, address) dup
    FROM customer)
  WHERE dup > 1);

Result:

220 records deleted.

This is the method I would use if I needed to delete duplicate records from a table. It uses a ROWID filter which is usually a fast way to access a table.

 

Method 2: Delete with JOIN

Database: Oracle, SQL Server, MySQL, PostgreSQL

This is a commonly recommended method for MySQL and works for all other databases. It involves joining the same table to itself, specifying the matching columns, and deleting all but one duplicate row.

Here’s the sample query:

DELETE t1
FROM customer t1
INNER JOIN customer t2
ON t1.customer_id < t2.customer_id
AND t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.address = t2.address;

We specify the customer table twice. We specify the first_name, last_name, and address fields, which identify the duplicate row.

We also specify customer_id < customer_id, which will delete all customers except for the one with the highest ID.

If we run this query, we get this result.

Result:

220 rows deleted.

You could change the query to use > instead of < if you want to keep the rows with the lowest ID.

 

Method 3 – MIN or MAX Function

Database: Oracle, SQL Server, MySQL, PostgreSQL

This method uses either the MIN or MAX function to find duplicates inside a subquery.

It’s similar to earlier examples, but it uses fewer subqueries.

This method only works if you have a unique value for each row. If there are some duplicated values (e.g. if the ID for the row is duplicated), then it won’t work.

This example here uses the MIN function, but you can use the MAX function to get the same result.

The syntax looks like this:

DELETE FROM your_table
WHERE rowid not in (
  SELECT MIN(rowid)
  FROM your_table
  GROUP BY column1, column2, column3
);

The columns in the GROUP BY statement are those that are checked for duplicates. In the example I’ve been using, this is the first_name, last_name, and created_date.

If I run it as a SELECT COUNT(*) first, I can see the number of rows impacted.

SELECT COUNT(*)
FROM customer
WHERE rowid NOT IN (
  SELECT MIN(rowid)
  FROM customer
  GROUP BY first_name, last_name
);

Result:

220 rows.

Now, I can run the DELETE statement.

DELETE FROM customer
WHERE rowid NOT IN (
  SELECT MIN(rowid)
  FROM customer
  GROUP BY first_name, last_name
);

Result:

220 rows deleted.

Note: This type of query can run in MySQL, but it shows incorrect results. This is because MySQL does not have an equivalent of ROWID, which is a unique value for each row.

 

Method 4 – DENSE_RANK

Database: Oracle, SQL Server, PostgreSQL

Not: MySQL

This method is similar to method 1 but uses the DENSE_RANK method.

In fact, it’s almost exactly the same.

The syntax looks like this:

DELETE FROM table a
WHERE a.ROWID IN (
    SELECT rid FROM (
        SELECT ROWID rid,
        DENSE_RANK() OVER(
          PARTITION BY unique_columns
          ORDER BY ROWID) dup
        FROM table
    )
    WHERE dup > 1
);

It uses the same concept as ROW_NUMBER, but uses the DENSE_RANK function. There are some differences between using these functions, but in this case, they give the same output.

Let’s find the COUNT(*) of records first.

SELECT COUNT(*) FROM customer 
WHERE ROWID IN (
   SELECT rid FROM (
      SELECT ROWID rid,
       DENSE_RANK() OVER(
         PARTITION BY first_name, last_name
         ORDER BY ROWID) dup
       FROM customer
    )
    WHERE dup > 1
);

Result:

220 rows.

This shows the same number of records as methods 1 and 2.

Now, let’s try the DELETE statement.

DELETE FROM customer 
WHERE ROWID IN (
    SELECT rid FROM (
      SELECT ROWID rid,
      DENSE_RANK() OVER(
        PARTITION BY first_name, last_name
        ORDER BY ROWID) dup
      FROM customer
    )
    WHERE dup > 1
);

Result:

220 rows deleted.

This deletes all of the duplicate records we found.

In this method, you can also use a RANK function instead of DENSE_RANK. It should show the same results.

 

Method 5 – Correlated Subquery with MIN or MAX

Database: Oracle

Not: MySQL, SQL Server, PostgreSQL

The next method I’ll share is similar to method 4 but uses a correlated subquery to match on columns.

DELETE
FROM table a
WHERE ROWID NOT IN (
  SELECT MAX(ROWID)
  FROM table b
  WHERE a.col1 = b.col1
  AND a.col2 = b.col2
  AND a.col3 = b.col3
);

It’s similar to the earlier query, but instead of using a GROUP BY clause, we use a WHERE clause. This WHERE clause joins the table inside the subquery to the table outside the subquery. The tables are joined on the columns that match.

Let’s see how many records are found.

SELECT COUNT(*)
FROM customer a
WHERE ROWID NOT IN (
  SELECT MAX(ROWID)
  FROM customer b
  WHERE a.first_name = b.first_name
  AND a.last_name = b.last_name
);

Result:

220 rows.

Now, let’s run this as a DELETE query:

DELETE
FROM customer a
WHERE ROWID NOT IN (
  SELECT MAX(ROWID)
  FROM customer b
  WHERE a.first_name = b.first_name
  AND a.last_name = b.last_name
);

Result:

220 rows deleted.

Note: This type of query can run in MySQL, but it shows incorrect results. This is because MySQL does not have an equivalent of ROWID, which is a unique value for each row.

 

Method 6: Use a Subquery with ANY

Database: Oracle

Not: MySQL, SQL Server, PostgreSQL

The next method we’ll look at is using a subquery to identify and delete duplicate data.

I’ll show you the query first, then explain how it works.

DELETE FROM tablename a
WHERE a.rowid > ANY (
  SELECT b.rowid
  FROM tablename b
  WHERE a.column1 = b.column1
);

Now, there are a few things to note about this query.

Both the subquery and the outer query should be selecting from the same table, but use different table aliases. In this example, I have used a and b.

The WHERE clause in the subquery is used to define how you identify a duplicate. You write your WHERE clause that joins on every column that you consider when you identify duplicates. This could be one field, two fields, or every field in the table.

The rowid pseudocolumn is used to identify rows in the subquery because this is guaranteed to be unique.

The WHERE clause of the outer query uses a > ANY condition to check for duplicates. It will delete any row that has a rowid greater than at least one other row. This ensures that all but one of the rows that match your conditions is met, therefore removing all duplicates.

So, how does this method perform?

Using this query on our sample data looks like this:

DELETE
FROM customer a
WHERE a.rowid > ANY (
  SELECT b.rowid
  FROM customer b
  WHERE a.first_name = b.first_name
  AND a.last_name = b.last_name
  AND a.address = b.address
);

If we run this as a SELECT COUNT(*), we can see the number of records impacted:

SELECT COUNT(*)
FROM customer a
WHERE a.rowid > ANY (
  SELECT b.rowid
  FROM customer b
  WHERE a.first_name = b.first_name
  AND a.last_name = b.last_name
  AND a.address = b.address
);

Result:

COUNT(*)
220

This query has found 220 duplicate values. It’s the same as Example 1.

Now, let’s run it as a DELETE query.

DELETE
FROM customer a
WHERE a.rowid > ANY (
  SELECT b.rowid
  FROM customer b
  WHERE a.first_name = b.first_name
  AND a.last_name = b.last_name
  AND a.address = b.address
);

The result is:

220 rows deleted.

 

Other Methods You Might Come Across

I’ve only found a few different ways for deleting duplicate data in SQL. I’ve listed a few more below, which don’t actually work as well as the others, if at all.

I’ve added these so you can see if specific methods work, for example if someone at work mentions a method they have tried. Or, if you’ve found a method that works on SQL Server and want to try it on an Oracle database.

So, here are the other methods I found to delete duplicate data.

Just to be clear, these don’t work, and they are just here for reference.

 

Method 7: Use an Intermediate Table

This method is a little different from the others, as we use several steps. We create a table to store the de-duplicated data, then update the main table with it.

Here are the steps:

  1. Create a new table that has the same structure as the original table.
  2. Insert the distinct (de-duplicated) rows from the original table to the new table
  3. Drop the original table
  4. Rename the new table to the original table

This is how we do it.

First, we create a new table with the same structure:

CREATE TABLE customer_int AS
SELECT customer_id, first_name, last_name, address
FROM customer
WHERE 1=0;

Next, we populate the new customer_int table with unique records using SELECT and a GROUP BY:

SELECT customer_id, first_name, last_name, address
FROM customer
GROUP BY first_name, last_name, address;

However, this generates an error message:

Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column ‘test.customer.customer_id’ which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

It causes an error because of an incorrect use of GROUP BY.

This may have worked in earlier versions of MySQL, but not in recent versions. Other databases may show similar errors.

There may be a way to populate an intermediate table using a different method, but it may just be easier to delete the records using another method in this guide.

 

Method 8 – DELETE WHERE IN

This method uses a WHERE clause with specified columns that are checked against a subquery.

It seems very similar to method 4, with the addition of the three columns instead of using ROWID.

The method looks like this:

DELETE FROM table WHERE
(col1, col2, col3) IN (
  SELECT MAX(col1), col2, col3
  FROM table
  GROUP BY col2, col3
);

Here’s the query on our database:

DELETE FROM customer
WHERE (customer_id, first_name, last_name, address) IN (
SELECT MAX(customer_id), first_name, last_name, address
FROM customer
GROUP BY first_name, last_name, address
);

In Oracle, I ran this query for a few minutes but then stopped it without getting a result. I imagine it runs slowly because of the WHERE clause on the three fields, rather than the ROWID mentioned earlier.

In MySQL, I ran this and got an error:

Error Code: 1093. You can’t specify target table ‘customer’ for update in FROM clause

This is because you can’t specify the table in the FROM clause and in the WHERE clause.

This query may work for you if one of the fields you’re looking at is an ID or some kind of number (to get the MAX value of), and if you have indexes on these fields.

But, for me, this query did not work.

 

Method 9 – DELETE with LEFT OUTER JOIN

This method uses a LEFT OUTER JOIN from the table that has the records to delete, to a subquery that contains the duplicate records.

DELETE (
    SELECT d.*, d.rowid
    FROM customer d
    LEFT OUTER JOIN (
        SELECT MIN(RowId) AS MinRowId,
        first_name, last_name, address
        FROM customer
        GROUP BY first_name, last_name, address
        ) KeepRows ON d.RowId = KeepRows.MinRowId
    WHERE KeepRows.MinRowId IS NULL;

Let’s check the COUNT first:

SELECT COUNT(*)
FROM customer
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS MinRowId,
    first_name, last_name, address
    FROM customer
    GROUP BY first_name, last_name, address
    ) KeepRows ON all_customer_dupes.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;
COUNT(*)
176

It shows an incorrect number of duplicate records. We could run this as a DELETE command on SQL Server and the rows will be deleted.

If we are on Oracle, we can try to run this as a DELETE command.

DELETE (
    SELECT d.*, d.rowid
    FROM customer d
    LEFT OUTER JOIN (
        SELECT MIN(RowId) AS MinRowId,
        first_name, last_name, address
        FROM customer
        GROUP BY first_name, last_name, address
        ) KeepRows ON d.RowId = KeepRows.MinRowId
    WHERE KeepRows.MinRowId IS NULL

Result:

SQL Error: ORA-01752: cannot delete from view without exactly one key-preserved table
01752. 00000 –  “cannot delete from view without exactly one key-preserved table”
*Cause:    The deleted table had
– no key-preserved tables,
– more than one key-preserved table, or
– the key-preserved table was an unmerged view.
*Action:   Redefine the view or delete it from the underlying base tables.

This query did not work for me and showed this error on Oracle.

 

Method 10 – DELETE with WITH Clause

This method is also based on a concept that works in SQL Server called CTE or Common Table Expressions.

The query looks like this:

WITH cte AS
(SELECT ROW_NUMBER() 
    OVER (PARTITION BY first_name, last_name, address
    ORDER BY (customer_id)) AS RN
FROM customer)
DELETE FROM cte
WHERE rn > 1;

In SQL Server, you can run this query and rows are deleted.

220 rows deleted.

In Oracle, if we run this query, we get an error message:

ORA-00928: missing SELECT keyword
00928. 00000 –  “missing SELECT keyword”
*Cause:
*Action:

This is because, in Oracle, the WITH clause needs to have a SELECT query that follows it, and not a DELETE query. It doesn’t work with DELETE statements.

 

Method 11 – Two Tables Joined

This method uses two tables joined to each other on the matching columns, and a ROWID check to only find those that are unique. I’ve specified the table to delete after the DELETE keyword so the database will know which table I want to delete from.

DELETE (
  SELECT t1.rowid
  FROM all_customer_dupes t1
  INNER JOIN all_customer_dupes t2
  ON t1.first_name = t2.first_name
  AND t1.last_name = t2.last_name
  AND t1.created_date = t2.created_date
  AND t1.rowid > t2.rowid
);

Let’s try getting a count first:

SELECT COUNT(t1.customer_id)
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid;

Records:

40,267 rows.

It shows the right amount of rows. Now, let’s try to delete them.

DELETE (
  SELECT t1.rowid
  FROM all_customer_dupes t1
  INNER JOIN all_customer_dupes t2
  ON t1.first_name = t2.first_name
  AND t1.last_name = t2.last_name
  AND t1.created_date = t2.created_date
  AND t1.rowid > t2.rowid
);

Result (in Oracle):

SQL Error: ORA-01752: cannot delete from view without exactly one key-preserved table
01752. 00000 –  “cannot delete from view without exactly one key-preserved table”
*Cause:    The deleted table had
– no key-preserved tables,
– more than one key-preserved table, or
– the key-preserved table was an unmerged view.
*Action:   Redefine the view or delete it from the underlying base tables.

It shows a similar error as the earlier method.

This happens because Oracle does not support two tables in a DELETE statement. Even though we have specified which one we want to delete from, it’s not allowed in Oracle.

 

Method 12 – Inline View and Analytic Function

This method is similar to earlier methods that work but uses an inline view instead of an actual table.

DELETE
FROM (SELECT c.*,
    ROW_NUMBER()    
        OVER (PARTITION BY first_name, last_name, created_date
        ORDER BY ROWID DESC) AS RN
    FROM all_customer_dupes c
) sub
WHERE rn > 1;

Instead of saying DELETE FROM all_customer_dupes, I am deleting from the subquery itself.

Let’s try to COUNT the number of records.

SELECT COUNT(*)
FROM (SELECT c.*,
    ROW_NUMBER()    
        OVER (PARTITION BY first_name, last_name, created_date
        ORDER BY ROWID DESC) AS RN
    FROM all_customer_dupes c
) sub
WHERE rn > 1;

Result:

40,267 rows.

The count is correct. Now, let’s try a delete.

DELETE
FROM (SELECT c.*,
    ROW_NUMBER()    
        OVER (PARTITION BY first_name, last_name, created_date
        ORDER BY ROWID DESC) AS RN
    FROM all_customer_dupes c
) sub
WHERE rn > 1;

Result in Oracle:

SQL Error: ORA-01732: data manipulation operation not legal on this view
01732. 00000 –  “data manipulation operation not legal on this view”
*Cause:
*Action:

This error occurs because Oracle does not allow deletions on inline views like this. You must delete from an actual table.

We get a similar error with MySQL.

 

Method 13 – Use RANK and Primary Key

This method is another one I found online that does not work:

DELETE FROM table
WHERE ROWID IN (
  SELECT "rowid"
  FROM (
    SELECT "rowid", rank_n
    FROM (
      SELECT RANK() OVER (
        PARTITION BY unique_columns ORDER BY rowid) rank_n,
      rowid as "rowid"
      FROM table
      WHERE primary_key IN (
        SELECT primary_key
        FROM table
        GROUP BY all_columns
        HAVING COUNT(*) > 1
      )
   )
  )
  WHERE rank_n > 1
);

This query looks like it uses a RANK function as an analytic function with a primary key like earlier examples, but it adds on a subquery in the WHERE clause. This subquery causes an error, because the GROUP BY cannot be used with columns that are not in the SELECT clause.

So, this function does not even run as it has syntax errors.

 

Conclusion

In conclusion, there are a few ways to delete duplicate records from your table with SQL. Some work for all databases, and some only work for a couple of databases.

If you want to improve your SQL skills, enter your email below!

If you had any comments or questions on removing duplicates in SQL, let me know in the comments below.

12 thoughts on “How to Remove Duplicate Records in SQL”

  1. Hi Ben ,

    Is it possible to delete duplicate records without using Rowid. If Yes how?

    Could you pls let me add the same query in u r Blog

  2. In an MPP system (Netezza), the following works the best. What if the table does not have a PK and the table has a lot of columns?

    CREATE TABLE table_b AS
    SELECT * FROM table_a UNION SELECT * FROM table_a; /* This will dedup the data */

    TRUNCATE TABLE table_a; /* else we can also DROP TABLE table_a and rename table_b to table_a */
    INSERT INTO table_a SELECT * FROM table_b;
    DROP TABLE table_b;

    1. ITS NOT WORK ON SQL SERVER
      IF I HAVE TABLE AS EMP
      EMP_ID, EMP_NAME, DOB, CITY
      1, ABC, 2019-01-01, ggg
      1, ABC, 2019-02-1,GGG
      2, DEF,2019-3-3,HHH
      3, GHI, 2019-04-04,EEEE
      4, GHI, 2019-05-04,EEEE

  3. Andrea Pignatone

    Hi Ben,
    which of the Run-Times for the second method (Subquery with Any) is correct, 27.397 secs or 16.76 secs?

    Regards
    Andrea

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.