Oracle: Deleting Duplicate Rows Efficiently

Posted in: Technical Track

Finding duplicates with RANK
If you’ve made the mistake of forgetting a primary key on your table, it can be frustrating to find a way to delete all of the duplicate rows without deleting the initial instances.

delete from $table_name where rowid in
  (
  select "rowid" from
     (select "rowid", rank_n from
         (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid"
             from $table_name
             where $primary_key in
                (select $primary_key from $table_name
                  group by $all_columns
                  having count(*) > 1
                )
             )
         )
     where rank_n > 1
  )

This query selects all of the ‘extra’ rowids and removes them. It is especially designed for limiting the query scans to only those records which have duplicates, which is useful if there’s only a subset of the table that you are dealing with. If you want to improve its efficiency for a table with a high percentage of duplicates, simply remove the inside where clause. In a sample test with 1% duplicates and 233 000 rows, the query took 22 seconds without the subquery, and 18 seconds with it.

* NOTE: This query may not work in Oracle versions before 8i.

A quick way to create duplicates:

create table $temp_table as select * from $table_name sample(10);
insert into $table_name select * from $temp_table;

Alternative methods
GROUP BY method

DELETE FROM $table_name
WHERE rowid not in
(SELECT MIN(rowid)
FROM $table_name
GROUP BY $all_columns;

Join with self method

SELECT
 rowid
FROM
   $table_name A
WHERE
   rowid >
     (SELECT min(rowid) FROM $table_name B
      WHERE
         A.$column1 = B.$column1
        and A.$column2 = B.$column2
        ......
      );

Comparison
In the test case, I used a table with 233 000 rows created with a 1% duplication factor. You’ll notice that I interrupted the execution of the ‘join with self’ because I didn’t have the patience to let it run to completion.

The following is a TKPROF output of the queries:

TKPROF: Release 8.1.7.4.0 - Production on Tue May 23 14:47:43 2006

(c) Copyright 2000 Oracle Corporation.  All rights reserved.

Trace file: flpdt_ora_28183.trc
Sort options: default

********************************************************************************

select "rowid" from
(select "rowid", rank_n from
(select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid"
from $schema.$table_name
where token_id in
(select token_id from $schema.$table_name
group by tournament_id, nickname, bank, place, prize_id, userid, token_id
having count(*) > :"SYS_B_0"
)
)
)
where rank_n > :"SYS_B_1"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.01          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch      315     29.98      73.11      14539       7578         94        4662
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      321     29.98      73.12      14539       7578         94        4662

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  VIEW
      0   WINDOW SORT
      0    MERGE JOIN
 230731     SORT JOIN
 230730      TABLE ACCESS FULL $table_name
      0     SORT JOIN
      0      VIEW VW_NSO_1
      0       SORT UNIQUE
      0        FILTER
 230731         SORT GROUP BY
 230730          TABLE ACCESS FULL $table_name

 ********************************************************************************

select "rowid" from
(select "rowid", rank_n from
(select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid"
from $schema.$table_name
)
)
where rank_n > :"SYS_B_0"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.03          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      157      7.67      15.47       2635       1257         15        2337
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      159      7.67      15.50       2635       1257         15        2337

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

********************************************************************************

SELECT MIN(rowid)
FROM $schema.$table_name
GROUP BY tournament_id, nickname, bank, place, prize_id, userid, token_id
having count(*) > :"SYS_B_0"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.00          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch      315     16.74      41.81       7832       3897         53        4662
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      321     16.74      41.81       7832       3897         53        4662

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  FILTER
 230731   SORT GROUP BY
 230730    TABLE ACCESS FULL $table_name

********************************************************************************

SELECT
 rowid
FROM
   $schema.$table_name A
WHERE
   rowid >
     (SELECT min(rowid) FROM $schema.$table_name B
      WHERE
         B.tournament_id = A.tournament_id
      and B.nickname = A.nickname
      and B.bank = A.bank
      and B.place = A.place
      and B.prize_id = A.prize_id
      and B.userid = A.userid
      and B.token_id = A.token_id
      )

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.02       0.01          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3    153.83     213.12        933     436118       2104           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        9    153.85     213.13        933     436118       2104           0

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  FILTER
     30   TABLE ACCESS FULL $table_name
     58   SORT AGGREGATE
      3    TABLE ACCESS FULL $table_name

Interested in working with Tim? Schedule a tech call.

5 Comments. Leave new

William Robertson
May 27, 2006 6:19 pm

Why not simply

DELETE table_name
WHERE rowid IN
( SELECT LEAD(rowid) OVER (PARTITION BY pk ORDER BY whatever)
FROM table_name );

Reply

I like using the LEAD function, and it works well for removing duplicates, but you need to be sure that every row has exactly one duplicate. In my test scenario, it performed poorly because I had only a subset of duplicates. In such systems, you would need to narrow the results to only problem records and also factor in the possibility of one record having multiple duplicates.

Reply

Hi,

Is it possible to delete alternate rows from a table?
If yes, can some one send me the query?

Thanks
Abhijeet.

Reply

Select alteranate rows

select * from Table where rowid%2=0;

Reply
finding / deleting duplicate rows « ur is not a word
April 14, 2010 12:05 pm
Reply

Leave a Reply

Your email address will not be published. Required fields are marked *