Oracle: Deleting Duplicate Rows Efficiently

May 23, 2006 / By Tim Procter

Tags: , ,

Finding duplicates with RANK
If you’ve made the mistake of forgetting a primary key on your table, it can be frustrating to find a way to delete all of the duplicate rows without deleting the initial instances.

delete from $table_name where rowid in
  (
  select "rowid" from
     (select "rowid", rank_n from
         (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid"
             from $table_name
             where $primary_key in
                (select $primary_key from $table_name
                  group by $all_columns
                  having count(*) > 1
                )
             )
         )
     where rank_n > 1
  )

This query selects all of the ‘extra’ rowids and removes them. It is especially designed for limiting the query scans to only those records which have duplicates, which is useful if there’s only a subset of the table that you are dealing with. If you want to improve its efficiency for a table with a high percentage of duplicates, simply remove the inside where clause. In a sample test with 1% duplicates and 233 000 rows, the query took 22 seconds without the subquery, and 18 seconds with it.

* NOTE: This query may not work in Oracle versions before 8i.

A quick way to create duplicates:

create table $temp_table as select * from $table_name sample(10);
insert into $table_name select * from $temp_table;

Alternative methods
GROUP BY method

DELETE FROM $table_name
WHERE rowid not in
(SELECT MIN(rowid)
FROM $table_name
GROUP BY $all_columns;

Join with self method

SELECT
 rowid
FROM
   $table_name A
WHERE
   rowid >
     (SELECT min(rowid) FROM $table_name B
      WHERE
         A.$column1 = B.$column1
        and A.$column2 = B.$column2
        ......
      );

Comparison
In the test case, I used a table with 233 000 rows created with a 1% duplication factor. You’ll notice that I interrupted the execution of the ‘join with self’ because I didn’t have the patience to let it run to completion.

The following is a TKPROF output of the queries:

TKPROF: Release 8.1.7.4.0 - Production on Tue May 23 14:47:43 2006

(c) Copyright 2000 Oracle Corporation.  All rights reserved.

Trace file: flpdt_ora_28183.trc
Sort options: default

********************************************************************************

select "rowid" from
(select "rowid", rank_n from
(select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid"
from $schema.$table_name
where token_id in
(select token_id from $schema.$table_name
group by tournament_id, nickname, bank, place, prize_id, userid, token_id
having count(*) > :"SYS_B_0"
)
)
)
where rank_n > :"SYS_B_1"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.01          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch      315     29.98      73.11      14539       7578         94        4662
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      321     29.98      73.12      14539       7578         94        4662

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  VIEW
      0   WINDOW SORT
      0    MERGE JOIN
 230731     SORT JOIN
 230730      TABLE ACCESS FULL $table_name
      0     SORT JOIN
      0      VIEW VW_NSO_1
      0       SORT UNIQUE
      0        FILTER
 230731         SORT GROUP BY
 230730          TABLE ACCESS FULL $table_name

 ********************************************************************************

select "rowid" from
(select "rowid", rank_n from
(select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid"
from $schema.$table_name
)
)
where rank_n > :"SYS_B_0"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.03          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      157      7.67      15.47       2635       1257         15        2337
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      159      7.67      15.50       2635       1257         15        2337

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

********************************************************************************

SELECT MIN(rowid)
FROM $schema.$table_name
GROUP BY tournament_id, nickname, bank, place, prize_id, userid, token_id
having count(*) > :"SYS_B_0"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.00          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch      315     16.74      41.81       7832       3897         53        4662
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      321     16.74      41.81       7832       3897         53        4662

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  FILTER
 230731   SORT GROUP BY
 230730    TABLE ACCESS FULL $table_name

********************************************************************************

SELECT
 rowid
FROM
   $schema.$table_name A
WHERE
   rowid >
     (SELECT min(rowid) FROM $schema.$table_name B
      WHERE
         B.tournament_id = A.tournament_id
      and B.nickname = A.nickname
      and B.bank = A.bank
      and B.place = A.place
      and B.prize_id = A.prize_id
      and B.userid = A.userid
      and B.token_id = A.token_id
      )

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.02       0.01          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3    153.83     213.12        933     436118       2104           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        9    153.85     213.13        933     436118       2104           0

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  FILTER
     30   TABLE ACCESS FULL $table_name
     58   SORT AGGREGATE
      3    TABLE ACCESS FULL $table_name

5 Responses to “Oracle: Deleting Duplicate Rows Efficiently”

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>