Before getting to de-duping, let’s generate a reasonable amount of test data with duplications, using the table below. Keep in mind that, for business purposes, a duplicate record need not imply an exact match of every column between two rows. Often some columns are ancillary to the de-duping exercise; they serve as filler data, and can sometimes help to distinguish among “duplicates”. In our table, the two columns ALPHA_CODE and NUM_CODE will serve as the distinguishing marker for records: only when two separate rows contain identical values for both of these columns will they be considered duplicate rows.
CREATE TABLE test_dedup1
(alpha_code VARCHAR2(2) NOT NULL,
num_code INTEGER NOT NULL,
batch_id INTEGER,
timestamp DATE DEFAULT sysdate);

Table created.
Our ALPHA_CODE values will be any two uppercase letters, giving 676 unique candidates. Similarly, for NUM_CODE we’ll permit any integer between 1 and 676, inclusive. That allows 676 × 676 = 26^4, or 456,976, possible distinct rows. So, to guarantee duplicates, we need a process which generates, at random, say 460,000 rows; by the pigeonhole principle that yields at least 3,024 duplicates. The following PL/SQL snippet, calling the supplied package DBMS_RANDOM, will do the trick. It’s preceded by a single sample run of the relevant code in a SQL*Plus session, which displays one test row of data.
SELECT
CHR(65 + FLOOR(dbms_random.value(0,26)))||
CHR(65 + FLOOR(dbms_random.value(0,26))) AS alpha,
CEIL(dbms_random.value(0,676)) AS num,
ROUND(dbms_random.value(2000, 2013),0) AS batch,
TRUNC(sysdate) + dbms_random.value(-31,-1) AS timestamp
FROM dual;
ALPHA NUM BATCH TIMESTAMP
======== ========= ========= =========
ZB 392 2007 03-FEB-13
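For readers without a database handy, the same generation logic can be mirrored outside Oracle. Here is a minimal Python sketch of the expressions above (the function name is mine, and the distributions are approximations of the DBMS_RANDOM calls, not exact replicas):

```python
import random
import datetime

def random_row():
    """Generate one test row mirroring the DBMS_RANDOM expressions above."""
    # CHR(65 + FLOOR(dbms_random.value(0,26))) twice -> 'AA' .. 'ZZ'
    alpha = chr(65 + random.randrange(26)) + chr(65 + random.randrange(26))
    # CEIL(dbms_random.value(0,676)) -> 1 .. 676 (approximated as uniform here)
    num = random.randint(1, 676)
    # ROUND(dbms_random.value(2000,2013)) -> 2000 .. 2013 (endpoints approximated)
    batch = random.randint(2000, 2013)
    # TRUNC(sysdate) + dbms_random.value(-31,-1): a date 1-31 days in the past
    ts = datetime.date.today() + datetime.timedelta(days=random.uniform(-31, -1))
    return alpha, num, batch, ts
```

Each call yields one (alpha, num, batch, timestamp) tuple in the same ranges as the SQL sample run.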
IS
BEGIN
EXECUTE IMMEDIATE 'ALTER TABLE test_dedup1 NOLOGGING';
-- Outer loop: 460 iterations
FOR i IN 1 .. 460
LOOP
-- Inner loop: 1000 generated rows per commit
FOR j IN 1 .. 1000
LOOP
INSERT INTO test_dedup1 VALUES
(
CHR(65 + FLOOR(dbms_random.value(0,26)))||CHR(65 + FLOOR(dbms_random.value(0,26))),
CEIL(dbms_random.value(0,676)),
ROUND(dbms_random.value(2000, 2013),0),
TRUNC(sysdate) + dbms_random.value(-31,-1)
);
END LOOP;
COMMIT;
END LOOP;
END;
/
Procedure created.
The procedure, which involved about 2.3 million DBMS_RANDOM calls and half a million row inserts, executed in a little under 45 seconds on my laptop database, while I was also browsing online. I verified the rowcount: 460,000. Here’s a look at the first 100 or so randomly-generated rows of our table.
Basic Dedup Methods
Before getting to the actual removal of duplicates, let’s examine some characteristics of the generated data. Each of the 676 possible ALPHA_CODE values, as well as each of the 676 NUM_CODE values, is well represented within the set of generated test data. This can be seen in the following distribution results.
view ALPHA_CODE distribution
view NUM_CODE distribution
rob@XE:11g2> SELECT COUNT(UNIQUE alpha_code||num_code) FROM test_dedup1;
COUNT(UNIQUEALPHA_CODE||NUM_CODE)
=================================
289739
rob@XE:11g2>
SELECT 289739/(26*26*26*26) FROM dual;

289739/(26*26*26*26)
====================
          .634035485
We should not assume, however, that every one of the 676 × 676 concatenated code possibilities has been generated. We’ve produced only slightly more rows than the 456,976 possible distinct values, and in a typical probabilistic distribution that is nowhere near enough rolls of the dice to approach 100% coverage. In fact, with a number of rolls corresponding to slightly less than 101% of the space, we’ve generated about 63% of the distinct possible value pairs. Bottom line: we’re going to have many duplicates; about 170K, in fact! The query below looks at the meta-distribution of ALPHA_CODE values in the generated set. Whereas the average frequency for any given ALPHA_CODE is about 680 (that is, 460,000 / 676), one value occurs 760 times (the maximum) and another only 595 times (the minimum).
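That ~63% coverage is exactly what the birthday-problem arithmetic predicts: drawing n rows uniformly from N equally likely (ALPHA_CODE, NUM_CODE) pairs, the expected number of distinct pairs is N · (1 − (1 − 1/N)^n). A quick sanity check of the session numbers (a sketch of the math, not part of the original session):

```python
N = 26 ** 4   # 456,976 possible (ALPHA_CODE, NUM_CODE) pairs
n = 460_000   # rows generated

# Expected distinct pairs after n uniform draws from N possibilities
expected_distinct = N * (1 - (1 - 1 / N) ** n)   # ~289,973; session observed 289,739
expected_dupes = n - expected_distinct           # ~170,027; close to the 170,261 deleted later

print(round(expected_distinct), round(expected_dupes))
```

The theoretical expectation lands within about 0.1% of what the random generator actually produced.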
WITH count1 AS
(SELECT COUNT(*) AS alpha_dist, alpha_code
FROM test_dedup1
GROUP BY alpha_code)
SELECT
COUNT(*) AS meta_count, alpha_dist
FROM count1
GROUP BY alpha_dist
ORDER BY 1 DESC, 2 DESC
/
Now we come to the actual deleting of duplicates. We shall explore three methods. Because we have ancillary columns which we can be quite confident contain different values even when the ALPHA_CODE and NUM_CODE values are identical, we can use them to help find and remove duplicates. (Note that in all cases below, I make a second copy of the random data table so subsequent methods can be applied freshly. I also build indexes to facilitate the deduping.)
DELETE FROM test_dedup1
WHERE (alpha_code, num_code) IN
(
SELECT alpha_code, num_code FROM test_dedup1
GROUP BY alpha_code, num_code HAVING COUNT(*) > 1
)
AND (timestamp, alpha_code, num_code) NOT IN
(
SELECT MAX(timestamp), alpha_code, num_code FROM test_dedup1
GROUP BY alpha_code, num_code HAVING COUNT(*) > 1
)
/
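In effect this keeps, for each duplicated key, only the row with the latest TIMESTAMP value. A small Python model of that rule (tuple layout is mine; note that if two copies tied on the maximum timestamp, the NOT IN clause would keep both, which is why the random fractional dates matter here):

```python
def dedup_keep_latest(rows):
    """rows: list of (alpha_code, num_code, ts) tuples.
    Keep one row per (alpha_code, num_code) key: the one with the max ts."""
    latest = {}
    for alpha, num, ts in rows:
        key = (alpha, num)
        if key not in latest or ts > latest[key]:
            latest[key] = ts
    # Anything whose ts is not its key's maximum is "deleted"
    return [(a, n, t) for a, n, t in rows if latest[(a, n)] == t]

rows = [('AB', 7, 5.0), ('AB', 7, 9.5), ('CD', 9, 1.0)]
print(dedup_keep_latest(rows))  # the later 'AB'/7 copy survives
```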
view Dedup session #1
Using ROWIDs
Every Oracle database table is equipped with several pseudocolumns: columns containing system-generated data for internal use or to fulfill special queries. Here’s an overview of database pseudocolumns, and in case you’re interested, here’s a more comprehensive look. The one which matters for our present purposes is ROWID, which uniquely locates any row in a table; it is displayed as an 18-character string. Without relying upon any columns besides those directly relevant to duplication, we can use a simple self-join of the table along with the ROWID to perform the deletions.
DELETE FROM test_dedup1 X
WHERE X.rowid <>
(SELECT MAX(Y.rowid) FROM test_dedup1 Y
WHERE Y.alpha_code = X.alpha_code
AND Y.num_code = X.num_code)
/
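The keep-one rule behind this delete is easy to model outside the database: per (alpha_code, num_code) key, retain only the row whose identifier is the maximum. A Python sketch of the same logic (plain integer row ids stand in for Oracle ROWIDs; the function name is mine):

```python
def dedup_keep_max_id(rows):
    """rows: list of (row_id, alpha_code, num_code) tuples.
    Keep, per key, only the row with the largest row_id."""
    keeper = {}
    for row_id, alpha, num in rows:
        key = (alpha, num)
        if key not in keeper or row_id > keeper[key]:
            keeper[key] = row_id
    # Every row whose id is not the keeper for its key gets "deleted"
    return [r for r in rows if keeper[(r[1], r[2])] == r[0]]

rows = [(1, 'AB', 7), (2, 'AB', 7), (3, 'CD', 9)]
print(dedup_keep_max_id(rows))  # only the max-id 'AB'/7 copy survives
```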
view Dedup session #2
Using Analytic Functions
Yet another way to isolate duplicate rows for removal is with Oracle’s analytic functions. Both ROW_NUMBER() and RANK() might seem plausible, but RANK assigns tied rows the same value; since the only thing available to order by within a duplicate partition is the partition key itself, every copy would receive rank 1 and none would be isolated. ROW_NUMBER, by contrast, numbers even tied rows uniquely, which is exactly what we need. If you want an intro to SQL analytic functions, read this. Really we are still relying upon the ROWID pseudocolumn with this method, as directly above; but in this case we are letting analytic syntax describe the grouped sets of duplicates.
DELETE FROM test_dedup1
WHERE ROWID IN
(SELECT bad_clone FROM
(SELECT
ROWID AS bad_clone,
ROW_NUMBER() OVER
(PARTITION BY alpha_code, num_code ORDER BY alpha_code)
AS sibling_id
FROM test_dedup1)
WHERE sibling_id > 1)
/
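The ROW_NUMBER() semantics here amount to: number the copies within each duplicate group, and mark every copy after the first for deletion. A Python sketch of that numbering pass (tuple layout and function name are mine; integer ids stand in for ROWIDs):

```python
from collections import defaultdict

def find_bad_clones(rows):
    """rows: list of (row_id, alpha_code, num_code) tuples.
    Return the ids a ROW_NUMBER()-style pass would flag with sibling_id > 1."""
    sibling_count = defaultdict(int)
    bad = []
    for row_id, alpha, num in rows:
        sibling_count[(alpha, num)] += 1      # running ROW_NUMBER within the partition
        if sibling_count[(alpha, num)] > 1:   # every copy after the first is a "bad clone"
            bad.append(row_id)
    return bad

rows = [(1, 'AB', 7), (2, 'AB', 7), (3, 'CD', 9), (4, 'AB', 7)]
print(find_bad_clones(rows))  # the second and third 'AB'/7 copies are flagged
```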
view Dedup session #3
And the Winner Is…
We’ve now seen three different methods for removing duplicated rows within the same table. Which was the most efficient? For each of them I used a fresh copy of the original table, as can be seen in the spooled session results. In all cases I tried two different indexing schemes to support the deduping: 1) two single-column indexes on ALPHA_CODE and NUM_CODE; 2) one concatenated index built on (ALPHA_CODE, NUM_CODE). The results were consistent: in each case 170,261 duplicate rows were deleted, and a check was performed showing that no duplicates remained. Conclusion: Oracle is faster when able to use ROWIDs, independently of whether Analytic functions are used or not. The execution times, via the SET TIMING ON command, were as follows:
| Method | 2 Single Indexes | Concatenated Index |
|---|---|---|
| IN / NOT IN | 13 min 03 secs | 17 min 52 secs |
| ROWID | 22 secs | 13 secs |
| Analytic | 18 secs | 14 secs |
compare trace plans
~RS