Hearing the Oracle


De-duplicating Rows

dedup them!

dedup these hay-bogarting heifers for me!

Often the need arises to locate and remove duplicated rows, or partially duplicated rows, from a table. Queries using the DISTINCT keyword (or its synonym UNIQUE) will retrieve and display rows without duplicates. But actually identifying and then removing the unwanted duplicates within a table represents a different level of work, and is also likely to involve greater resource consumption.

Before getting to de-duping, let’s generate a reasonable amount of test data with duplications. We’ll use the table below. Keep in mind that a duplicate record for business purposes need not necessarily imply an exact match of every column between two rows. Often, some columns will be considered ancillary to the de-duping exercise; they serve just as filler data, and can sometimes be useful to distinguish “duplicates”. In our table, the two columns ALPHA_CODE and NUM_CODE will serve as our distinguishing marker for records. Only when two separate rows contain identical values for these two columns will they be considered duplicate rows.

  rob@XE_11g2> CREATE TABLE test_dedup1
  (alpha_code VARCHAR2(2) NOT NULL,
   num_code INTEGER NOT NULL,
   batch_id INTEGER,
   timestamp DATE DEFAULT sysdate);

  Table Created.

Our ALPHA_CODE values will be any two uppercase letters, giving 676 unique candidates. Similarly, for NUM_CODE we’ll permit any integer between 1 and 676, inclusive. This allows 26^4, or 456,976, possible distinct rows. So, in order to guarantee duplicates, we need a process which generates, at random, say 460,000 rows, which must yield at least 3,024 duplicates. The following PL/SQL snippet, calling the supplied package DBMS_RANDOM, will do the trick. It’s preceded by a single sample run of the relevant code in a SQL*Plus session, which displays one test row of data.

   rob@XE_11g2> SELECT
   CHR(65 + FLOOR(dbms_random.value(0,26)))||
      CHR(65 + FLOOR(dbms_random.value(0,26))) AS alpha,
   CEIL(dbms_random.value(0,676)) AS num,
   ROUND(dbms_random.value(2000, 2013),0) AS batch,
   TRUNC(sysdate) + dbms_random.value(-31,-1) AS timestamp
   FROM dual;

    ALPHA         NUM     BATCH TIMESTAMP
   ======== ========= ========= =========
   ZB             392      2007 03-FEB-13


   rob@XE_11g2> CREATE OR REPLACE PROCEDURE gendata_4dedup
   IS
   BEGIN
   EXECUTE IMMEDIATE 'ALTER TABLE test_dedup1 NOLOGGING';
   -- outer loop: 460 iterations
   FOR i IN 1 .. 460
   LOOP
      -- inner loop: 1000 generated rows per iteration
      FOR j IN 1 .. 1000
      LOOP
         INSERT INTO test_dedup1 VALUES
         (
         CHR(65 + FLOOR(dbms_random.value(0,26)))||CHR(65 + FLOOR(dbms_random.value(0,26))),
         CEIL(dbms_random.value(0,676)),
         ROUND(dbms_random.value(2000, 2013),0),
         TRUNC(sysdate) + dbms_random.value(-31,-1)
         );
      END LOOP;
      COMMIT;
   END LOOP;
   END;
   /

   Procedure created.


The procedure, which involved about 2.3 million DBMS_RANDOM calls and half a million row inserts, executed in a little under 45 seconds on my laptop database, while I was also browsing online. I verified the rowcount: 460,000. Here’s a look at the first 100 or so randomly-generated rows of our table.

view generated table
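For anyone reproducing the exercise, that peek (plus the rowcount check) amounts to something like:

   -- sanity-check the total rowcount, then sample 100 rows
   -- (ROWNUM makes the sample arbitrary, not the "first" inserted rows)
   SELECT COUNT(*) FROM test_dedup1;
   SELECT * FROM test_dedup1 WHERE ROWNUM <= 100;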

Basic Dedup Methods

Before getting to the actual removal of duplicates, let’s examine some characteristics of the generated data. Each of the 676 possible ALPHA_CODE values, as well as each of the 676 NUM_CODE values, is well represented within the set of generated test data. This can be seen in the following distribution results.

view ALPHA_CODE distribution
view NUM_CODE distribution
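The linked distributions come from simple GROUP BY queries; a sketch for ALPHA_CODE (NUM_CODE is analogous):

   -- frequency of each two-letter ALPHA_CODE across the 460,000 rows;
   -- with a uniform generator each count should hover near 680
   SELECT alpha_code, COUNT(*) AS freq
   FROM test_dedup1
   GROUP BY alpha_code
   ORDER BY freq DESC;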


  rob@XE_11g2> SELECT COUNT(UNIQUE alpha_code||num_code) FROM test_dedup1;

  COUNT(UNIQUEALPHA_CODE||NUM_CODE)
  =================================
                             289739

  rob@XE_11g2> SELECT 289739/(26*26*26*26) FROM dual;

  289739/(26*26*26*26)
  ====================
            .634035485

We should not assume, however, that each of the concatenated code possibilities (the product of 676 × 676) has been generated. That’s because we’ve generated only slightly more rows than the 456,976 possible distinct value pairs; and in a typical probabilistic distribution, we’ve not yet had enough rolls of the dice to approach 100% coverage. In fact, with a number of dice rolls corresponding to just under 101% of our potential quantity, we’ve generated about 63% of the distinct possible value pairs. Bottom line: we’re going to have many duplicates, about 170K in fact! The query below looks at the meta-distribution of ALPHA_CODE values in the generated set. Whereas the average frequency for any given ALPHA_CODE is about 680 (460,000 / 676), one value occurs 760 times (the maximum) and another occurs only 595 times (the minimum).
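Incidentally, that 63% figure is just what the standard occupancy (balls-in-bins) estimate predicts. With N = 456,976 equally likely pairs and n = 460,000 random draws, the expected fraction of distinct pairs generated is

   1 - (1 - 1/N)^n  ≈  1 - e^(-n/N)  =  1 - e^(-460000/456976)  ≈  0.6345

which agrees closely with the measured .634035485.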


   WITH count1 AS
     (SELECT COUNT(*) AS alpha_dist, alpha_code
      FROM test_dedup1
      GROUP BY alpha_code)
   SELECT
      COUNT(*) AS meta_count, alpha_dist
      FROM count1
      GROUP BY alpha_dist
      ORDER BY 1 DESC, 2 DESC
   /


                                                                                                                                            view query results 

Now we come to the actual deleting of duplicates. We shall explore three methods. Because we have ancillary column data which we can be fairly confident contains different values even when the ALPHA_CODE and NUM_CODE values are identical, we can use it to help find and remove duplicates. (Note that in all cases below, I make a second copy of the random data table so subsequent methods can be applied freshly. I also build indexes to facilitate the deduping.)
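That fresh-copy step can be handled with a simple CREATE TABLE … AS SELECT; a sketch (the copy’s table name is my own invention):

   -- keep a pristine copy of the generated data, so each dedup
   -- method can start from the same 460,000 rows
   CREATE TABLE test_dedup1_copy AS SELECT * FROM test_dedup1;

   -- to reset between trials, rebuild the working table from the copy
   DROP TABLE test_dedup1 PURGE;
   CREATE TABLE test_dedup1 AS SELECT * FROM test_dedup1_copy;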


   DELETE FROM test_dedup1
   WHERE (alpha_code, num_code) IN
   (
      SELECT alpha_code, num_code FROM test_dedup1
      GROUP BY alpha_code, num_code HAVING COUNT(*) > 1
   )
   AND (timestamp, alpha_code, num_code) NOT IN
   (
      SELECT MAX(timestamp), alpha_code, num_code FROM test_dedup1
      GROUP BY alpha_code, num_code HAVING COUNT(*) > 1
   )
   /

view Dedup session #1

Using ROWIDs

Every Oracle database table is equipped with several pseudocolumns: columns containing system-generated data for internal use or to fulfill special queries. Here’s an overview of database pseudocolumns, and in case you’re interested, here’s a more comprehensive look. The one which matters for our present purposes is ROWID, a database-wide unique row identifier, displayed as an 18-character string. Without relying upon any columns besides those directly relevant to duplication, we can use a simple self-join of the table along with the ROWID to perform the deletions.
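To see what ROWIDs actually look like before deleting by them, a quick illustrative query:

   -- display the ROWID pseudocolumn alongside a few sample rows
   SELECT ROWID, alpha_code, num_code
   FROM test_dedup1
   WHERE ROWNUM <= 3;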


   DELETE FROM test_dedup1 X
   WHERE X.rowid <>
      (SELECT MAX(Y.rowid) FROM test_dedup1 Y
       WHERE Y.alpha_code = X.alpha_code
       AND Y.num_code = X.num_code)
   /

view Dedup session #2

Using Analytic Functions

Yet another way to isolate duplicate rows for removal is with Oracle’s Analytic functions. Both ROW_NUMBER() and RANK() might be plausible, but given our situation there is no need to impose the extra processing implied by RANK (which would have to sort on something explicit within the partition window), so we’ll use ROW_NUMBER. If you want an intro to SQL Analytic Functions, read this. Really we are still relying upon the ROWID pseudocolumn with this method, as directly above. But in this case we are letting analytic syntax describe the grouped sets of duplicates.


   DELETE FROM test_dedup1
   WHERE ROWID IN
       (SELECT bad_clone FROM
           (SELECT
                ROWID AS bad_clone,
                ROW_NUMBER() OVER
                   (PARTITION BY alpha_code||num_code ORDER BY alpha_code||num_code)
                    AS sibling_id
             FROM test_dedup1)
         WHERE sibling_id > 1)
   /

view Dedup session #3
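Whichever method is used, verifying that no duplicates remain takes a single aggregate query; a sketch:

   -- any (alpha_code, num_code) pair still occurring more than once
   -- is a surviving duplicate; after a successful dedup this
   -- query should return no rows
   SELECT alpha_code, num_code, COUNT(*)
   FROM test_dedup1
   GROUP BY alpha_code, num_code
   HAVING COUNT(*) > 1;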

And the Winner Is…

We’ve now seen three different methods for removing duplicated rows within the same table. Which was the most efficient? For each of them I used a fresh copy of the original table, as can be seen in the spooled session results. In all cases I tried two different indexing schemes to support the deduping: 1) two single-column indexes on ALPHA_CODE and NUM_CODE; 2) one concatenated index built on (ALPHA_CODE, NUM_CODE). The results were consistent: in each case 170,261 duplicate rows were deleted, and a check was performed showing that no duplicates remained. Conclusion: Oracle is faster when able to use ROWIDs, independently of whether Analytic functions are used or not. The execution times, via the SET TIMING ON command, were as follows:

                           IN / NOT IN Method    ROWID Method    Analytic Method
   2 Single Indexes:       13 min 03 secs        22 secs         18 secs
   Concatenated Index:     17 min 52 secs        13 secs         14 secs

compare trace plans
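For reference, the two indexing schemes compared in the trials would be built along these lines (index names are my own):

   -- scheme 1: two single-column indexes
   CREATE INDEX td1_alpha_ix ON test_dedup1 (alpha_code);
   CREATE INDEX td1_num_ix ON test_dedup1 (num_code);

   -- scheme 2: one concatenated index
   -- (drop the single-column indexes first when switching schemes)
   CREATE INDEX td1_alpha_num_ix ON test_dedup1 (alpha_code, num_code);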

yeah, baby.


You might wonder why these deduplication trials were conducted using both single-column and concatenated indexes. The results varied depending upon the method chosen. For the first method, the single concatenated index was consistently about 30% slower than the two single-column indexes. For the ROWID method, this result reversed. This raises some questions about composite indexes, leading columns, and how to choose them while balancing SELECT, INSERT, UPDATE and DELETE demands. Which sounds like a good topic for a later post.

~RS
