Hearing the Oracle

Home » Featured » Unpivoting Spreadsheets into Oracle

Unpivoting Spreadsheets into Oracle

Enter email address to receive notifications of new posts by email.

Categories

phingeThe premise is as follows: You’ve got useful data lying in an Excel or Google spreadsheet which you’d like translated into relational table(s). This is likely going to involve normalizing the denormalized spreadsheet data on the way into the database; you are going to end up with (and want) more rows in your table than the corresponding quantity within the spreadsheet. The SQL tool for this is the UNPIVOT keyword. Previous posts have looked at basic and more complex examples of the PIVOT command, new to Oracle 11g syntax. The gist of pivoting is that you want to widen display rows out by converting row data into extra column values for a single grouping row. The original table’s rows are likely in normalized form and the query output rows are hence denormalized. Unpivoting reverses this, and a typical need to do so arises when we want to grab spreadsheet data into the database.
w h i t e s p a c e

A Sample Spreadsheet

Let’s suppose we have a spreadsheet detailing official NFL (American) football scores for the recently concluded 2012 season. Each row in the spreadsheet contains scores and info for one of the 32 teams in the league. The first two columns in each row contains the team and it’s division. These are followed by 16 sets of three columns describing game results. The three columns are the opponent team, the points scored by the team, and the points scored by the opponent. When the opponent team name is preceded by an ‘@’ character, it indicates an away game. To get a normalized table supporting relational queries, we’d want to to convert each spreadsheet row into 16 table rows, one per football game.
                                                                                                                                              view spreadsheet 

Methods for Extracting Spreadsheets

There are lots of ways to achieve connectivity between spreadsheets and Oracle. For Excel, there are explicit 3rd-party tools available to automate the process. If your Oracle database license includes the Data Mining Option, then you will already have a GUI wizard to step you through this ‘BI’ process. SQL Developer also includes a visual spreadsheet interface function. We’re going to directly use the under-the-covers method, which involves .CSV files and external tables.

CSV files are comma-delimited text files, commonly used as the default format for datafiles containing records separated into fields. Within Excel, you can output the entire spreadsheet or any rectangular block of rows and columns into a CSV file via the File/Save As function. The same is true for an Open Office (open source) spreadsheet. And within a GoogleDocs spreadsheet, use: File/Download As/Comma Separated Values. I have performed this step for the spreadsheet above and here’s what the resulting CSV file looks like:
                                                                                                                                                       view CSV file 

The session below illustrates making the CSV text file available to Oracle as an external table.
w h i t e s p a c e

  rob@XE:11g2> CREATE TABLE ext_nfl2012_scores
   (nfl_team VARCHAR2(4), divison VARCHAR2(10),
    opponent1 VARCHAR2(4), points_for1 NUMBER, points_against1 NUMBER,
    opponent2 VARCHAR2(4), points_for2 NUMBER, points_against2 NUMBER,
    opponent3 VARCHAR2(4), points_for3 NUMBER, points_against3 NUMBER,
    . . .
    opponent15 VARCHAR2(4), points_for15 NUMBER, points_against15 NUMBER,
    opponent16 VARCHAR2(4), points_for16 NUMBER, points_against16 NUMBER
   )
    ORGANIZATION EXTERNAL
       (DEFAULT DIRECTORY extdir1
        ACCESS PARAMETERS
           (RECORDS DELIMITED BY NEWLINE
            SKIP 1
            FIELDS TERMINATED BY ','
            MISSING FIELD VALUES ARE NULL
           )
        LOCATION ('nfl2012_csv.txt')
       );

  Table created.
  rob@XE:11g2>
SELECT COUNT(*) FROM ext_nfl2012_scores;
     COUNT(*)
  ==========
          32

w h i t e s p a c e

The reason for the SKIP 1 directive is that we want to suspend datapump processing for the first record in the CSV file which is column header information from the spreadsheet.
w h i t e s p a c e

Unpivot SQL

Now we’ve got the spreadsheet data available to Oracle, but external tables have limitations. You cannot create indexes against them or execute any DML on them. To make the data completely flexible we will transfer it into a conventional database table and use UNPIVOT to structure the columns to our liking. The following SQL will process the data such that a separate row is created for each game result. Note that we’re doing a multicolumn unpivot operation here. Sets of three columns (OPPONENT, POINTS_FOR, POINTS_AGAINST) are being unpivoted for distinct groups of (TEAM, DIV).
w h i t e s p a c e


   CREATE TABLE nfl2012_scores AS
   SELECT
      team,
      SUBSTR(division,1,4) AS div,
      game,
      opponent AS opp,
      DECODE(SUBSTR(opponent,1,1), '@','AWAY', 'HOME') AS venue,
      ptsf,
      ptsa,
      (CASE
          WHEN ptsf > ptsa THEN 'WIN'
          WHEN ptsf < ptsa THEN 'LOSS'
          WHEN ptsf = ptsa THEN 'TIE'
      END) wlt
   FROM
   (
      SELECT * FROM ext_nfl2012_scores
      UNPIVOT
      (
         (opponent, ptsf, ptsa) FOR game IN
         (
            (opponent1, points_for1, points_against1) AS 'wk1',
            (opponent2, points_for2, points_against2) AS 'wk2',
            (opponent3, points_for3, points_against3) AS 'wk3',
            .  .  .
            (opponent15, points_for15, points_against15) AS 'wk15',
            (opponent16, points_for16, points_against16) AS 'wk16'
         )
      )
   );

w h i t e s p a c e

                                                                                                                                              view SQL session 

Sidenote about Virtual Columns: Release 11g has a new feature which allows computed column data to not be stored in the datafiles, rather computed at runtime whenever asked for. This feature, called virtual columns, could save serious space in Data Warehousing scenarios, and carries less internal overhead than views do. I tried applying this technique for the two computed columns in the UNPIVOT example above, (for HOMEAWAY and WLT), but received an ORA-00923 error. Oracle will not allow the on-the-fly DML activity during the SELECT. But you can get around this by adding the computed columns after the table is created. If interested in this technique, have a look at this session listing:

                                                                                                                         see Virtual Column session 

With the original spreadsheet data now normalized in a database table, we are free to issue queries such as the following four. The first is simply a typical year-end standings as you might see in a sports newspaper. The final three, for you NFL stat nerds out there, are tabulations of the most commonly scored point totals by any team during 2012 and the most common margins of victory or point differentials, and finally, a comparison of ‘blowout’ games and ‘tight’ games during the season. A blowout is arbitrarily defined as a game in which the margin of victory or defeat is at least 20 points, while for a close game this same margin must be at most 7 points. All of these queries would have been implausible prior to unpivoting.
w h i t e s p a c e


   /* Ranked team standings by division */
   SELECT
      team, MAX(div) AS div,
      SUM(DECODE(wlt, 'WIN',1.0, 'LOSS',0, 'TIE',0.5, 0)) wins,
      SUM(ptsf) AS pointsf,
      SUM(ptsa) AS pointsa,
      SUM(ptsf)-SUM(ptsa) AS pt_diff
   FROM nfl2012_scores
   GROUP BY team
   ORDER BY 2, 3 DESC, 4 DESC;

w h i t e s p a c e


   /* Most common points per game scored by a team */
   SELECT ptsf AS points_per_game, COUNT(*) AS frequency
   FROM nfl2012_scores
   GROUP BY ptsf HAVING COUNT(*) > 1
   ORDER BY 2 DESC;

w h i t e s p a c e


   /* Most common margins of victory in games */
   SELECT ABS(ptsf-ptsa) AS point_differential, COUNT(*) AS frequency
   FROM nfl2012_scores
   WHERE venue = 'HOME'
   GROUP BY ABS(ptsf-ptsa) HAVING COUNT(*) > 1
   ORDER BY 2 DESC;

w h i t e s p a c e


   /* Comparison of blowouts and close games by team */
   WITH blowouts AS
      (SELECT team,
       SUM(DECODE(wlt, 'WIN', 1, 0)) AS Blowout_W,
       SUM(DECODE(wlt, 'LOSS', 1, 0)) AS Blowout_L
       FROM nfl2012_scores  WHERE ABS(ptsf-ptsa) >= 20
       GROUP BY team),
   close_games AS
      (SELECT team,
       SUM(DECODE(wlt, 'WIN', 1, 'TIE', .5, 0)) AS Close_W,
       SUM(DECODE(wlt, 'LOSS', 1, 'TIE', .5, 0)) AS Close_L
       FROM nfl2012_scores  WHERE ABS(ptsf-ptsa) <= 7
       GROUP BY team)
   SELECT
      C.team, Blowout_W, Blowout_L,
      Blowout_W / (Blowout_w + Blowout_L) AS “Blowout_Pct”,
      Close_W, Close_L,
      Close_W / (Close_W + Close_L) AS “Close_Pct”
   FROM
      blowouts B, close_games C
   WHERE B.team(+) = C.team
   ORDER BY 2 DESC, 3 ASC, 7 DESC;

w h i t e s p a c e

Note: In the final query’s results, Pittsburgh (PIT) is the only NFL team not to have been involved in a blowout game during the 2012 season. This is the reason an outer join, (+), was used in the SQL; otherwise PIT would not have displayed at all in the output. If you’re interested in NFL stats, relating to SQL queries and also to graphic presentations, you might enjoy reading this post, which further explores Oracle and R with NFL data.

                                                                                                                                            view query results 

The XML Alternative?

LTAnother approach to spreadsheet ingestion one could imagine would be using XML. Some spreadsheets permit this export format more easily or directly than others. It would mean producing an XML data stream from the original spreadsheet, then using the XMLtype datatype for the external table. The advantage would be not having to explicitly specify (or even know in advance) column names. The XML variant of the PIVOT command includes an ANY keyword which facilitates this. I see no equivalent of the ANY keyword option with unpivot operations in Oracle SQL, however. You can review Oracle’s SQL base syntax for UNPIVOT here, and an expansion of the dependent unpivot_in_clause here. Perhaps a future release will contain an UNPIVOT XML capability, or maybe there are structural reasons why this is unworkable. If you’ve got any insights concerning this, I’d like to hear about it.

~RS


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: