Hearing the Oracle

Home » Featured » Exploring Analytic Functions : an Iterative Approach

Exploring Analytic Functions : an Iterative Approach

Lake Malawi

Lake Malawi

Suppose some quixotic HR evangelist fresh off a stint of social aid work in Malawi is pushing the bombastically non-Libertarian idea of normalizing compensation patterns within departments as follows: employees who have high tenure seniority contrasted with low salary ‘seniority’ will receive pay adjustments. This is the sort of question that can be explored with Oracle’s analytic SQL functions.

If you inspect the available analytic functions, you’ll note that many of them deal with mathematical statistics, but not all. Analytic functions are like the aggregate functions, in that they operate upon a group of rows. But they are different in that they do not return just one row. Each analytic function can specify it’s own scope, varying from the entire report to any grouping definable by the flexible syntax provided within the OVER clause, also known as the analytic clause. This independent scoping for each analytic column in a query is what really distinguishes the capabilities of analytic functions. We’ll take a stepwise approach to exposing some of these functions and building SQL queries towards the desired result. Syntax:

         analytic_function([ arguments ]) OVER (analytic_clause)

         analytic_clause:
             [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ]

whitespace

Ranking Functions

For our tenure/pay discrepancy problem, we first need to explore the data to survey the differences between HIRE_DATE and SALARY. Several ranking functions will come in handy. Within the OVER clauses below, the PARTITION BY keyword defines the ranking groups as departments, because we want to consider our salary adjustments based upon tenure/pay discrepancies within departments, not company-wide.
w h i t e s p a c e


   SELECT
      department_id AS dept,
      first_name||’ ‘||last_name AS ename,
      salary,
      RANK() OVER
         (PARTITION BY department_id ORDER BY salary DESC) AS payrank,
      DENSE_RANK() OVER
         (PARTITION BY department_id ORDER BY salary DESC) AS dpayrank,
      PERCENT_RANK() OVER
         (PARTITION BY department_id ORDER BY salary ASC) AS pctrank,
      CUME_DIST() OVER
         (PARTITION BY department_id ORDER BY salary ASC) AS cumedist,
   FROM employees
   /

w h i t e s p a c e

                                                                                                                                              view query results 

The difference between RANK and DENSE_RANK is that DENSE_RANK forces strictly ordinal numbers after ranking ties, while RANK skips numbers after ties. Highlighted output for Dept. 60 illustrates this. PERCENT_RANK shows the percentage of employees within the grouping (dept) with lower-ranked salaries. CUME_DIST is very similar, but includes the item itself (ranked employee) in the percentage calculation. Again, inspect Dept. 60 to see the difference. We’ll select DENSE_RANK and PERCENT_RANK for our purposes. Note that some departments are too small, so we’ll impose a minimum eligibility size of 4 staff members. We’ll also curtail the accuracy of the decimal displays to save space. Finally, we’ll add in the tenure information and take an initial stab at building a column which compares seniority and salary rankings.
w h i t e s p a c e


   SELECT
      department_id AS dept,
      first_name||’ ‘||last_name AS ename,
      salary,
      DENSE_RANK() OVER
         (PARTITION BY department_id ORDER BY salary DESC) AS srank,
      100*PERCENT_RANK() OVER
         (PARTITION BY department_id ORDER BY salary ASC) AS psrank,
      sysdate – hire_date AS tenure,
      DENSE_RANK() OVER
         (PARTITION BY department_id ORDER BY hire_date ASC) AS trank,
      100*PERCENT_RANK() OVER
         (PARTITION BY department_id ORDER BY hire_date DESC) AS ptrank,
      100*PERCENT_RANK() OVER (PARTITION BY department_id ORDER BY hire_date DESC)
         - 100*PERCENT_RANK() OVER (PARTITION BY department_id ORDER BY salary ASC)
         AS skew
   FROM employees
   WHERE department_id IN
      (SELECT department_id FROM
         (SELECT COUNT(*), department_id FROM employees
          GROUP BY department_id HAVING COUNT(*) > 3)
      )
   ORDER BY 9 DESC, 1 ASC
   /

w h i t e s p a c e
                                                                                                                                              view query results 

This gives some usable results. We can see from the sorted SKEW column that 12 employees have effectively equal percent ranks for tenure and salary (middle of report). Only employees above this cluster have tenure rankings which are lower, within their departments, than their salary rankings. The top two rows, Austin and Marlow, show extreme ‘skew’: quite high tenure coupled with relatively low salary. Since we are only interested in addressing relatively skewed cases, we will apply the following business rule:

         • Ignore all skews below 25
         • Award a 1% increase to all employees if skew >= 25
         • Award an additional 1% increase (total 2%) if skew >= 50
         • Award an additional 0.5% increase (total 2.5%) if skew >= 75

Before proceeding however, let’s check an alternative method for ranking employees, using NTILEs. Any ranked group, including the entire report, can be chunked into quartiles or quintiles, or N-tiles. We have a few smallish departments in our test data, so it makes sense to look for a happy medium when it comes to tile sizes. Let’s try splitting into three and four groups, respectively, and to make things interesting, let’s limit our study to the smaller departments. We could then consider an alternative adjustment rule such as this:

         • Define skew as salary quartile minus tenure quartile
         • Award a 1% increase to all employees if skew = 2
         • Award an additional 1% increase (total 2%) if skew = 3

w h i t e s p a c e


   SELECT
      department_id, last_name, salary,
      NTILE(4) OVER (PARTITION BY department_id ORDER BY salary DESC) AS s4tile,
      NTILE(3) OVER (PARTITION BY department_id ORDER BY salary DESC) AS s3tile,
      sysdate-hire_date AS tenure,
      NTILE(4) OVER (PARTITION BY department_id ORDER BY hire_date ASC) AS t4tile,
      NTILE(3) OVER (PARTITION BY department_id ORDER BY hire_date ASC) AS t3tile
   FROM employees
   WHERE department_id IN
      (SELECT department_id FROM
         (SELECT COUNT(*), department_id
          FROM employees
          GROUP BY department_id
          HAVING COUNT(*) BETWEEN 4 AND 8)
      )
   ORDER BY 1 ASC, 3 DESC, 6 DESC
   /

w h i t e s p a c e
                                                                                                                                              view query results 

It looks like there’s a granularity problem with quartiles (and tretiles) for our sample data, which becomes clearer with small departments (30, 60, 100) of six people or less. If you compare employees Raphaely and Khoo (highlighted), you’ll see they’ve both been lumped into the 1st quartile despite Raphaely’s salary being more than three times greater than Khoo’s! The opposite issue is seen with tenure data for Hunold and Patabella: they are placed into adjacent tenure quartiles although they have the two closest HIRE_DATE values in their department.

This argues for choosing the PERCENT_RANK skew measure method seen previously. It has the right granularity for the group size, or PARTITION in analytic function parlance (not to be confused with data partitions for very large tables and indexes).

Adjusting the Skew

But the following query shows PERCENT_RANK is not perfect either, just better than NTILE. It’s nearly identical with the previous one, focusing upon the same 17 employees. Look at the percentile ranking for Raphaely and Khoo; you’ll see that although the numbers have improved, Khoo, with nearly $80K less in salary is still within 20% of Raphaely.

                                                                                                                                              view query results 

I included two new columns, JOB_ID and a flag to indicate management status. In our case the adjacent large jumps in salary within departments has to do with these two factors. It’s easy enough to alter our strategy by excluding managers from our salary adjustment study, or alter the study to group people by job titles within departments instead of simply by departments; preferably both. But a quick discussion of our findings with the HR officer results in the spec changing: group employees by department and job title combined, and formulate salary/tenure skews on that basis.

This decision leaves the gap between Hunold and Ernst in Dept. 60 as the only grey area. (Review previous query results.) They share the same job titles, but Hunold is in management, and makes about $50K more. It’s not a problem for our needs however, because Ernst has the lowest tenure ranking in his department and job title class, thus will not be in consideration for this pay adjustment.

Pressing forward, we amend the query as follows:
w h i t e s p a c e


   SELECT
      department_id||job_id AS deptjob,
      last_name,
      salary,
      sysdate – hire_date AS tenure,
      100*PERCENT_RANK() OVER
         (PARTITION BY department_id||job_id ORDER BY salary ASC) AS sprank,
      100*PERCENT_RANK() OVER
         (PARTITION BY department_id||job_id ORDER BY hire_date DESC) AS tprank,
      100*PERCENT_RANK() OVER (PARTITION BY department_id||job_id ORDER BY hire_date DESC)
         - 100*PERCENT_RANK() OVER (PARTITION BY department_id||job_id ORDER BY salary ASC)
         AS prskew
   FROM employees
   WHERE department_id||job_id IN
      (SELECT deptjob FROM
         (SELECT COUNT(*), department_id||job_id AS deptjob
          FROM employees
          GROUP BY department_id||job_id
          HAVING COUNT(*) > 3 )
      )
   ORDER BY 1 ASC, 7 DESC
   /

w h i t e s p a c e
                                                                                                                                              view query results 

Within department-jobs, coded via the concatenated string construct DEPARTMENT_ID||JOB_TITLE, employees are sorted by decreasing percent rank skews between their tenures and salaries. The TENURE column number unit is days. Recalling that our cutoff skew is going to be a minimum 25.0 point differential, we see that only 11 employees will be affected by this pay raise. Now, we’re ready for the final query which lists the adjustments for each affected employee.
w h i t e s p a c e


   column adjustment format $99999.99;
   WITH skewed_pay AS
   (
      SELECT
         department_id||job_id AS deptjob,
         first_name||’ ‘||last_name AS ename,
         salary,
         100*PERCENT_RANK() OVER
            (PARTITION BY department_id||job_id ORDER BY salary ASC) AS sprank,
         sysdate – hire_date AS tenure,
         100*PERCENT_RANK() OVER
            (PARTITION BY department_id||job_id ORDER BY hire_date DESC) AS tprank
         FROM employees
         WHERE department_id||job_id IN
            (SELECT deptjob FROM
               (SELECT COUNT(*), department_id||job_id AS deptjob FROM employees
                GROUP BY department_id||job_id HAVING COUNT(*) > 3 )
            )
   )
   SELECT
      deptjob, ename, salary, tenure,
      tprank – sprank AS skew,
      (CASE
         WHEN tprank-sprank >= 25.0 AND tprank-sprank < 50.0 THEN .01*salary
         WHEN tprank-sprank >= 50.0 AND tprank-sprank < 75.0 THEN .02*salary
         WHEN tprank-sprank >= 75.0 THEN .025*salary
         ELSE 0
      END) adjustment
   FROM skewed_pay
   WHERE tprank – sprank >= 25.0
   ORDER BY 5 DESC
   /

w h i t e s p a c e
                                                                                                                                              view query results 

The Windowing Clause

Let’s summarize how the basic processing for Analytic SQL statements work. The analytic functions are not dealt with until after the basic query has been processed. Once this result set is obtained, the analytic partitions and functions are applied against it as a second pass. Afterwards, any final report-wide sorting is performed.

how analytic queries are processed

how analytic queries are processed

We’ve thus far only looked at simple analytic clauses. They can get much more complex and look like pretty strange SQL with all sorts of extra reserved words. I’ll introduce just two slightly more complex items which utilize windows of related rows. In analytic function syntax, the windowing clause is an optional element after the PARTITION BY and ORDER BY clauses of the full analytic clause. They return a value for each row in the output, which depends on other rows within the corresponding window. With them you can compute and display cumulative, moving, and centered aggregates. The last two are so-called sliding windows which I’ll cover in a subsequent post.

For now, let’s consider that our HR officer is happy with the final query results, except for needing a little more info about how department budgets will be affected. We need to summarize the pay adjustment impacts by department and determine which percentages of the overall adjustment are going to which staff members. We can satisfy the first need with a cumulative windowed aggregate and the second with the RATIO_TO_REPORT analytic function.

Here’s what the completed query will look like, taking just the final SELECT. (To see the entirety, including both named WITH subqueries, click the button below.) The output is also shown.
w h i t e s p a c e


   WITH
   skewed_pay AS
   (
      SELECT. . .
   ),
   adjust_calc AS
   (
      SELECT. . .
   )
   SELECT
      department_id AS dept, job_id, ename, salary, skew, adjustment,
      TO_CHAR(100.00*RATIO_TO_REPORT(adjustment) OVER (), ‘99.99’) AS R_to_R,
      SUM(adjustment) OVER
         (PARTITION BY department_id ORDER BY department_id ASC
         ROWS UNBOUNDED PRECEDING) AS dept_cum
   FROM adjust_calc
   ORDER BY 1 ASC, 4 DESC
   /

w h i t e s p a c e
                                                                                                                                                  view final report 

   hr@XE_11g2> r
    Dept. JOB_ID     ENAME               SALARY   SKEW ADJUSTMENT R_TO_R   DEPT_CUM
   ===== ========== ================== ======= ====== ========== ====== ==========
      30 PU_CLERK   Sigal Tobias         27776   25.0    $277.76   2.79    $277.76
      50 ST_MAN     Payam Kaufling       78368   50.0   $1567.36  15.73   $1567.36
      50 ST_CLERK   James Marlow         24800   52.6    $496.00   4.98   $2063.36
      60 IT_PROG    Valli Pataballa      76186   25.0    $761.86   7.65    $761.86
      60 IT_PROG    David Austin         76186   75.0   $1904.64  19.12   $2666.50
      60 IT_PROG    Diana Lorentz        66662   25.0    $666.62   6.69   $3333.12
      80 SA_REP     Patrick Sully        94240   28.6    $942.40   9.46    $942.40
      80 SA_REP     Allan McEwen         89280   28.6    $892.80   8.96   $1835.20
      80 SA_REP     Lindsey Smith        79360   39.3    $793.60   7.96   $2628.80
      80 SA_REP     Louise Doran         74400   25.0    $744.00   7.47   $3372.80
     100 FI_ACCOUNT Ismael Sciarra       91661   25.0    $916.61   9.20    $916.61
   11 rows selected.

Aggregation Scopes (Partition Windows):      entire report              departments              dept + job_id       
w h i t e s p a c e

A couple of things to notice… For the RATIO_TO_REPORT function, the analytic partition is the entire report, so the syntax collapses to: OVER (). In other words, the PARTITION keyword is unneeded. The meaning of the 3rd row, as an example, is that employee Marlow’s adjustment of $496 per annum represents 4.98% of the entire adjustment total being awarded company-wide. The DEPT_CUM column uses the SUM function applied cumulatively within every department, as indicated by it’s PARTITION clause. The cumulative summing, downwards, is accomplished via the ROWS UNBOUNDED PRECEDING keywords. So, for example, Diana Lorentz is the final employee in the sorting sequence for Dept. 60, thus her adjustment of $666.62 is summed with the previous running department total, yielding a DEPT_CUM value of $3333.12 for IT. The power and flexibility of analytic SQL is seen in that three different aggregation scopes are operating simultaneously in this query output. The column R_TO_R works against the entire report; the DEPT_CUM column applies within department breaks; and the SKEW and ADJUSTMENT columns, which are based upon the hidden underlying PERCENT_RANK columns (in the WITH AS clauses) work against a concatenation of department and JOB_ID breaks.

This tour only touches the surface of what can be done with SQL analytic queries. But it gives an idea of how they work and introduces several functions. At a later time, I’ll delve into more complicated examples.

~RS

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: