Hearing the Oracle

Home » SQL Tips » Visualizing Oracle data: Tree Maps

Visualizing Oracle data: Tree Maps

tree of languages, indo-european

tree of languages, indo-european

I recently wrote about the R statistics and graphics platform, and introduced it as a way to do further analytics or visualizations for data from Oracle databases. This article will take a look at a graphical idea called tree maps. They are an extension of the more basic visualization technique: heat maps. Whereas heat maps use the idea of color intensity to express an extra dimension or quality about the data, tree maps do this in addition to expressing a hierarchy of quantity values.

These types of visualizations have become extremely popular; they can be seen all over the web. This fellow seems to have been a seminal figure and in any case has written a comprehensive historical overview of tree mapping, which he claims to have originated about 1990. Nowadays, everybody from Google Docs to cloud tool BI firms are clamoring for your tree map attention. Here’s a nice source for a test drive. We are going to continue using R for our investigation.

English Wordlists

We’ll continue to explore the dataset previously developed which contains the most common 76,000 English words. Suppose now that we wish to convey some deeper insights about this wordlist. The plot, barplot, and pie chart which we’ve seen all were used to display essentially 2-D data: word counts against alphabet letters. We could imagine adding hierarchical layers to the alphabet letter dimension. Let’s say we want to also capture word count data for all existing initial 2-letter combinations in English? (AB, AC, AD, AF, …) With a tree map graph, we can render this type of data spatially, within the larger category “A”, using inscribed rectangles. The relative size (area) of the rectangles will be proportional to their word counts.

example of a heat map

example of a heat map

This conveys a deeper level of linguistic analysis; you can think of the initial level (first letters only) as being the tree, and the second level (first two letters) as the branches. But we can go further, and describe a third level, the leaves. Suppose we wish to show not only how many English words begin with the letters ‘SC’, but also how many allowable unique 3rd letters exist in English for this group? The answer turns out to be 9: (SCA, SCE, SCH, SCI, SCL, SCO, SCR, SCU, SCY); and this quantity could be conveyed with the color intensity of the corresponding rectangle. You can think of it as a kind of measure of word-building flexibility for any English 2-letter starting sequence.

We need to return to Oracle to compute this data and produce a new CSV. Here’s the appropriate query. We will eliminate small clusters (less than 20 words) as well as all words beginning with Q, X, Y or Z to allow for a more readable tree map.

w h i t e s p a c e

      MAX(SUBSTR(UPPER(lexeme),1,1)) AS first,
      SUBSTR(UPPER(lexeme),1,2) AS first_two,
      COUNT(*) AS wc,
      COUNT(UNIQUE(SUBSTR(UPPER(lexeme),3,1))) AS tuple3
   FROM work_wordlist
      WHERE SUBSTR(UPPER(lexeme),1,1) NOT IN (‘Q’,’Z’,’Y’,’X’)
   GROUP BY SUBSTR(UPPER(lexeme),1,2)
      HAVING COUNT(*) > 20

w h i t e s p a c e

                                                                                                                                             view query results 

The query output is alphabetically sorted; take a look at the entries for ‘CO’ and ‘EF’ to better understand. 2579 English words begin with ‘CO’, and fully 25 different letters occur as the 3rd letter within this wordset. Of all the English letters, only the letter ‘J’ does not appear as a 3rd letter. Contrast this with ‘EF’ which begins only 59 words and permits exactly 1 possible 3rd letter, namely another ‘F’. So, the first two-letter sequence is highly malleable in English spelling, whereas the latter sequence is rigid and determinating. We could express this visually as a bright hue contrasted with a dark hue, and of course many grades of malleability would exist in between these two poles.

R Method map.market()

Having exported the CSV file from this query into the R working directory, (‘first_2letters.csv’), let’s inspect the R code needed to generate the tree map. Note that we’ll need to load a non-standard package into our local library called ‘portfolio’ to draw the tree map. The function called is map.market(), which was originally designed for stock market visuals.

                                                                                                               view map.market documentation 

w h i t e s p a c e

   # tree map needs external library package — only necessary once
   # load package into R memory for present workload — necessary each time
   # grab the data from CSV into R data frame
   df_eng2 <- read.csv("english_first2letters.csv", header=TRUE, sep=",",
   +            as.is=TRUE)
   # execute tree map function call
   map.market(id=df_eng2$FIRST_TWO, area=df_eng2$WC, group=df_eng2$FIRST,
   +            color=df_eng2$TUPLE3, main="English Word Beginnings",
   +            lab=c(TRUE, FALSE) )

w h i t e s p a c e

It’s worthwhile taking a moment to clarify the important arguments for map.market():

         • id=             ~points to scalar of label values for smaller-grain rectangles
         • area=         ~points to scalar of quantity to indicate rectangle sizes
         • group=       ~points to scalar identifying group to which each smaller-grain entry belongs
         • color=        ~which value to indicate via hue intensity for smaller-grain rectangles
         • lab=           ~Boolean vector of 2 values indicating if labels should display for group, id rectangles

Here’s the R output:
w h i t e s p a c e

treemap with first letters labelled

treemap with first letters labelled

w h i t e s p a c e
treemap with first two letters labelled

treemap with first two letters labelled

Just for kicks, I’ve reproduced the same graph twice. Labels have been assigned to the larger rectangles in the first case and to the smaller rectangles in the second.. This was accomplished by flipping the value vector within the lab argument: lab=(FALSE, TRUE). As you can see, tree map #2 needs a little cleanup. The function code can only reduce the fonts within certain limits. To make an interactive tree map with clickable components and mouse-over enlargements would require Java code and is beyond the present scope.
w h i t e s p a c e

R Nerd Addendum

This isn’t really an R blog, rather an Oracle blog. But since I came across an idiosyncratic R error, which looks fairly Martian at first glance, I thought I’d report on it. The error in question, while executing the map.market() function, was:

Error: all(!is.na(id)) is not TRUE

This happened because of the innocent looking 121st entry within the $FIRST_TWO vector, which is “NA”. In R, “NA” means roughly the same thing which NULL means to Oracle SQL. Strangely, this is true even when R encounters “NA” inside of a CSV file (within double quotes!). So, the datapoint for the English words starting with the letters ‘NA’ were being misinterpreted by R. After playing around a little, I was able to bypass this problem by explicitly reassigning the value inside of R. But noticing the problem, and solving it, were not obvious, so I’ve included the relevant R session below, with annotations, in case anyone else is afflicted by this. R bears some resemblance to APL syntactically, and I recall these same kind of obscure errors back in my university days with that language.

                                                                                                                                        view R error session  

Funkier Tree Map Variations

NY Times circular tree map showing household spending shifts

NY Times circular tree map: household spending shifts

The hierarchical quality of many datasets can be more complex, containing nested levels of detail. The de rigeur method for handling this layered aspect of data presentation is with the use of zooming techniques. Normally these would require more code. The example illustrated at right is a circular tree map, taken from the New York Times. It depicts average U.S. consumer spending during a 12-month period between the spring of 2007 and 2008. The link allows you to mouse over every small spending fragment to see a pop-up with the relevant data. This was a particularly volatile period for the U.S. consumer economy: the coloration changes depict percentage shifts in costs during the year, with the darker reddish hue indicating an increase of up to 40% (think eggs and oil)! This PDF describes something called a clock map, with similar zoomability. Finally, here’s an abstract from a French research project laying out the principles for effecting user navigation of complex layered data with hundreds of thousands of elements. I’ve not yet found an R library which can handle these kinds of zoom maps, but let me know if you’ve got one. Likely, the main issue is with the detail pop-ups and mouse sensing — factors which make other tools such as JavaScript more suitable.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: